
My thoughts about Tk:

1. I think buttons with relief are easier to use, but other than that I am always surprised at how worked up people get about theming (but then I prefer a terminal-based Emacs...).

2. Tcl/Tk has some GREAT ideas that are not implemented anywhere else: the built-in event handler, trivial-to-write callbacks, and the everything-is-a-string philosophy (which is strangely like a self-evaluating Lisp approach).

3. Tcl/Tk had some really bad luck. This weekend at OpenSQLCamp, someone asked Selena Deckleman why PostgreSQL has not had a tenth the market share of MySQL. She said (paraphrasing): in 1999 a lot of the core devs thought the internet was just a fad -- whoops... Maybe the same mistake happened with Tcl/Tk -- it was there just a little too early.

4. Tcl/Tk demands C to become a really full platform, and C is scary to a lot of the new breed of scripters.


Good points, but I'm not sure about the last one. There are plenty of people who use pretty much just Tcl and don't dive into C. This even surprised Ousterhout when he was still involved with things; he thought of it as a glue language when he first created it (in the late '80s).


"Murray and Strout don't understand human nature. People -- especially teenagers -- don't like following pointless rules."

Au contraire -- Murray and Strout understand fully, but as employees of the public school system they are paid to teach their students to follow pointless rules with docility and even joy (or at least the feeling that they are doing the right thing).

(Edit -- I consider it a compliment that I get upvoted and downvoted so often. Teenage rebel gives me a point... humorless school principal takes it away...)


That is true. School is in part supposed to teach you to live with requirements. After all, it should be preparing students for the workforce, and in the real world, if you are working for an employer, they might have similarly strange and seemingly unreasonable requirements.

Yet I guess that the hackers and independent entrepreneurs among us cannot help but sympathize with a rebellion against the status quo, even if it is over something as trivial as the word "meep". What else would cause a grown adult to send a "meep" email to a school principal?

I just find it funny that the thing has turned into a holy war of sorts.


The problem with data.gov and sites like it is that they are built on faulty premises about data:

1. Fiction: Data doesn't require lots of work to make it usable, so we can just upload whatever we have and it will be useful to somebody. Fact: the big usable datasets (Census, IPUMS, NLSY, all the private marketing datasets) have armies of people cleaning and integrating them. It costs money, it takes time, and it is easy to screw up.

2. Fiction: Links are worth something. Fact: links are worthless.

3. Fiction: XML adds value. Fact: ASCII tab-delimited files in consistent formats add value, while XML SUBTRACTS value.

4. Fiction: a good dataset is easy to use. Fact: even a good dataset (Google IPUMS for an example) takes a lot of work to learn to manipulate, presuming one can use some sort of statistical programming language in the first place.

5. Fiction: simple summaries of common data are useful. Fact: everybody has already done the simple summaries. (This is just a bonus item, and doesn't apply to data.gov, but does apply to faulty thinking about data in general.)

6. Fiction: Federated data is just fine. Fact: Data that is curated, cleaned, and integrated into one big monolithic package is FAR better, because an analyst can then learn the conventions and names and such in one piece, and parallel categories are more likely to align.

7. Fiction: Good data is easy for a layperson to use. Fact: good data still requires a lot of skill. Well, maybe in nations with decent public schools a layperson can do something with data, but not in the US.

What I WOULD like is the following (taken from another post, now deleted):

An ideal data.gov would have a lot of staff who put together a few integrated and curated datasets from the agencies. These would be hierarchies of data in a few formats (shp, txt, raster, SQL text dump, and ...?), along with well-written codebooks and narrative READMEs. They would be distributed using git or Subversion. The staff would have the expertise to make such nice data packages for you and me, and they would have the political oomph to demand that the agencies release the data to them. The staff would also give classes on how to use the data with some open source statistical packages to do useful work. Good examples of curated data that I know are IPUMS and the Portland Metro's RLIS (both Google-able).


I don't understand what you're getting at with this list.

1. Yes, datasets need to be cleaned. But you need to have the dataset before you can clean it, and different people will want to clean it in different ways. Get it up there first, and keep the political debates confined to the gathering methods. Griping about raw datasets only gives them an excuse to keep delaying putting anything out (in other words, this critique is actively harming the movement; please stop making it).

2. I don't understand what you mean by this. If a link points to a high-quality dataset that's otherwise hard to find, then it's very valuable.

3. Not all data is expressible in tab-delimited ASCII tables. I'd like my SEC filings in well-structured XML, for instance.

4. This is a strawman. Nobody serious has ever said a good data set is easy to use and understand.

5. Ironically, this is the one point you make I agree with, and then you claim it doesn't apply to data.gov. I think this is actually the worst thing about data.gov right now, that they think they're giving us anything when they post their little summaries. Give us the raw data, please.

6. Isn't this just restating a combination of #1 and #3? Yes, big clean monolithic data sets are nice, but the priority is getting access to the data in the first place.

7. You're restating #4, which was a strawman.


Well-structured XML is almost impossible to beat as a data interchange format (since it was designed for that)... if you can't load XML, a format that's been around since the 1990s, you are using the wrong tools.


OK, we disagree. Except that #4 IS sort of redundant, though I want to make the point that data is almost impossible for a layperson to use, and still really hard for a practiced analyst.


I actually meant it when I said I didn't understand what you were getting at. I initially read it as you saying that there shouldn't be a data.gov at all (because raw data's useless, curated data's expensive and difficult, and simplified data summaries are likely to be misinterpreted by lay people), but that can't be right, so I'm really curious what you were actually trying to say. What would an ideal data.gov look like, to you?


[deleted]


That's in a perfect world. The first step is to get the data out there. Then we can start wondering about structure and presentation. It takes a long time to build a data infrastructure, but you shouldn't stop people who have the skills and interest from getting their hands on it. Hopefully data.gov will work with data producers to become a quality data resource, but at least there is a resource in the first place, a place to go and find material.

Perhaps a model might be people starting from data.gov and then creating different views into the data for different purposes. They don't have to reside on data.gov, and personally that's what I hope the site evolves to. Making data available in reasonable formats that can then be converted into information by other people.


I moved my reply to my top comment and screwed up the reply tree here. So this is for the comment that follows this one:

data.gov is fundamentally flawed, and won't be anything but annoying until it is reworked into something along the lines of what I suggest. Or so I think...


Responding to your edit: I don't think data.gov should be in the business of curating too aggressively. It delays data being released, and it brings in a lot of questions about the politics of how that curation is done. I agree, though, on the need for a variety of release formats, detailed descriptions of the data, open and versioned repositories, and the political clout to get the data all in one place.

I think of data.gov as a layer below IPUMS - you go to data.gov to source your IPUMS-like project (or your company that's built around doing the detailed curation you're talking about).


Interesting point, though I more or less disagree about the level of curation necessary to make a minimally useful product. I think data.gov is FAR below that level, and is just noise so far. But these are practical questions, and I think we agree in principle about a lot.


What do you mean by the third point?

3. Fiction: XML adds value. Fact: ASCII tab-delimited files in consistent formats add value, while XML SUBTRACTS value.


I mean that for all practicing data analysts that I know, XML is a pain in the ass (parsers, XPaths, etc., all to get it into a CSV that you can import), while nice ASCII text is easy to work with.

If you want metadata, a well-written narrative paragraph along with a codebook is INFINITELY better than embedding the metadata in the data.

Furthermore, a lot of supposed metadata in XML is just dross like "<column>blah</column>".

Finally, all the crap in XML drives the signal-to-noise ratio way down; if you do need something that maps to a complex data structure, use JSON or something rational. Such needs are not very common in data analysis, in contrast to web applications; data analysts use multiple tables and are usually pretty close to relational databases and SQL (even if they don't call it that).
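
For what it's worth, the hoop-jumping I mean is roughly this -- a minimal sketch (the file name and tag names are invented) just to flatten a simple XML file into something tab-delimited you can actually import:

    import csv
    import sys
    import xml.etree.ElementTree as ET

    # Hypothetical input: <records><record><name>..</name><age>..</age></record>...</records>
    tree = ET.parse("records.xml")        # file name and tag names are made up
    writer = csv.writer(sys.stdout, delimiter="\t")
    writer.writerow(["name", "age"])      # header row
    for rec in tree.getroot().findall("record"):
        writer.writerow([rec.findtext("name", ""), rec.findtext("age", "")])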


Also, it's incredibly difficult to deal with large (>100 MB) datasets in XML format. Loading the whole thing into RAM for an XML parser is ridiculous. Tab-delimited data is really the best format possible, as you can easily build MapReduce scripts if needed to manipulate it.
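
For example, a streaming-style mapper over tab-delimited input is just a few lines (a sketch; the column layout, with a key in the first field, is made up):

    #!/usr/bin/env python
    # Sketch of a streaming mapper: read tab-delimited lines on stdin and
    # emit "key<TAB>1" pairs for a reducer to sum. Column layout is invented.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print("%s\t1" % fields[0])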


I almost always write my own stream parser with regular expressions to deal with large XML files (especially very regular ones), though it should be noted that there are stream XML parsers.
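
Something along these lines, say -- a rough sketch (the tag and file names are made up) that never builds a tree at all:

    import re

    # Pull one field out of a big, very regular XML file, line by line,
    # without ever loading the whole document into memory.
    pattern = re.compile(r"<income>(.*?)</income>")

    with open("huge_file.xml") as f:
        for line in f:
            for value in pattern.findall(line):
                print(value)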


Knowing regular expressions is an all-around good idea when doing data processing. Steep learning curve, but it pays for itself in increased productivity.

What stream XML parsers do you use? I just get my data ready for Hadoop and let it go.


To be honest, I just kind of think I know that there are stream XML parsers? I've used cElementTree when I have small XML documents and written my own regex for larger ones. (cElementTree is definitely not a stream parser)


I can imagine some circumstances where the hierarchical structure of XML would be useful, but in just about every data processing job I've undertaken that involved XML my first step was to get rid of XML and convert it to something like .csv or ascii tab delimited.


If you need hierarchies, use JSON, IMHO, or keys that reference between tables (the census PUMS data does this with persons nested within households, using two tables).
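
A toy sketch of the two layouts (field names invented): nested JSON-style on one hand, two flat tables joined on a key on the other.

    # Nested, JSON-style structure:
    household = {
        "hh_id": 1,
        "persons": [
            {"name": "A", "age": 40},
            {"name": "B", "age": 9},
        ],
    }

    # PUMS-style: two flat tables that reference each other through hh_id.
    households = [{"hh_id": 1, "tenure": "rent"}]
    persons = [
        {"hh_id": 1, "name": "A", "age": 40},
        {"hh_id": 1, "name": "B", "age": 9},
    ]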


data.gov isn't intended to be any more than a data layer. If you want stuff that a human can consume, turn to one of the data.gov challenge contestants ( http://sunlightlabs.com/contests/appsforamerica2/apps/ ) or http://analyzethe.us .


The only thing I HATE about Python is the mandatory indentation. The problems with it are:

1. It mixes format with logic. Don't we all know this is one of the greatest evils one can commit?

2. Indentation is a bitch to parse compared with braces. It violates Occam's razor of software engineering -- simpler is better. Unnecessary complexity is the second greatest evil in software engineering.

3. Indentation is neither symmetrical nor logically unique. Therefore you can't delete all the formatting from a Python program and expect a formatter to get it right (see the sketch after this list). With symmetric braces a formatter always knows exactly what logical level it should be on by counting the number of closed braces; not possible with Python's goofy system.

4. All that tiresome debate and worry over tabs versus spaces.

5. One can't really say "braces are redundant, not indents" or "indents are redundant, not braces". Redundancy is commutative. I would say braces are for computers, indents are for humans; it is simple to go from braces to indents but not so in the other direction.
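
To make point 3 concrete, a toy sketch: strip the leading whitespace from either version below and both collapse to the same text, so no formatter can recover which nesting was meant; a closing brace would say exactly where the block ends.

    x = True
    def a(): print("a")
    def b(): print("b")

    # Version 1: b() runs only when x is true
    if x:
        a()
        b()

    # Version 2: b() always runs
    if x:
        a()
    b()

    # Remove the indentation and both versions become the same three lines.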

Just a note -- I will always pick Python over any other scripting language. I just think the indentation is a serious mistake; there are so few other bad choices it tends to irk me that much more.


Indentation is logic. Repeating logic is one of the greatest evils one can commit. http://en.wikipedia.org/wiki/Dont_repeat_yourself

Re: parsing: read each line; if its indent is greater than the current indent level, add the line to a new block; else, add the line to the existing block.
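
A rough sketch of that loop (nothing like Python's real tokenizer, just the idea):

    def parse_blocks(lines, indent=0):
        """Group lines into nested lists by their leading spaces."""
        block, i = [], 0
        while i < len(lines):
            cur = len(lines[i]) - len(lines[i].lstrip(" "))
            if cur < indent:            # belongs to an enclosing block
                break
            if cur == indent:
                block.append(lines[i].strip())
                i += 1
            else:                       # deeper indent: recurse into a sub-block
                sub, consumed = parse_blocks(lines[i:], cur)
                block.append(sub)
                i += consumed
        return block, i

    source = ["if x:", "    a()", "    b()", "c()"]
    print(parse_blocks(source)[0])      # ['if x:', ['a()', 'b()'], 'c()']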

Tabs were considered harmful before Python, they're still considered harmful now.

You could add braces to Python quite easily; I think this has already been done in jest a few times...


I would like to see this supported (or not, methinks) by empirical evidence. A priori arguments without evidence aren't worth a damn, in my estimation.


I upvoted you because it is a good point, but ... who is advancing the most quickly in manufacturing these days? That would be China....

Plus, I would hesitate to naturalize the idea of intellectual property as if it were a "thing" just like a cheese sandwich. Some of your argument seems to rest on protecting IP which (somehow) exists; a lot of people would doubt that IP is a valid concept or exists at all.


I would offer that in any regulated system you can excel by ignoring the regulations if others insist on adhering to them.


And the "regulation" in this case is for China and other developing nations to pay a tithe to more developed nations because they got there first and planted a flag on some unclaimed (intellectual) property.


In Python, it seems like it would be trivial to add a __test__() method to a class that was run, well, whenever.
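
Something like this, say, where __test__ is just a convention I'm making up (nothing Python itself recognizes), plus a tiny runner that calls it whenever you like:

    class Stack:
        def __init__(self):
            self.items = []
        def push(self, x):
            self.items.append(x)
        def pop(self):
            return self.items.pop()
        def __test__(self):
            s = Stack()
            s.push(1)
            assert s.pop() == 1

    def run_tests(*classes):
        # Call each class's __test__() "whenever" -- here, on demand.
        for cls in classes:
            if hasattr(cls, "__test__"):
                cls().__test__()
                print(cls.__name__, "ok")

    run_tests(Stack)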


I liked the way he applies moderately useful statistics to a real world application, mostly just as an example of how we all could be using the standard deviation along with the mean.
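
E.g., the basic move is just this (toy numbers): flag anything more than two standard deviations from the mean.

    # Toy example: compute mean and standard deviation, flag outliers.
    data = [12, 14, 13, 15, 14, 90, 13, 12]

    mean = sum(data) / float(len(data))
    std = (sum((x - mean) ** 2 for x in data) / float(len(data))) ** 0.5

    outliers = [x for x in data if abs(x - mean) > 2 * std]
    print(mean, std, outliers)          # 90 stands out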


I know people who work in social services who have noticed a change in the last generation. Before, you would think twice before telling a working class parent that the child misbehaved because the parent might discipline too harshly. Today, the parents always assume the kids are right when they come home whining about how mean the teacher was. Don't know which is better or worse....


Lévi-Strauss is not only probably one of the two most famous anthropologists of all time (Margaret Mead being the other), but his work is also cited by many of the foundational thinkers of cognitive science.

His book "Tristes Tropiques" was also so well written that it would have been awarded the Prix Goncourt except that it was non-fiction.

Probably one of the ten most important thinkers of the 20th century, though not nearly so important today.

http://www.guardian.co.uk/science/2009/nov/03/claude-levi-st...

