
Here at ParseHub, we're attacking the problem from another angle. Our tool lets users turn semi-structured HTML (even the complicated, interactive kind) into structured data. The user does this by visually describing relationships within and across pages.

The key idea is to make it easy for another party to add the semantics on top of your data. This solves some fundamental issues that you and Cory Doctorow mentioned:

1) The economics equation for tagging now works out. The user that's doing the tagging has an immediate need (and payoff) for doing that tagging.

A corollary of this is that the parts of the web that are most valuable (in the sense that users need them the most) tend to get tagged first.

The following are responses to Cory's essay:

2.1) The person that's doing the tagging is also an end user, so there's an incentive to do the tagging honestly. That doesn't stop the underlying website from lying. But that's an issue with the web in general, and is mitigated by things like SEO penalties, reviews, etc.

2.2) Again, the tagger is the person who benefits from the tagging, so as long as the data is valuable enough, it will be tagged despite laziness.

2.3) We haven't overcome human stupidity. But presumably, since the person tagging the data needs it, the tagging will be at a "good enough" level to be usable.

2.4) This one doesn't apply; the tagger is a different person.

2.5), 2.6) and 2.7) These are tougher, and we haven't started working on them yet. You have the same problems when trying to consolidate data from multiple sources. One possibility is to keep several alternatives and let the person searching choose between them. That's how Bloomberg solves some of these problems, though it does result in fragmentation.

I'd love to talk to you about this some more. You can email me at serge@parsehub.com

Full disclosure: I'm one of the founders of http://www.parsehub.com




This is a really great answer, I appreciate you taking the time to thoughtfully address Doctorow's issues.

2.5-2.7 are really hard problems. I think lots of people working in the field get lost on these, either by trying to achieve some sort of perfect model or by trying to aggregate every possible option into their model. Neither approach has been terribly satisfactory, and neither provides the kind of subtle decision framework that humans feel comfortable with.


2.5-2.7 might be resolved by decentralization. If everyone could pseudonymously tag anything, and you could ask questions about the tagging of various sets of pseudonyms, these don't seem to be really hard any more. You do need some degree of normalization, but we don't need One True Taxonomy - we can support multiple perspectives on the world. Hopefully, with proper tooling, we can find a few that are sufficiently consistent and useful.
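To make the idea concrete, here is a minimal Python sketch of pseudonymous tagging where you ask what a chosen set of pseudonyms says about a page. Everything in it (the `TagStore` class, its method names, the example URL and pseudonyms) is hypothetical and only illustrates the idea; it is not how ParseHub or any existing system works.

```python
from collections import Counter, defaultdict


class TagStore:
    """Hypothetical store of (pseudonym, tag) pairs per URL."""

    def __init__(self):
        self._tags = defaultdict(list)  # url -> list of (pseudonym, tag)

    def add(self, pseudonym, url, tag):
        # Anyone can tag anything; no central taxonomy is enforced.
        self._tags[url].append((pseudonym, tag))

    def consensus(self, url, trusted):
        # Count only the tags from the pseudonyms the asker chose to trust,
        # and return the most common one (None if they said nothing).
        votes = Counter(
            tag for who, tag in self._tags.get(url, []) if who in trusted
        )
        return votes.most_common(1)[0][0] if votes else None


store = TagStore()
store.add("alice", "http://example.com/a", "recipe")
store.add("bob", "http://example.com/a", "recipe")
store.add("mallory", "http://example.com/a", "spam")

# Two different perspectives on the same page.
print(store.consensus("http://example.com/a", {"alice", "bob"}))
print(store.consensus("http://example.com/a", {"mallory"}))
```

The point of the design is that the answer depends on whose tags you ask about, so several perspectives can coexist without agreeing on one taxonomy.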


Re 2.5), 2.6) and 2.7): To categorize things in a way that lets you communicate the idea, you need a common denominator that is universal and neutral enough to rely on. I think there's one place where this work has already been done for you: Wikipedia!

While watching the explanation of the differential gear (https://news.ycombinator.com/item?id=8513209), I thought: why not make Wikipedia the central axis around which the diversity of the semantic web spins at its own pace? If most people agree on this authority (convention over a configuration mess, if you like), things become easy to connect.

In other words, instead of relying on a sloppy ontology, rely on wikidata_id as a sort of referential association table.
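A rough Python sketch of what that could look like: each source keeps its own labels but maps them to a Wikidata id, and the id acts as the join key. The two source dictionaries and the `merge_by_wikidata_id` helper are made up for illustration; Q42 is the Wikidata id for Douglas Adams.

```python
from collections import defaultdict

# Hypothetical local vocabularies: each source maps its own label to a Wikidata id.
SOURCE_A = {"D. Adams": "Q42"}
SOURCE_B = {"Douglas Adams (writer)": "Q42"}


def merge_by_wikidata_id(*sources):
    """Group every source's labels under the Wikidata id they point to."""
    merged = defaultdict(set)
    for source in sources:
        for label, qid in source.items():
            merged[qid].add(label)
    return dict(merged)


# Both labels end up under the same id, even though the sources never
# agreed on a shared ontology.
print(merge_by_wikidata_id(SOURCE_A, SOURCE_B))
```

Each source is free to name things however it likes; only the mapping to the shared id has to be agreed on.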



