
I read your comment several times, but I still don't understand why you think Hadoop is the key technology for data interchange between organizations. I don't mean to be harsh, but your comment is a bit like buzzword soup (hadoop, etl, cloud, bring the computation to the data).

> [Hadoop] provides the integration point for both slurping the data out of internal databases, and transforming it into consumable form

Hadoop does no such thing. It doesn't "slurp data out of internal databases". It's just a DFS coupled with a MapReduce implementation. Perhaps you're thinking of Hive?

> Currently there are no solutions for transferring data between different organizations’ hadoop installations.

Not all data is "big data". By being myopically hadoop-focused, you're ignoring the real problem, which is data interchange. XML was supposed to be the gold standard; it's debatable how well it has achieved that goal.

> So some publishing technology that would connect hadoop’s HDFS to the .data domain

So basically, forsake all internal business logic, access control, and just pipe your database to the net? When you have a hammer...

> Transferring terabytes of data is non-trivial. But if the data is published to a cloud provider, others can access it without having to create their own copy, and it can be computed upon within the high-speed internal network of the provider

See AWS public datasets for exactly this, but it's still a long shot. It also ignores the problem of data freshness (i.e., once a provider uploads a dataset, they also need to keep updating it). http://aws.amazon.com/publicdatasets/




Let me unpack it for you then.

There is a reason XML, the semantic web, and linked data failed to really change the data world, whereas Hadoop did. The reason is computation.

The problem isn't data interchange formats and ideal representations; the problem is being able to compute with the data. Distributed computation can then be used to solve all the other problems.

Case in point: slurping data out of databases. Apache Sqoop leverages Hadoop's primitives for partitioning and fault tolerance to make massive data transfers out of existing databases easier.
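
Roughly the idea, as a toy sketch in Python (not Sqoop's own code; the table, split column, and mapper count are made up). In practice you drive it from the CLI with something like "sqoop import --connect ... --split-by id -m 4", and each bounded query runs in its own fault-tolerant map task:

    # Toy illustration of Sqoop-style parallel import (not Sqoop itself).
    # The table ("orders"), split column ("id"), and mapper count are made up.
    def split_queries(table, split_col, min_id, max_id, num_mappers):
        """Yield one bounded SELECT per map task, covering [min_id, max_id]."""
        step = max((max_id - min_id + 1) // num_mappers, 1)
        lo = min_id
        while lo <= max_id:
            hi = min(lo + step - 1, max_id)
            yield (f"SELECT * FROM {table} "
                   f"WHERE {split_col} BETWEEN {lo} AND {hi}")
            lo = hi + 1

    for q in split_queries("orders", "id", 1, 1_000_000, num_mappers=4):
        print(q)  # each query becomes one map task writing its slice into HDFS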

Another example of a solution coming from the Hadoop perspective: Avro. It beats the pants off XML as a data interchange format, precisely because it makes computing with the data (which is the ultimate point) easier.
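
For example, a minimal sketch using the third-party fastavro package (the "User" schema and records are invented for illustration): the schema travels with the compact binary file, so readers get typed records back without writing any parsing code.

    # Minimal Avro round-trip using the fastavro package.
    # The "User" schema and the records are invented for illustration.
    from io import BytesIO
    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age",  "type": "int"},
        ],
    })

    records = [{"name": "ada", "age": 36}, {"name": "linus", "age": 41}]

    buf = BytesIO()
    writer(buf, schema, records)   # compact binary; the schema is embedded in the file
    buf.seek(0)
    for rec in reader(buf):        # typed dicts come back out, no hand-written parser
        print(rec["name"], rec["age"])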

Now, there is a reason I called Hadoop the integration point. It is becoming a general-purpose computation system that is, at the same time, the data warehouse for organizational data. So rather than dealing with the details of proprietary commercial systems, programmers can target applications to the open-source Hadoop ecosystem and have those solutions be reusable and customizable at large scale.
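
As a concrete (if toy) sketch of what "targeting the ecosystem" looks like: a word count written as a Hadoop Streaming mapper/reducer pair in Python. Job submission details are omitted; the point is that this is ordinary code against open infrastructure, not a proprietary API.

    # Toy word count as a Hadoop Streaming mapper/reducer pair.
    # Hadoop handles distribution, sorting, and fault tolerance; invocation
    # details (hadoop jar hadoop-streaming.jar -mapper/-reducer ...) are omitted.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                print(f"{word}\t1")

    def reducer(lines):
        # Hadoop delivers mapper output sorted by key, so groupby works here.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(n) for _, n in group)}")

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)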

The "publishing solution" would of course deal with access control, business logic, freshness, etc. That is exactly what I'm advocating be built.

Individual pieces of data may not be big data, but the aggregate problem still is. In fact, this is exactly the Wolfram Alpha case: tons and tons of little datasets that add up to a lot of headache.


Actually, Hadoop is seeing adoption because it can be used within a company's data silo. The semantic web and linked data address heterogeneous data from a technical perspective, but the data still needs to be shared between diverse actors, and that isn't something commercial entities are in the habit of doing.


I think this is unfair to linked data. Linked data could be hidden behind layers upon layers of distributed SPARQL queries, much like how the human-readable web works today, with each entity playing its part. With Hadoop, by contrast, you have to open something like 15 different ports between every box in the setup before you can even begin.
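
Very roughly, something like this (a sketch using the SPARQLWrapper library; the DBpedia endpoint and the query are only an example, and a truly federated setup would chain endpoints together with SERVICE clauses):

    # Sketch of querying a public SPARQL endpoint with the SPARQLWrapper library.
    # DBpedia and the query are illustrative; a federated setup would combine
    # several endpoints with SERVICE clauses instead of a single hop.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?city ?population WHERE {
            ?city a dbo:City ;
                  dbo:populationTotal ?population .
        }
        ORDER BY DESC(?population)
        LIMIT 5
    """)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["city"]["value"], row["population"]["value"])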


Hadoop is an ops and usability disaster. Yet companies large and small are adopting it because it does "something people want".

RDF and ontologies are just more data. Without computation, that data is not useful, and all the things one "could do" with it will not come to pass without a credible computational platform that people actually want to use.

So, IMHO, I would like to see that community focus less on standards, ontologies, and RDF-as-panacea, and more on the infrastructure needed to put the data to work.


Thanks for clarifying; I find this much more insightful.



