
I read your comment several times, but I still don't understand why you think Hadoop is the key technology for data interchange between organizations. I don't mean to be harsh, but your comment is a bit like buzzword soup (hadoop, etl, cloud, bring the computation to the data).

> [Hadoop] provides the integration point for both slurping the data out of internal databases, and transforming it into consumable form

Hadoop does no such thing. It doesn't "slurp data out of internal databases". It's just a DFS coupled with a MapReduce implementation. Perhaps you're thinking of Hive?

> Currently there are no solutions for transferring data between different organizations’ hadoop installations.

Not all data is "big data". By being myopically hadoop-focused, you're ignoring the real problem, which is data interchange. XML was supposed to be the gold standard; it's debatable how well it has achieved that goal.

> So some publishing technology that would connect hadoop’s HDFS to the .data domain

So basically, forsake all internal business logic, access control, and just pipe your database to the net? When you have a hammer...

> Transferring terabytes of data is non-trivial. But if the data is published to a cloud provider, others can access it without having to create their own copy, and it can be computed upon within the high-speed internal network of the provider

See AWS public datasets for exactly this, but it's still a long shot. It also ignores the problem of data freshness (i.e., once a provider uploads a dataset, they also need to keep updating it). http://aws.amazon.com/publicdatasets/




Let me unpack it for you then.

There is a reason XML, the semantic web, and linked data failed to really change the data world, whereas Hadoop did. The reason is computation.

The problem isn't data interchange formats and ideal representations; the problem is being able to compute with the data. Distributed computation can then be used to solve all the other problems.

Case in point: slurping data out of databases. Apache Sqoop leverages Hadoop's primitives for partitioning and fault tolerance to make massive data transfers out of existing databases easier.
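
Roughly the idea, as a toy sketch in Python (not Sqoop's own code; the table, split column, and mapper count are made up). In practice you drive it from the CLI with something like "sqoop import --connect ... --split-by id -m 4", and each bounded query runs in its own fault-tolerant map task:

    # Toy illustration of Sqoop-style parallel import (not Sqoop itself).
    # The table ("orders"), split column ("id"), and mapper count are made up.
    def split_queries(table, split_col, min_id, max_id, num_mappers):
        """Yield one bounded SELECT per map task, covering [min_id, max_id]."""
        step = max((max_id - min_id + 1) // num_mappers, 1)
        lo = min_id
        while lo <= max_id:
            hi = min(lo + step - 1, max_id)
            yield (f"SELECT * FROM {table} "
                   f"WHERE {split_col} BETWEEN {lo} AND {hi}")
            lo = hi + 1

    for q in split_queries("orders", "id", 1, 1_000_000, num_mappers=4):
        print(q)  # each query becomes one map task writing its slice into HDFS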

Another example of a solution coming from the Hadoop perspective: Avro. It beats the pants off XML as a data interchange format, precisely because it makes computing with the data (which is the ultimate point) easier.
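
For example, a minimal sketch using the third-party fastavro package (the "User" schema and records are invented for illustration): the schema travels with the compact binary file, so readers get typed records back without writing any parsing code.

    # Minimal Avro round-trip using the fastavro package.
    # The "User" schema and the records are invented for illustration.
    from io import BytesIO
    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age",  "type": "int"},
        ],
    })

    records = [{"name": "ada", "age": 36}, {"name": "linus", "age": 41}]

    buf = BytesIO()
    writer(buf, schema, records)   # compact binary; the schema is embedded in the file
    buf.seek(0)
    for rec in reader(buf):        # typed dicts come back out, no hand-written parser
        print(rec["name"], rec["age"])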

Now, there is a reason I called Hadoop the integration point. It is becoming a general-purpose computation system that is, at the same time, the data warehouse for organizational data. So rather than dealing with the details of proprietary commercial systems, programmers can target applications to the open-source Hadoop ecosystem and have those solutions be reusable and customizable at large scale.
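
As a concrete (if toy) sketch of what "targeting the ecosystem" looks like: a word count written as a Hadoop Streaming mapper/reducer pair in Python. Job submission details are omitted; the point is that this is ordinary code against open infrastructure, not a proprietary API.

    # Toy word count as a Hadoop Streaming mapper/reducer pair.
    # Hadoop handles distribution, sorting, and fault tolerance; invocation
    # details (hadoop jar hadoop-streaming.jar -mapper/-reducer ...) are omitted.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                print(f"{word}\t1")

    def reducer(lines):
        # Hadoop delivers mapper output sorted by key, so groupby works here.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(n) for _, n in group)}")

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)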

The "publishing solution" would of course deal with access control, business logic, freshness, etc. That is exactly what I'm advocating be built.

Individual pieces of data may not be big data, but the aggregate problem still is. In fact, this is exactly the Wolfram Alpha case: tons and tons of little datasets that add up to a lot of headache.


Actually, Hadoop is seeing adoption because it can be used within a company's data silo. The semantic web and linked data address heterogeneous data from a technical perspective, but the data still needs to be shared between diverse actors, and that isn't something commercial entities are in the habit of doing.


I think this is unfair to linked data. Linked data could be hidden behind layers upon layers of distributed SPARQL queries, much like how the human-readable web works today, with each entity playing its part. With Hadoop, by contrast, you have to open something like 15 different ports between every box in the setup before you can even begin.
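
Very roughly, something like this (a sketch using the SPARQLWrapper library; the DBpedia endpoint and the query are only an example, and a truly federated setup would chain endpoints together with SERVICE clauses):

    # Sketch of querying a public SPARQL endpoint with the SPARQLWrapper library.
    # DBpedia and the query are illustrative; a federated setup would combine
    # several endpoints with SERVICE clauses instead of a single hop.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?city ?population WHERE {
            ?city a dbo:City ;
                  dbo:populationTotal ?population .
        }
        ORDER BY DESC(?population)
        LIMIT 5
    """)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["city"]["value"], row["population"]["value"])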


Hadoop is an ops and usability disaster. Yet companies large and small are adopting it because it does "something people want".

RDF and ontologies are just more data. Without computation, that data is not useful, and all the things one "could do" with it will not come to pass without a credible computational platform that people actually want to use.

So, IMHO, I would like to see that community focus less on standards, ontologies, and RDF-as-panacea, and more on the infrastructure needed to put the data to work.


Thanks for clarifying; I find this much more insightful.



