The DataFrame is an evolution of the RDD model, one where Spark knows explicit schema information. The core Spark RDD API is very generic and assumes nothing about the structure of the user's data. This is powerful, but ultimately that generic nature imposes limits on how much we can optimize.

DataFrames impose just a bit more structure: we assume that you have a tabular schema, named fields with types, etc. Given this assumption, Spark can optimize a lot of internal execution details and also provide slicker APIs to users. It turns out that a huge fraction of Spark workloads fall into this model, especially since we support complex types and nested structures.
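For a rough sense of the difference, here is a minimal sketch against the Spark 1.3-era Scala API (the file path and field names are just illustrative):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // RDD of opaque strings: Spark knows nothing about their structure.
    val raw: org.apache.spark.rdd.RDD[String] = sc.textFile("people.json")

    // DataFrame: named, typed columns the engine can reason about and optimize.
    val people = sqlContext.jsonFile("people.json")
    people.printSchema()                      // e.g. age: long, name: string (inferred)
    people.filter(people("age") > 21).show()  // expressed against the schema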

Is the core RDD API going anywhere? Nope - not any time soon. Sometimes it really is necessary to drop into that lower level API. But I do anticipate that within a year or two most Spark applications will let DataFrames do the heavy lifting.

In fact, DataFrames and RDDs are completely interoperable: either can be converted to the other. This means that even if you don't want to use DataFrames, you can still benefit from all of the cool input/output capabilities they have, even just to create regular old RDDs.
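A minimal sketch of that round trip, again against the Spark 1.3-style Scala API (the Person case class and file path are hypothetical):

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // RDD -> DataFrame: a case class supplies the schema.
    val rdd = sc.textFile("people.csv")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    val df = rdd.toDF()

    // DataFrame -> RDD: drop back down to an RDD of Rows.
    val rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = df.rdd
    val names = rows.map(_.getString(0))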

> It turns out that a huge fraction of Spark workloads fall into this model, especially since we support complex types and nested structures.

The first step of all my Spark tasks is either "turn this RDD[String] into an RDD of parsed JSON" or "turn this CSV into case classes".
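That hand-rolled step usually looks something like this (json4s is just one common parser choice here; the Event case class and path are made up):

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    case class Event(id: Long, kind: String)

    // RDD[String] of raw JSON lines -> RDD[Event]
    val events: org.apache.spark.rdd.RDD[Event] =
      sc.textFile("events.json").map { line =>
        implicit val formats = DefaultFormats
        parse(line).extract[Event]
      }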

What JSON parser will DataFrames be using? I presume Jackson?
