
Since every DataFrame lazily exposes an underlying RDD, and several DataFrame methods simply delegate to the corresponding method on that RDD (e.g. foreach), I'm not sure about the "faster" part of your assertion.
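
To make that concrete, here's a minimal sketch against the PySpark API (the local session and toy data are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.range(5)  # a DataFrame with a single 'id' column

    # .rdd lazily exposes the DataFrame's rows as an RDD of Row objects
    rdd = df.rdd

    def show_id(row):
        print(row.id)

    # DataFrame.foreach is documented as shorthand for df.rdd.foreach, so
    # these two calls do the same work (output lands in executor logs on a
    # real cluster)
    df.foreach(show_id)
    df.rdd.foreach(show_id)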



They are faster in the sense that many things you previously had to do in an RDD lambda are now DataFrame operations, which are optimized by the Catalyst optimizer.

So if you want to do one of the operations in the sql.functions package[1], then DataFrames (and Datasets) are very valuable.
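
For example (the toy DataFrame and column names here are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "visits"])

    # RDD style: the lambda is a black box, so Spark can't optimize it
    by_rdd = df.rdd.map(lambda row: (row.name.upper(), row.visits + 1))

    # DataFrame style: F.upper and 'visits + 1' are Catalyst expressions
    # that the optimizer can inspect, rewrite, and code-generate
    by_df = df.select(F.upper(df.name).alias("name"),
                      (df.visits + 1).alias("visits"))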

If not, then they won't give you much benefit. However, you should still see a small improvement from the Tachyon out-of-JVM-memory framework[2], which I don't think the RDD path has access to.
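
As a sketch, opting a DataFrame into off-heap caching is a one-liner, assuming the cluster is actually configured with an off-heap store (Tachyon-backed on 1.x):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.range(1000)

    # Keep the cached blocks outside the JVM heap; where that memory
    # actually lives depends on how the cluster is configured
    df.persist(StorageLevel.OFF_HEAP)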

[1] http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.ht...

[2] https://dzone.com/articles/Accelerate-In-Memory-Processing-w...


They absolutely are faster, because there are optimizations available on data frames that are impossible on RDDs.

Pushdown is the most obvious one. If I don't know what data store underlies your RDD, what your schema is, or which columns you're projecting, pushdown is impossible. And with an RDD I can't know any of that: all I know when you call map is that you're converting from type A to type B.

Dataframes make that class of optimization possible, because they carry more information (your schema, the underlying store) and expose more constrained operations (select a column, rather than run an arbitrary map).
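
Here's a sketch of the contrast; events.parquet and its user_id/year columns are hypothetical, and explain() lets you see the result in the physical plan:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # hypothetical wide Parquet dataset with user_id and year among its columns
    df = spark.read.parquet("events.parquet")

    # Catalyst sees the schema, the projection, and the filter, so the scan
    # reads only the columns it needs and pushes the filter down to Parquet
    users = df.where(df.year == 2016).select("user_id")
    users.explain()  # look for ReadSchema and PushedFilters in the plan

    # the RDD route hides the same logic inside opaque Python functions,
    # so Spark must materialize full rows before they run
    users_rdd = df.rdd.filter(lambda r: r.year == 2016).map(lambda r: r.user_id)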



