
They are faster in the sense that many things you previously had to do in an RDD lambda are now DataFrame operations, which the Catalyst optimizer can optimize.

So if you want to do one of the operations in the sql.functions package[1], then DataFrames (and Datasets) are very valuable.

If not, they won't give you much benefit. However, you will still get a small improvement from the Tachyon out-of-JVM-memory framework[2], which I don't think the RDD version has access to.

[1] http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.ht...

[2] https://dzone.com/articles/Accelerate-In-Memory-Processing-w...



