RDDs are not obsolete. The reason dataframes are faster is exactly because they are more restrictive than RDDs. If you need to run arbitrary code, RDDs are still more flexible.
Since every dataframe has a lazy rdd instance, and several dataframe methods simply call the corresponding method on that rdd (e.g. foreach), I am not sure about the "faster" bit of your assertion.
They are faster in the sense that many things you previously had to do in an RDD lambda are now Dataframe operations (which are optimized by the Catalyst optimizer).
So if you want to do one of the operations in the sql.functions package[1], then Dataframes (and Datasets) are very valuable.
If not, then they won't give you much benefit. However, you will still get a small improvement from the Tachyon out-of-JVM-memory framework[2], which I don't think the RDD version has access to.
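To make that concrete, here is a rough sketch (the toy data and column names are made up for illustration) of the same transformation written as an RDD lambda versus as sql.functions operations that Catalyst can reason about:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, upper}

    object RddVsDataFrameSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("rdd-vs-df").master("local[*]").getOrCreate()
        import spark.implicits._

        // Toy data, purely illustrative.
        val df = Seq(("alice", 10), ("bob", 7)).toDF("name", "score")

        // RDD style: an arbitrary lambda; Catalyst cannot look inside it.
        val viaRdd = df.rdd
          .map(row => (row.getString(0).toUpperCase, row.getInt(1) * 2))
          .collect()

        // DataFrame style: the same logic expressed with sql.functions,
        // so Catalyst can analyze and optimize it.
        val viaDf = df
          .select(upper(col("name")).as("name"), (col("score") * 2).as("doubled"))
          .collect()

        viaRdd.foreach(println)
        viaDf.foreach(println)
        spark.stop()
      }
    }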
They absolutely are faster, because there are optimizations available on data frames that are impossible on RDDs.
Pushdown is the most obvious one. If I don't know what data store underlies your RDD, what your schema is, or which columns you're projecting, pushdown is impossible. And with an RDD I can't know any of that, because all I know when you call map is that you're converting from type A to type B.
Dataframes make that class of optimization possible, because they have more information (your schema, the underlying store), and have more limited operations (select a column, not an arbitrary map operation).
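A rough sketch of what that looks like in practice (the Parquet path and column names here are hypothetical): the DataFrame query exposes the filter and the projected columns, so Spark can push column pruning and the predicate into the Parquet scan, while the RDD version only sees opaque lambdas:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object PushdownSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("pushdown-sketch").master("local[*]").getOrCreate()

        // Hypothetical Parquet dataset with columns user_id, country, revenue.
        val df = spark.read.parquet("/data/events.parquet")

        // DataFrame: Spark knows the store, the schema, the predicate and the
        // projected columns, so the filter and column pruning can be pushed
        // down into the Parquet scan (visible in the physical plan).
        df.where(col("country") === "DE")
          .select("user_id", "revenue")
          .explain()

        // RDD: the same logic as opaque lambdas (Row => Boolean, Row => tuple).
        // Spark can't see which columns are used, so full rows are materialized.
        val viaRdd = df.rdd
          .filter(row => row.getAs[String]("country") == "DE")
          .map(row => (row.getAs[Long]("user_id"), row.getAs[Double]("revenue")))

        println(viaRdd.toDebugString)
        spark.stop()
      }
    }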
Thanks for the heads up! We'll make sure to use DataFrames in our next Spark example. As mentioned before, we thought this would be a cool application of Bayesian optimization (especially since the Spark docs advocate the exponentially slower grid search approach, as you pointed out), but we are not Spark experts. We appreciate the feedback!
RDDs themselves, however, are largely obsoleted by DataFrames, since DataFrames are faster and cover most of the functionality of RDDs for common use cases.