Hacker News new | past | comments | ask | show | jobs | submit login

They absolutely are faster, because there are optimizations available on data frames that are impossible on RDDs.

Pushdown is the most obvious one. If I don't know what data store is underlying your RDD, I don't know your schema, and I don't know what column you're projecting, pushdown is impossible. I can't know that with an RDD, because all I know when you call map is that you're converting from type A to type B.

Dataframes make that class of optimization possible, because they have more information (your schema, the underlying store), and have more limited operations (select a column, not an arbitrary map operation).




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: