Introducing DataFrames in Spark for Large Scale Data Science

super_sloth · on Feb 17, 2015

Has anyone had some good experiences with Spark?

I put several weeks in to moving our machine learning pipeline over to Spark only to find I kept hitting a race condition in their scheduler.

After doing a bit of searching, it seems this is actually a known issue https://issues.apache.org/jira/browse/SPARK-4454 and there's been a fix on their github for a while: https://github.com/apache/spark/pull/3345 and yet in that time two releases have swung by bringing a tonne of features.

I ended up having to drop Spark ultimately because I wasn't confident about putting it in to production (the random OOMs and NPEs during development weren't great either). Does anyone have any positive experiences?

rjurney · on Feb 17, 2015

Spark is less mature than Hadoop, so you will run into issues like this. In my experience, advocating for the bug to get fixed often results in it getting fixed... on a several month timeline. This happened with Avro support in Python. I advocated for the patch and someone supplied it in the next version of Spark.

Lemme tell you though... as someone that has use Hadoop for 5+ years... not waiting 5-10 minutes every time you run new code is worth the trouble. Despite more problems owing to immaturity, or just Spark 'doing less for you' in terms of data validation than other tools like Pig/Hive, if you can get your stuff running on Spark... development is joyous. You just don't have to wait very long during development anymore.

I feel like 5 years of my life were delayed 10 minutes. That did terrible things to my coding that I'm just starting to get over. With Spark I am 10x as productive, and I am limited by my thinking, not the tools.

PySpark in particular is really great.

threeseed · on Feb 17, 2015

Seriously ./spark-shell is a godsend for development.

And I love the fact you can press Tab and get autocompletion of methods.

pwendell · on Feb 17, 2015

Hey - sorry you had a bad experience. That bug was filed as a "minor" issue with only one user ever reporting it, so it didn't end up high up in our triage. We didn't merge the pull request because it was not correct, however, we can just add our own fix for it if it's affecting users. In the future, if you chime in on a reported JIRA, it will escalate it in our process.

super_sloth · on Feb 17, 2015

Sorry, didn't mean to come over completely negative.

I appreciate the work that's gone in to Spark and it's clearly well designed. Developing with Spark after coming from a Hadoop background was a very refreshing experience.

pwendell · on Feb 18, 2015

No worries. Hopefully you'll reconsider using it!

zgm · on Feb 17, 2015

We've recently adopted Spark SQL and our queries run 5 - 200x times faster than with Hive.

Our experience with Spark Streaming, on the other hand, has been mixed. Our streaming app runs stably most of the time (up to 4 days in some cases), but we still see the occasional failure, sometimes with no exception or stack trace indicating what failed.

Our goal is to have a 24/7 streaming service, and Spark has gotten us close to that. There are just a couple of unexplained errors standing in our way.

agibsonccc · on Feb 17, 2015

I'm building distributed ml on top of spark and found it to be good overall. I've had to work out issues with partitioning and mini batching, but I've had a good time so far. The data frame initiative will certainly help things. The JVM ecosystem needs a scientific environment like python (pandas,scipy,..). The potential is there with scala as we're seeing here today.

pixelmonkey · on Feb 17, 2015

Our experience was as a Python shop who was backed into a corner to use Apache Pig for our Hadoop batch jobs.

We decided to rewrite some of those jobs from Pig to PySpark, and though there was a little bit of a learning curve and some sharp edges, the development experience is so much better than Pig that my team is generally happy with the switch.

rjurney · on Feb 17, 2015

PySpark is really compelling for Pig/Python shops. If it weren't for Pig on Spark, I'd fear for Pig's future.

threeseed · on Feb 17, 2015

Spark is pretty fantastic from our perspective. People just think about it in terms of a faster Hadoop MR but it is so much more. The APIs and integration with external systems are so much easier and more intuitive to use.

It really is Hadoop 2.0.

rkrzr · on Feb 17, 2015

We've been using it in production for about half a year now and it's been great. It especially shines when you have iterative algorithms were you e.g. first need to group-by something then process that a bit, then split it out again and process that a bit more etc. This kind of task is just so much faster in Spark than in MapReduce, it doesn't even compare. Also their API is much nicer and so are the underlying ideas (RDDs in particular). I think it will mostly replace MR in the future. If you are starting a new project now I see few reasons to not use Spark if it fits the bill.

TallGuyShort · on Feb 17, 2015

I've used Spark quite successfully for a few small jobs and I love it. I'm sure robustness will improve over time (haven't been bothered by any immaturity myself), but as a system that supports a variety of Big Data processing styles like batch, streaming, graph, ML, etc. so well and is clearly bridging the gap between distributed systems and more traditional analysis languages, I think it's very exciting.

nchammas · on Feb 17, 2015

Btw, there is an new PR by Josh Rosen to fix SPARK-4454 here: https://github.com/apache/spark/pull/4660

kod · on Feb 17, 2015

I've been successfully using Spark in production since 0.7, across three or four significantly different projects.

I don't think I could bring myself to ever write another Hadoop job.

eranation · on Feb 18, 2015

I do, mostly using GraphX, ran a pretty big EC2 cluster using Spark 1.2 (hundreds of cores) and had no significant issues.

elliptic · on Feb 18, 2015

Spark the platform seems awesome. I'm somewhat less convinced by mllib - I'm not sure there are as many use cases for distributed machine learning as people seem to think (and I would bet that a good deal of companies that use distributed ML don't really need it). I've seen a lot of tasks that could be handled by simpler, faster algos on large workstations (you can get 250 GB RAM from AWS for like $4.00/hr). I'd love to hear counterarguments, though!

ogrisel · on Feb 18, 2015

While fitting the algorithm might not often benefit from partitioned data, I see two upsides from using spark for predictive modeling.

First it makes it easy to do the feature extraction and model fitting in the same pipeline, hence make it possible to cross-validate the impact of the hyper-parameters of the feature extraction part. Feature extraction generally starts from a collection of large, raw datasets that needs to be filtered, joined and aggregated (for instance a log of user clicks, sessionized by user id over temporal windows, then geo-joined to GIS data via a geoip resolution of the IP address of the user agent). While the raw datasets of clicks and geographical databases might be too big to be processed efficiently on a single node, the resulting extracted features (e.g. user session statistics enriched with geo features) is typically much slower and could be processed on a single node to build a predictive model. However spark RDDs make it natural to trace the provenance hence trivial to rebuild downstream models when tweaking upstream operations used to extract the features. The native caching features of Spark make that kind of workflow very efficient with minimal boilerplate (e.g. no manual file versionning).

Second, while the underlying ML algorithm might not always benefit from parallelization in itself, there are meta-level modeling operations that are both CPU intensive and embarrassingly parallel and therefore can benefit greatly from a compute cluster such as Spark. The canonical case are cross validation and hyper-parameter tuning.

noelwelsh · on Feb 18, 2015

The benefit of Spark and related systems is you get a flexible infrastructure that can handle a wide range of tasks reasonably well. You pay once for infrastructure, training, devops, and so on.

You can optimise any particular use case to perform better than Spark, but then you are going to incur the above costs for every project you create.

rxin · on Feb 17, 2015

I'm one of the authors of the blog post as well as this new API. Feel free to ask me anything.

bonobo3000 · on Feb 17, 2015

Have there been any changes to the in-memory columnar caching used by SchemaRDDs in 1.2? I noticed some problems with that, for example if a SchemaRDD with cols [1,2,3] on parquet files [X,Y,Z] is cached, and then I create a new one with a subset of the cols say [1,2] on the same files [X,Y,Z], the new SchemaRDDs physical plan would refer to the files on disk instead of an in memory columnar scan. I'm wondering if DataFrames handle this differently and implications for caching.

For some context - In our case, loading a reasonable set of data from HDFS can take upto 10-30 mins so keeping a cached copy of the most recent data with certain columns projected is important.

data_scientist · on Feb 18, 2015

One useful feature is to consider the neighbors rows, for example for differentiating a time series. Do you plan to make an efficient method for that?

hadley · on Feb 18, 2015

How does the API communicate between the client and the server? Any interest in talking to it from R? (I'd be happy to help)

djcater · on Feb 18, 2015

Are there any timelines for when this (and Spark in general) will fully support ORC files (including predicate-pushdown)?

pwendell · on Feb 18, 2015

Very likely in Spark 1.4. Hortonworks has been helping out with this, we just need some internal refacotring to the API to make it work.

century19 · on Feb 17, 2015

Are DataFrames RDDs with a new DSL?

rxin · on Feb 17, 2015

In a way yes. It is a little bit more than that because DataFrames internally are actually "logical plans". Before execution, they are optimized by an optimizer called Catalyst and turn into physical plans.

century19 · on Feb 17, 2015

Normal RDDs won't benefit from this optimisation, only DataFrames? Is that because using this new DSL allows Spark to more precisely plan what needs to happen for DataFrames?

I guess this means DataFrames should be used all the time in the future, or will there still be a reason to use plain RDDs in the future?

You guys are doing great work !

rxin · on Feb 17, 2015

Indeed, DataFrames give Spark more semantic information about the data transformations, and thus can be better optimized. We envision this to become the primary API users use. You can still fall back to the vanilla RDD API (afterall DataFrame can be viewed as RDD[Row]) for stuff that is not expressible with DataFrames.

rkrzr · on Feb 18, 2015

Could you give an example of something that could not be expressed with DataFrames? Would e.g. tree-structured data be a bad fit for DataFrames, since it doesn't fit well with the tabular nature?

rjurney · on Feb 17, 2015

Hey... don't downvote Reynold Xin, author of the post as dupe when he says AMA.

rxin · on Feb 17, 2015

Somehow my comment was removed ... :(

rjurney · on Feb 17, 2015

Re-make it.

homerowilson · on Feb 17, 2015

A replica of this benchmark on my laptop running R has this running in about 1/4 second. Seems like a pretty trivial benchmark?

library(data.table)

x = data.table(a=sample(10,10e6,replace=TRUE),num=sample(100,10e6,replace=TRUE)) t1=proc.time(); x[,sum(num),by=a]; print(proc.time()-t1)

   user  system elapsed
  0.209   0.032   0.245

rxin · on Feb 17, 2015

The example was mostly a toy example. The power really comes when you get interactivity for small data and big data. Using this, you could scale up to TBs of data on a cluster and still get results relatively fast, which is not something you can do with R.

I don't expect at small scale to beat R yet. There are a few low-hanging fruits for single node performance. For example, even for single node data, we incur a "shuffle" to do data exchange in aggregations. This is done to ensure both single node program and distributed program go through the same code path, to catch bugs. If we want to optimize more for single node performance, we can get the optimizer to remove the shuffle operation in the middle, and just run the aggregations. Then this toy example will probably be done in the 100ms range.

rxin · on Feb 18, 2015

Actually I just realized the main difference -- if you load the data from memory, it is much faster, which the case of R.

eranation · on Feb 18, 2015

Native question - how is this different than Spark SQL and things like project zeppelin?

jyotiska · on Feb 18, 2015

This is great news! Where do I see the source for this?

zodvik · on Feb 18, 2015

Since this will be released in Spark 1.3, you can track the dev. in the 1.3 branch on Github https://github.com/apache/spark/tree/branch-1.3

zodvik · on Feb 17, 2015

Will the DataFrame API work with Spark Streaming?

rxin · on Feb 17, 2015

There are some ways you can integrate the two. E.g. Streaming allows you to apply arbitrary RDD transformations, and thus you can pass a physical plan generated by DataFrame into streaming.

We will work on better integration in the future too.