Pandas on Ray – Make Pandas faster (cs.berkeley.edu)
219 points by dsr12 on March 3, 2018 | 24 comments



The one line of code is

  import ray.dataframe as pd
They've wrapped many pandas functions with an identical API that runs the operations in parallel on top of Ray, a task-parallel library:

https://github.com/ray-project/ray

Unlike Dask, Ray can communicate between processes without serializing and copying data. It uses Plasma, the shared-memory object store that ships with Apache Arrow:

http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-obj...

Worker processes (scheduled by Ray's computation graph) simply map the required memory region into their address space.
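A rough sketch of that object-store idea using Ray's public API (not the Pandas-on-Ray internals; the array size and function here are just illustrative):

    import numpy as np
    import ray

    ray.init()  # starts workers plus the shared-memory (Plasma) object store

    # Store a large array once; ray.put returns a handle, not a copy.
    big = np.random.rand(10 ** 7)
    handle = ray.put(big)

    @ray.remote
    def total(arr):
        # Workers map the Plasma buffer into their address space instead of copying it.
        return arr.sum()

    # Both tasks read the same shared-memory object.
    print(ray.get([total.remote(handle), total.remote(handle)]))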


As a person who LOVES pandas, numpy, scikit, and all things SciPy, I really wish these kinds of posts would take a few seconds to include a link, or maybe just a quick paragraph, to answer one question:

What the %$#@ is Ray?

I make a habit of doing this myself whenever I do a post like this. Sure, I was able to look up Ray from the Riselab and figure this out myself, but I wish I didn't have to.

From the Ray homepage:

Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. It achieves scalability and fault tolerance by abstracting the control state of the system in a global control store and keeping all other components stateless. It uses a shared-memory distributed object store to efficiently handle large data through shared memory, and it uses a bottom-up hierarchical scheduling architecture to achieve low-latency and high-throughput scheduling. It uses a lightweight API based on dynamic task graphs and actors to express a wide range of applications in a flexible manner.
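For a feel of that dynamic-task-graph and actor API, here is a minimal sketch (the function and class are made up for illustration):

    import ray

    ray.init()

    # A task: a plain function executed asynchronously by a worker process.
    @ray.remote
    def square(x):
        return x * x

    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 1, 4, 9]

    # An actor: a stateful worker addressed through remote method calls.
    @ray.remote
    class Counter:
        def __init__(self):
            self.n = 0

        def increment(self):
            self.n += 1
            return self.n

    c = Counter.remote()
    print(ray.get(c.increment.remote()))  # 1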

Check out the following links!

Codebase: https://github.com/ray-project/ray
Documentation: http://ray.readthedocs.io/en/latest/index.html
Tutorial: https://github.com/ray-project/tutorial
Blog: https://ray-project.github.io
Mailing list: ray-dev@googlegroups.com


When people write obtusely, I simply read it obtusely and move on with my life.

In my imagination, this article is about some guy named Ray teaching pandas how to run faster.


I agree that they should have included a link to the main repository in the introduction, but the 4th paragraph from TFA describes Ray:

Pandas on Ray is an early stage DataFrame library that wraps Pandas and transparently distributes the data and computation. The user does not need to know how many cores their system or cluster has, nor do they need to specify how to distribute the data. In fact, users can continue using their previous Pandas notebooks while experiencing a considerable speedup from Pandas on Ray, even on a single machine. Only a modification of the import statement is needed, as we demonstrate below. Once you’ve changed your import statement, you’re ready to use Pandas on Ray just like you would Pandas.
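In other words, the advertised change is just the import (the CSV path below is a placeholder):

    # import pandas as pd         # before
    import ray.dataframe as pd    # after: same API, distributed execution

    df = pd.read_csv("example.csv")  # parallelized across cores by Ray
    print(df.sum())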


It may be describing Ray, but there's no way to figure that out from reading it. Ray as an entity in its own right isn't mentioned at all - I certainly had no idea what it might be, or whether "Pandas on Ray" was just a quirky name with "on Ray" being a movie reference I didn't get or something.


No, the article should start with a link explaining what pandas are.



Ok, it would be quicker, but is it a free lunch? Is there any chance that it would introduce bugs in my code and generate wrong calculations? There's no problem if it does not optimize or even if it crashes. I can always go back to the original version.


> ... 100's of terabytes of biological data. When working with this kind of data, Pandas is often the most preferred tool

"biological data" is a bit vague, but for the data I know to be that big, sequence and array data, it does not naturally have the structure of a dataframe nor is pandas the tool of choice.


That opening gave me pause too.

Doesn't Pandas target only structured (tabular) in-core datasets? (Unless you use something like Blaze / Dask / Ray on top.)

Does anyone really work with Pandas on "100's of terabytes of biological data"?


Yeah, I tend to use modified hdf5 tools for data that big.


This is exactly my go-to move as well.

pandas.read_hdf has beaten ray.dataframe.read_csv on speed for the few files I've tested so far. But I imagine the flexibility CSVs have over HDF5 files (I've never used a Unix command-line tool to edit an HDF5 file, for example) is why this new approach could get some traction.
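If you want to reproduce that kind of comparison, a rough timing harness might look like this (the file names and HDF5 key are placeholders):

    import time
    import pandas as pd
    import ray.dataframe as rdf

    start = time.time()
    df_hdf = pd.read_hdf("data.h5", key="table")
    print("pandas.read_hdf:        %.2f s" % (time.time() - start))

    start = time.time()
    df_csv = rdf.read_csv("data.csv")
    print("ray.dataframe.read_csv: %.2f s" % (time.time() - start))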


Try Parquet if your data is tabular; pyarrow and related tools are getting Parquet up to speeds pretty comparable to HDF5, with arguably more flexibility and a better multithreading story.
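A minimal Parquet round-trip with pyarrow, for comparison (the file name and columns are arbitrary):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"a": list(range(1000000)), "b": [1.5] * 1000000})

    # Write via an Arrow table; Parquet stores the data column by column.
    pq.write_table(pa.Table.from_pandas(df), "data.parquet")

    # Reads can prune columns and run multithreaded.
    table = pq.read_table("data.parquet", columns=["a"])
    df_back = table.to_pandas()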


How does Ray compare to Spark? Is there a reason to use Spark once libraries like Dask or Ray become more mature?


Re: Spark, we're curious about Ray mostly because of the potential for interactive-time (ms-level) compute for powering user-facing software. Re: Dask, we care that Ray interops with the rest of our stack (Arrow). I haven't evaluated Pandas on Ray, and Ray was previously focused on powering traditional ML, so again, this is just a first blush on the announcement.

I don't think anything is inherent; it's more about priorities and momentum. For example, Spark devs have been working on cutting latency, and Conda Inc is/was contributing to the Arrow world. I had assumed the pygdf project would get to accelerating Arrow dataframe compute before others, so this announcement was a pleasant surprise!


One of the advantages of libraries like Dask is that in a world of many-core architectures, you incur less overhead than Spark, especially if you want to schedule large workloads on a single big machine in the cloud. This in turn lets you transition a workload from a single machine to multiple machines more seamlessly.

The fact that Dask also has high-level collections which it knows how to parallelize is also interesting. For workloads that revolve around nd-arrays, matrices, and scientific computing, my understanding is that it is more efficient than Spark.
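For a sense of those collections, a small dask.array example where the array is chunked and the reduction runs across cores (the sizes are arbitrary):

    import dask.array as da

    # A large array split into 1000x1000 chunks; nothing is computed yet.
    x = da.random.random((16000, 16000), chunks=(1000, 1000))

    # Builds a task graph over the chunks, then runs it in parallel on .compute().
    result = (x + x.T).mean(axis=0).compute()
    print(result.shape)  # (16000,)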

Integration with your ecosystem is also important. If you have to ingest from the (Java) big data ecosystem, for instance, Spark has had a lot of work put into that integration; it just works for the most part.


For Mac and Linux:

    pip install ray
For windows:

    ¯\_(ツ)_/¯



We're super excited by the overall project at Graphistry, and hadn't realized there was a Pandas-on-Ray component! The first line with the GPU count is intriguing given what we do :) Excited to try it on our pandas code.

For node/data hackers: Our team is trying to bring the full & accelerated pydata world to JavaScript. We started with Arrow columnar data bindings (https://github.com/apache/arrow/tree/master/js). Next stop is Plasma bindings in node for zero-copy node<>pydata interop. That enables nearly-free calls from node web etc. apps to accelerated pandas-on-ray. If others are interested in contributing, let me/us know!



I've found splitting the dataframe and using the multiprocessing module to apply a computation to each chunk to be quite efficient. One can use the `groupby` method for that, or just slice the dataframe.

For example:

    import multiprocessing
    import pandas as pd

    concurrency = 4  # number of cores
    pool = multiprocessing.Pool(processes=concurrency)
    # fn would be a callable for computing on a chunk; groupby yields (key, chunk) pairs.
    chunks = [chunk for _, chunk in df.groupby(...)]
    results = pool.map(fn, chunks)
    pool.close()
    return pd.concat(results)


What's not covered are the crucial operations of group-by and reduce!!!


One of the Pandas on Ray coauthors here: We're planning on releasing another post in the next few weeks discussing the technical details around group-bys. Stay tuned!


Could someone compare this to Dask? Would you use the two in different situations?



