As a person who LOVES pandas, numpy, scikit, and all things SciPy, I really wish these kinds of posts would take a few seconds to include a link, or maybe just a quick paragraph, to answer one question:
What the %$#@ is Ray?
I make a habit of doing this myself whenever I do a post like this. Sure, I was able to look up Ray from the Riselab and figure this out myself, but I wish I didn't have to.
From the Ray homepage:
Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. It achieves scalability and fault tolerance by abstracting the control state of the system in a global control store and keeping all other components stateless. It uses a shared-memory distributed object store to efficiently handle large data through shared memory, and it uses a bottom-up hierarchical scheduling architecture to achieve low-latency and high-throughput scheduling. It uses a lightweight API based on dynamic task graphs and actors to express a wide range of applications in a flexible manner.
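For a concrete sense of that task-and-actor API, here is a minimal sketch using Ray's documented `@ray.remote` decorator (the function, class, and values are just illustrative):

    import ray

    ray.init()  # start Ray's scheduler and shared-memory object store locally

    @ray.remote
    def square(x):
        return x * x  # a task: scheduled asynchronously, result stored in the object store

    @ray.remote
    class Counter:
        def __init__(self):
            self.n = 0

        def increment(self):
            self.n += 1
            return self.n  # an actor: a stateful worker addressed via handles

    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))                      # [0, 1, 4, 9]

    counter = Counter.remote()
    print(ray.get(counter.increment.remote()))   # 1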
I agree that they should have included a link to the main repository in the introduction, but the 4th paragraph from TFA describes Ray:
Pandas on Ray is an early stage DataFrame library that wraps Pandas and transparently distributes the data and computation. The user does not need to know how many cores their system or cluster has, nor do they need to specify how to distribute the data. In fact, users can continue using their previous Pandas notebooks while experiencing a considerable speedup from Pandas on Ray, even on a single machine. Only a modification of the import statement is needed, as we demonstrate below. Once you’ve changed your import statement, you’re ready to use Pandas on Ray just like you would Pandas.
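For reference, the import swap the article describes looks roughly like this (using the `ray.dataframe` module name from the announcement; the CSV path is just a placeholder):

    # import pandas as pd           # the usual import
    import ray.dataframe as pd      # Pandas on Ray: same API, work distributed under the hood

    df = pd.read_csv("example.csv")  # hypothetical file; the rest of the notebook stays unchanged
    print(df.head())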
It may be describing Ray, but there's no way to figure that out from reading it. Ray as an entity in its own right isn't mentioned at all - I certainly had no idea what it might be, or whether "Pandas on Ray" was just a quirky name with "on Ray" being a movie reference I didn't get or something.
Ok, it would be quicker, but is it a free lunch? Is there any chance it could introduce bugs in my code and produce wrong results? There's no problem if it doesn't speed things up, or even if it crashes; I can always go back to the original version.
> ... 100's of terabytes of biological data. When working with this kind of data, Pandas is often the most preferred tool
"biological data" is a bit vague, but for the data I know to be that big, sequence and array data, it does not naturally have the structure of a dataframe nor is pandas the tool of choice.
pandas.read_hdf has beaten out ray.dataframe.read_csv in terms of speed on the few files I've tested so far. But I imagine the flexibility CSVs have over HDF5 files (I've never used a Unix command to edit an HDF file, for example) is why this new approach could get some traction.
Try parquet if your data is tabular, pyarrow and related tools are getting parquet up to a pretty comparable speed to hdf5, with arguably more flexibility and a better multithreading story.
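A minimal sketch of that Parquet route via pandas' pyarrow-backed readers (assumes pyarrow and PyTables are installed; the frame and file names are throwaway examples):

    import numpy as np
    import pandas as pd

    # a throwaway tabular frame just to illustrate the round trips
    df = pd.DataFrame(np.random.randn(1_000_000, 4), columns=list("abcd"))

    # Parquet round trip (pandas uses pyarrow as the engine when it is installed)
    df.to_parquet("table.parquet")
    df_pq = pd.read_parquet("table.parquet")

    # HDF5 round trip, for comparison with read_hdf above
    df.to_hdf("table.h5", key="table", mode="w")
    df_h5 = pd.read_hdf("table.h5", "table")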
RE: Spark, we're curious about Ray mostly because of the potential for interactive-time (ms-level) compute for powering user-facing software. RE: Dask, we care that Ray interops with the rest of our stack (Arrow). I haven't evaluated Ray-on-pandas, and Ray was previously focused on powering traditional ML, so again, just first blush on the announcement.
I don't think anything is inherent, more about priorities and momentum. For example, Spark devs have been working on cutting latency, and Conda Inc is/was contributing to the Arrow world. I had assumed the pygdf project would get to accelerating arrow dataframe compute before others, so this announce was a pleasant surprise!
One of the advantages of libraries like Dask is that in the world of "many core" architectures, you incur less overhead than with Spark, especially if you want to schedule large work on a single large machine in the cloud. This in turn lets you transition a workload from single-machine to multi-machine more seamlessly.
The fact that Dask has high-level collections which it knows how to parallelize is also interesting. For workloads built around nd-arrays, matrices, and scientific computing, my understanding is that it is more efficient than Spark (see the dask.array sketch below).
Integration with your ecosystem is also important. If you have to ingest from the (Java) big data ecosystem, for instance, Spark has had a lot of work put into that integration; it just works for the most part.
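As a sketch of what those high-level collections look like for nd-array work (dask.array's documented interface; the shapes and chunk sizes are just illustrative):

    import dask.array as da

    # a large 2D array split into chunks that the local scheduler runs across cores
    x = da.random.random((20000, 20000), chunks=(2000, 2000))

    # this only builds a task graph; nothing is computed yet
    result = (x + x.T).mean(axis=0)

    # executes locally; the same code can later target a distributed cluster
    print(result.compute()[:5])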
We're super excited by the overall project at Graphistry, and hadn't realized there was a ray-on-pandas component! The first line w/ GPU count is intriguing given what we do :) Super excited to try it on our pandas code.
For node/data hackers: Our team is trying to bring the full & accelerated pydata world to JavaScript. We started with Arrow columnar data bindings (https://github.com/apache/arrow/tree/master/js). Next stop is Plasma bindings in node for zero-copy node<>pydata interop. That enables nearly-free calls from node web etc. apps to accelerated pandas-on-ray. If others are interested in contributing, let me/us know!
I've found splitting the dataframe and using the multiprocessing module with `apply` to compute chunks of data to be quite efficient. One can use the `groupby` method for that, or just slice the dataframe.
For example:
    import multiprocessing
    import pandas as pd

    def parallel_apply(fn, df, key, concurrency=4):  # concurrency = number of cores
        # fn is a callable that computes on one chunk; each groupby group is a chunk
        pool = multiprocessing.Pool(processes=concurrency)
        results = pool.map(fn, [chunk for _, chunk in df.groupby(key)])
        pool.close()
        return pd.concat(results)
One of the Pandas on Ray coauthors here: We're planning on releasing another post in the next few weeks discussing the technical details around group-bys. Stay tuned!
https://github.com/ray-project/ray
Unlike Dask, Ray can communicate between processes without serializing and copying data. It uses Plasma, a shared-memory object store that is part of Apache Arrow:
http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-obj...
Worker processes (scheduled by Ray's computation graph) simply map the required memory region into their address space.
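A minimal sketch of that mechanism through Ray's public API (`ray.put` / `ray.get`); the array size is just illustrative:

    import numpy as np
    import ray

    ray.init()  # starts the Plasma shared-memory object store locally

    big = np.zeros(10**8)   # ~800 MB array, stored once in shared memory
    ref = ray.put(big)

    @ray.remote
    def column_sum(x):
        # the worker maps the shared buffer into its address space instead of copying it
        return x.sum()

    print(ray.get(column_sum.remote(ref)))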