
If your task is too big for pandas, you should probably skip right over dask and polars for a better compute engine.


Jump straight to a cluster and skip Dask?

Not sure what "big" means here, but a combination of .pipe, pyarrow, and polars can speed up many slow Pandas operations.

Polars streaming is surprisingly good for larger than RAM. I get that clusters are cool, but I prefer to keep it on a single machine if possible.

Also, libraries like cudf can greatly speed up Pandas code on a single machine, while Snowpark can scale Pandas code out to Snowflake.


In my experience, Polars streaming runs out of memory at much smaller scales than both DuckDB and DataFusion and tends to use much more memory for the same workload when it doesn't outright segfault.

Polars is faster than those two below a few GB, but beyond that you're better off with DuckDB or DataFusion.

I would love for this to improve in Polars, and I'm sure it will!


Do you mean segfault or OOM? I am not aware of Polars segfaulting on high memory pressure.

If it does segfault, would you mind opening an issue?

Some context: Polars is building a new streaming engine that will eventually run the whole Polars API (also the hard stuff) in a streaming fashion. We expect the initial release late this year or early next year.

Our in-memory engine isn't designed for out-of-core processing, so if you benchmark it on restricted RAM it will perform poorly as data is swapped, or it will go OOM. If you have a machine with enough RAM, Polars is very competitive in performance, and in our experience it is tough to beat on time-series/window functions.


Segmentation violations are often the result of different underlying problems, one of which can be running out of memory.

We (the Ibis team) have opened related issues and the usual response is to not use streaming until it's ready, or to fix the problem if it can be fixed.

Not sure what else there is to do; it seems like things are working as expected/intended for the moment!

We'll definitely be the first to try out any improvements to the streaming engine.


They have different implications for us. An abort due to OOM isn't a bug in our program, whereas a segfault is a serious bug we want to fix.


When you say "our in-memory engine", are you talking about the DataFrame or the LazyFrame?


A DataFrame is our in-memory table. A LazyFrame is a compute plan that can have DataFrames as sources.

The engine is what executes our plans and materializes a result; engines, plural, because we are building a new one.


My understanding is that the Polars team is working on a new streaming engine. It looks like you will get your wish.



