I’ve moved entirely to Polars (which is essentially Pandas written in Rust with new design decisions), with DuckDB as my SQL query engine. Since both are backed by Arrow, interop is zero-copy and performance on large datasets is super fast (due to vectorization, not just parallelization).

I keep Pandas around for quick plots and legacy code. I will always be grateful for Pandas because there truly was no good dataframe library during its time. It has enabled an entire generation of data scientists to do what they do and built a foundation — a foundation which Polars and DuckDB are now building on and have surpassed.




(Taken from an old comment of mine)

If you were to say “pandas in long format only” then yes, that would be correct, but the power of pandas comes from its ability to work in a long relational or wide ndarray style. Pandas was originally written to replace Excel in financial/econometric modeling, not as a replacement for SQL. Models written solely in the long relational style are nearly unmaintainable for constantly evolving models with hundreds of data sources and thousands of interactions being developed and tuned by teams of analysts and engineers. For example, this is how some basic operations would look.

Bump prices in March 2023 up 10%:

    # pandas
    prices_df.loc['2023-03'] *= 1.1

    # polars
    from datetime import date

    polars_df.with_columns(
        pl.when(pl.col('timestamp').is_between(
            date(2023, 3, 1),
            date(2023, 3, 31),
            closed='both'
        )).then(pl.col('val') * 1.1)
        .otherwise(pl.col('val'))
        .alias('val')
    )
Add expected temperature offsets to the base temperature forecast at the state/county level:

    # pandas
    temp_df + offset_df

    # polars
    (
        temp_df
        .join(offset_df, on=['state', 'county', 'timestamp'], suffix='_r')
        .with_columns(
            (pl.col('val') + pl.col('val_r')).alias('val')
        )
        .select(['state', 'county', 'timestamp', 'val'])
    )
Now imagine thousands of such operations, and you can see the necessity of pandas in models like this.


Point taken, but most data wrangling these days, especially at scale, is of the long and thin variety (also known as 3rd normal form or tidy format, which actually allows for more flexibility if you think in terms of coordinatized data theory), where aggregations and joins dominate column operations. (Pandas also allows array-like column operations thanks to its index, but there are other ways to achieve the same thing.)

I typically do the type of column operation in your example only on subsets of data, and usually I do it in SQL using DuckDB. Interop between Polars and DuckDB is virtually zero cost, so I seamlessly move between the two. And to be honest I don’t remember the last time I needed to do this, but that’s just the nature of my work and not a generalized statement.
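
As a concrete (made-up) sketch of that back-and-forth, assuming recent duckdb and polars versions where DuckDB can scan an in-scope Polars dataframe and hand the result back via Arrow:

    import duckdb
    import polars as pl

    # hypothetical data; DuckDB can scan the Polars frame directly by variable name
    prices = pl.DataFrame({
        'state': ['CA', 'CA', 'NY'],
        'month': ['2023-03', '2023-04', '2023-03'],
        'val': [1.0, 2.0, 3.0],
    })

    # do the relational step in SQL, get a Polars frame back
    monthly = duckdb.sql(
        "SELECT state, month, sum(val) AS val FROM prices GROUP BY state, month"
    ).pl()

    # continue in Polars expression style
    monthly = monthly.with_columns((pl.col('val') * 1.1).alias('val'))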

But yes, if you are still in a world where you need to perform Excel-like operations, then I agree.


This is far more elegant in pandas due to the implicit behavior of the index.

But you can move the explicitness of polars behind a function. A more explicit API should not hurt maintainability if we structure our code right.


So something like this?

    def add(df1, df2, meta_cols, val_cols=None):
        # join on meta cols, add val cols (default to all non-meta cols),
        # and return a df with all meta and val cols selected
        val_cols = val_cols or [c for c in df1.columns if c not in meta_cols]
        joined = df1.join(df2, on=meta_cols, suffix='_r')
        return joined.with_columns(
            [(pl.col(c) + pl.col(c + '_r')).alias(c) for c in val_cols]
        ).select(meta_cols + val_cols)
In theory I think that's fine. The problem is that in practice this will cause a lot of visual noise in your models, since for every operation you would need to specify at least your meta columns, and potentially value columns too. If you change the dimensionality of your data, you would need to update everywhere you've specified them. You could get around this a bit by defining the meta columns in a constant, but that's really only maintainable at a global module level. Once you start passing dfs around, you'll have to pass the specified columns around with the df as well. There's also the problem that you'd need to use functions instead of standard operators.

One thing that would be nice is to set an index (and forgive me, I understand the aversion to the word "index") on the polars dataframe. Not a real index, just a list of columns that are specified as "metadata columns". This wouldn't actually affect any internal state of the data, but what it would do is affect the path of certain operations. Like if an "index" is set, then `+` does the join from above rather than the current standard `+` operation.
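
A very rough sketch of what that could look like as a userland wrapper (everything here is hypothetical; it just stores the "metadata columns" next to the frame and makes `+` do the join-then-add from above):

    import polars as pl

    class IndexedFrame:
        # hypothetical wrapper: a Polars dataframe plus a list of "metadata columns"
        def __init__(self, df: pl.DataFrame, meta_cols: list[str]):
            self.df = df
            self.meta_cols = meta_cols

        def __add__(self, other: 'IndexedFrame') -> 'IndexedFrame':
            # value columns are everything that isn't metadata
            val_cols = [c for c in self.df.columns if c not in self.meta_cols]
            joined = self.df.join(other.df, on=self.meta_cols, suffix='_r')
            out = joined.with_columns(
                [(pl.col(c) + pl.col(c + '_r')).alias(c) for c in val_cols]
            ).select(self.meta_cols + val_cols)
            return IndexedFrame(out, self.meta_cols)

    # usage: set the "index" once, then `+` joins on it implicitly
    # temps = IndexedFrame(temp_df, ['state', 'county', 'timestamp'])
    # result = temps + IndexedFrame(offset_df, ['state', 'county', 'timestamp'])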

In any case I realize this is a major philosophical divergence from the polars way of thinking, so I'm more just shooting the shit than offering real suggestions.


Now imagine the other side of this equation, where pandas seems too clunky. Behold YOLOPandas (https://pypi.org/project/yolopandas/), i.e. `df.llm.query("What item is the least expensive?")`


DuckDB is not backed by Arrow, but by something similar enough that it can be zero-copy to Arrow in many cases: https://duckdb.org/2021/12/03/duck-arrow.html


TFA: Pandas 2.0 is also backed by Arrow as an option, yielding a large speed improvement, although not all the way to Polars


How well does Polars play with many of the other standard data tools in Python (scikit learn, etc.)? Do the performance gains carry over?


It works as well as Pandas (realize that scikit-learn actually doesn’t support Pandas; you have to cast a dataframe into a NumPy array first)

I generally work with Polars and DuckDB until the final step, when I cast the result into the data structure I need (a Pandas dataframe, Parquet, etc.)

All the expensive intermediate operations are taken care of in Polars and DuckDB.

Also a Polars dataframe — although it has different semantics — behaves like a Pandas dataframe for the most part. I haven’t had much trouble moving between it and Pandas.


You have not had to cast pandas DataFrames when using scikit-learn for many years now. Additionally, in recent versions there has been increasing support for also returning DataFrames, at least with transformers, and for checking column names/order.
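
For example (a minimal sketch, assuming scikit-learn >= 1.2, where set_output landed):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

    # transformers accept DataFrames directly and can return them as well
    scaler = StandardScaler().set_output(transform='pandas')
    scaled = scaler.fit_transform(df)  # a DataFrame with the original column names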


Yes, that support is still not complete. When you pass a Pandas dataframe into scikit-learn you are implicitly doing df.values, which loses all the dataframe metadata.

There is a library called sklearn-pandas, but it doesn’t seem to be mainstream and development appears to have stopped in 2022.


From pandas-dataclasses #166 "ENH: pyarrow and optionally pydantic" https://github.com/astropenguin/pandas-dataclasses/issues/16... :

> What should be the API for working with pandas, pyarrow, and dataclasses and/or pydantic?

> Pandas 2.0 supports pyarrow for so many things now, and pydantic does data validation with a drop-in dataclasses.dataclass replacement at pydantic.dataclasses.dataclass.

Model output may or may not converge given the enumeration ordering of Categorical CSVW columns, for example; so consistent round-trip (Linked Data) schema tool support would be essential.

CuML is scikit-learn API compatible and can use Dask for distributed and/or multi-GPU workloads. CuML is built on CuDF and CuPy; CuPy is a replacement for NumPy arrays on GPUs, with up to 100x relative performance on some operations.

CuPy: https://github.com/cupy/cupy :

> CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. CuPy acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms.

https://cupy.dev/ :

> CuPy is an open-source array library for GPU-accelerated computing with Python. CuPy utilizes CUDA Toolkit libraries including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN and NCCL to make full use of the GPU architecture.

> The figure shows CuPy speedup over NumPy. Most operations perform well on a GPU using CuPy out of the box. CuPy speeds up some operations more than 100X. Read the original benchmark article Single-GPU CuPy Speedups on the RAPIDS AI Medium blog
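
The drop-in pattern looks roughly like this (a sketch that assumes a CUDA or ROCm GPU with cupy installed):

    import numpy as np
    import cupy as cp

    x_cpu = np.random.rand(1_000_000)
    x_gpu = cp.asarray(x_cpu)       # copy to GPU memory
    y_gpu = cp.sqrt(x_gpu) * 2.0    # same ufunc-style API as NumPy, runs on the GPU
    y_cpu = cp.asnumpy(y_gpu)       # copy the result back to the host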

CuDF: https://github.com/rapidsai/cudf

CuML: https://github.com/rapidsai/cuml :

> cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions that share compatible APIs with other RAPIDS projects.

> cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.

> For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.

FWICS there's now a ROCm version of CuPy, so it says CUDA (NVIDIA only) but also compiles for AMD. IDK whether there are plans to support Intel OneAPI, too.

Which of the non-Arrow parts of other DataFrame libraries (pandas-compatible or not) could be ported back to Pandas (and R)?


My experience is that Polars generally isn't supported by third-party libraries. Hopefully that'll change soon, but in the meantime it has some pretty snappy to/from-Pandas functionality, so converting when you hit a library that needs Pandas often still brings a good deal of speedup.
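
That conversion is a one-liner in each direction (a sketch; the file name is made up and pyarrow is assumed to be installed):

    import polars as pl

    pl_df = pl.read_parquet('events.parquet')  # heavy lifting in Polars
    pd_df = pl_df.to_pandas()                  # hand off to a Pandas-only library
    pl_df2 = pl.from_pandas(pd_df)             # convert back when done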


To be honest I haven’t worked with any third-party libraries that support Pandas directly; most ML libs like scikit-learn take NumPy arrays (not Pandas dataframes) as inputs, so I’ve always had to cast from Pandas to NumPy anyway.

But yes, Polars and DuckDB can easily cast to Pandas and also read Pandas dataframes in memory. I have some legacy data transformations that are mostly DuckDB but involve some intermediate steps in Pandas (because I didn’t want to rewrite them) and it’s all seamless (though not zero copy as it would be in a pure Arrow workflow).

And ironically DuckDB can query Pandas dataframes faster than Pandas itself due to its vectorized engine.
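
For example (a sketch with a made-up dataframe; recent duckdb versions pick up in-scope Pandas dataframes by variable name):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({'state': ['CA', 'NY', 'CA'], 'val': [1.0, 3.0, 2.0]})

    # DuckDB scans the Pandas dataframe directly, no explicit registration needed
    out = duckdb.sql("SELECT * FROM df WHERE val > 1.5 ORDER BY val DESC").df()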


If you want to follow all the news related to Polars, have a look at (and star) the Awesome Polars list: https://github.com/ddotta/awesome-polars


> It has enabled an entire generation of data scientists to do what they do

The SQL you're finally using in 2023 has enabled data scientists to do what they do for decades. Pandas was a massive derailment and distraction from what otherwise would have been called progress.


To be fair, pandas was never meant to be a replacement for SQL; it was meant to be a replacement for Excel in financial models, which it still excels at (pun intended).


How are you running your SQL queries?

If you use notebooks: my team is working on JupySQL, a tool to improve the SQL experience in Jupyter. https://github.com/ploomber/jupysql


I’m using DuckDB in Jupyter and Python. DuckDB is the SQLite equivalent for complex analytic queries on columnar data (Parquet, CSV, etc.)
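
For instance, you can point it straight at files without a separate load step (the path here is hypothetical):

    import duckdb

    # query Parquet in place; only the needed columns and row groups are read
    duckdb.sql("""
        SELECT state, avg(val) AS avg_val
        FROM 'data/prices/*.parquet'
        GROUP BY state
    """).df()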


As someone who mainly uses pandas, what is the benefit of using DuckDB to write your queries over using pandas (or polars) to operate on the data? Is it that you can already subset the data without loading it into memory?


I use DuckDB because I can express complex analytic queries better using a SQL mental model. Most software people hate SQL because they can never remember its syntax, but for data scientists, SQL lets us express our thoughts more simply and precisely in a declarative fashion — which gets us query plan optimization as a plus.

People have been trying to get rid of SQL for years yet they only end up reinventing it badly.

I’ve written a lot of code and the two notations I always gravitate toward are the magrittr + dplyr pipeline notation and SQL.

The chained-methods notation is a bit too unergonomic, especially for expressing window functions and complex joins.

Spark started out with method chaining but eventually found that most people used Spark SQL.


In addition to this, here's one really specific case: ever had a pandas groupby().apply() that took forever, often spending most of the time re-aggregating after the apply?

With columnar data, DuckDB is so much faster at this.
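
A rough illustration of the contrast (made-up data; the apply here only exists to show the per-group Python overhead):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({'key': ['a', 'b', 'c'] * 200_000, 'val': 1.0})

    # pandas: runs Python once per group, then re-assembles the results
    slow = df.groupby('key').apply(lambda g: g['val'].sum())

    # duckdb: the same aggregation, vectorized, straight off the dataframe
    fast = duckdb.sql("SELECT key, sum(val) AS val FROM df GROUP BY key").df()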

For one of my projects I have what sounds like a dumb workflow:

- JSON API fetches get cached in sqlite3
- Parsing the JSON gets done with sqlite3's JSON operators (Fast! Fault tolerant! Handles NULLs nicely! Fast!!)
- Collating the data later gets done with duckdb queries
- Everything gets munged and aggregated into the shape I want and is persisted in parquet files
- When it's time to consume it, duckdb queries my various sources, does my (used-to-be-expensive) groupbys on the fly and spits out pandas data frames
- Lastly, those data frames are small-ish, tidy and flexible

So yeah, on paper it sounds like these 3 libraries overlap too much to be used at the same time, but in practice they can each have their place and interact well.



