Pandas 2.0

mynameisash · on April 3, 2023

I'm curious if there will be any appreciable performance gains here that are worthwhile. FWIW, last I checked[0], Polars still smokes Pandas in basically every way.

[0] https://www.pola.rs/benchmarks.html

wenc · on April 3, 2023

I’ve moved entirely to Polars (which is essentially Pandas written in Rust with new design decisions) with DuckDB as my SQL query engine. Since both are backed by Arrow, there is zero copy and performance on large datasets is super fast (due to vectorization, not just parallelization)

I keep Pandas around for quick plots and legacy code. I will always be grateful for Pandas because there truly was no good dataframe library during its time. It has enabled an entire generation of data scientists to do what they do and built a foundation — a foundation which Polars and DuckDB are now building on and have surpassed.

brahbrah · on April 3, 2023

(Taken from an old comment of mine)

If you were to say “pandas in long format only” then yes that would be correct, but the power of pandas comes in its ability to work in a long relational or wide ndarray style. Pandas was originally written to replace excel in financial/econometric modeling, not as a replacement for sql. Models written solely in the long relational style are near unmaintainable for constantly evolving models with hundreds of data sources and thousands of interactions being developed and tuned by teams of analysts and engineers. For example, this is how some basic operations would look.

Bump prices in March 2023 up 10%:

    # pandas
    prices_df.loc['2023-03'] *= 1.1

    # polars
    polars_df.with_column(
        pl.when(pl.col('timestamp').is_between(
            datetime('2023-03-01'),
            datetime('2023-03-31'),
            include_bounds=True
        )).then(pl.col('val') * 1.1)
        .otherwise(pl.col('val'))
        .alias('val')
    )

Add expected temperature offsets to base temperature forecast at the state county level:

    # pandas
    temp_df + offset_df

    # polars
    (
        temp_df
        .join(offset_df, on=['state', 'county', 'timestamp'], suffix='_r')
        .with_column(
           ( pl.col('val') + pl.col('val_r')).alias('val')
        )
        .select(['state', 'county', 'timestamp', 'val'])
    )

Now imagine thousands of such operations, and you can see the necessity of pandas in models like this.

wenc · on April 3, 2023

Point taken but most data wrangling these days — especially at scale — is of the long and thin variety (what is also known as 3rd normal form or tidy format — which actually allows for more flexibility if you think in terms of coordinatized data theory) where aggregations and joins dominate column operations (Pandas’ also allows array like column operations due to its index but there are other ways to achieve the same thing).

I typical do the type of column operation in your example only on subsets of data, and typically I do it in SQL using DuckDB. Interop between Polars and DuckDB is virtually zero cost so I seamlessly move between the two. And to be honest I don’t remember the last time I needed to do this but that’s just the nature of my work and not a generalized statement.

But yes if you are still in a world where you need to perform Excel like operations then I agree.

ritchie46 · on April 3, 2023

This is far more elegant in pandas due to the implicit behavior of the index.

But you can move the explicitness of polars behind a function. A more explicit API should not hurt maintainability if we structure our code right.

brahbrah · on April 3, 2023

So something like this?

    def add(df1, df2, meta_cols, val_cols=None):
        # join on meta cols
        # add val cols (default to all non meta cols if None)
        # return df with all meta and val cols selected

In theory I think that's fine. The problem is that in practice this will cause a lot of visual noise in your models, since for every operation you would need to specify, at least, your meta columns, and potentially value columns too. If you change the dimensionality of your data, you would need to update everywhere you've specified them. You could get around this a bit by defining the meta columns in a constant, but that's really only maintainable at a global module level. Once you start passing dfs around, you'll have to pass the specified columns as packaged data around with the df as well. There's also the problem that you'd need to use functions instead of standard operators.

One thing that would be nice to do is set an (and forgive me, I understand the aversion to the word "index") index on the polars dataframe. Not a real index, just a list of columns that are specified as "metadata columns". This wouldn't actually affect any internal state of the data, but what it would do is affect the path of certain operations. Like if an "index" is set, then `+` does the join from above, rather than the current standard `+` operation.

In any case I realize this is a major philosophical divergence from the polars way of thinking, so more just shooting shit than offering real suggestions.

tehf0x · on April 3, 2023

Now imagine the other side of this equation, where pandas seems too clunky, behold YOLOPandas https://pypi.org/project/yolopandas/ i.e. `df.llm.query("What item is the least expensive?")`

kylebarron · on April 3, 2023

DuckDB is not backed by Arrow, but by something similar enough that it can be zero copy to Arrow in many cases https://duckdb.org/2021/12/03/duck-arrow.html

RockyMcNuts · on April 3, 2023

TFA: Pandas 2.0 is also backed by Arrow as an option, yielding a large speed improvement, although not all the way to Polars

next_xibalba · on April 3, 2023

How well does Polars play with many of the other standard data tools in Python (scikit learn, etc.)? Do the performance gains carry over?

wenc · on April 3, 2023

It works as well as Pandas (realize that scikit actually doesn’t support Pandas — you have to cast a dataframe into a Numpy array first)

I generally work with with Polars and DuckDB until the final step, when I cast it into a data structure I need (Pandas dataframe, Parquet etc)

All the expensive intermediate operations are taken care of in Polars and DuckDB.

Also a Polars dataframe — although it has different semantics — behaves like a Pandas dataframe for the most part. I haven’t had much trouble moving between it and Pandas.

jononor · on April 3, 2023

You do not have to cast pandas DataFrames when using scikit-learn, for many years already. Additional in recent version there has been increasing support for also returning DataFrames, at least with transformers and checking column names/order.

wenc · on April 3, 2023

Yes that support is still not complete. When you pass a Pandas dataframe into Scikit you are implicitly doing df.values which loses all the dataframe metadata.

There is a library called sklearn-pandas which doesn’t seem to be mainstream and dev has stopped since 2022.

westurner · on April 3, 2023

From pandas-dataclasses #166 "ENH: pyarrow and optionally pydantic" https://github.com/astropenguin/pandas-dataclasses/issues/16... :

> What should be the API for working with pandas, pyarrow, and dataclasses and/or pydantic?

> Pandas 2.0 supports pyarrow for so many things now, and pydantic does data validation with a drop-in dataclasses.dataclass replacement at pydantic.dataclasses.dataclass.

Model output may or may not converge given the enumeration ordering of Categorical CSVW columns, for example; so consistent round-trip (Linked Data) schema tool support would be essential.

CuML is scikit-learn API compatible and can use Dask for distributed and/or multi-GPU workloads. CuML is built on CuDF and CuPY; CuPy is a replacement for NumPy arrays on GPUs with 100x relative performance.

CuPy: https://github.com/cupy/cupy :

> CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. CuPy acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms.

https://cupy.dev/ :

> CuPy is an open-source array library for GPU-accelerated computing with Python. CuPy utilizes CUDA Toolkit libraries including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN and NCCL to make full use of the GPU architecture.

> The figure shows CuPy speedup over NumPy. Most operations perform well on a GPU using CuPy out of the box. CuPy speeds up some operations more than 100X. Read the original benchmark article Single-GPU CuPy Speedups on the RAPIDS AI Medium blog

CuDF: https://github.com/rapidsai/cudf

CuML: https://github.com/rapidsai/cuml :

> cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions that share compatible APIs with other RAPIDS projects.*

> cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.

> For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.

FWICS there's now a ROCm version of CuPy, so it says CUDA (NVIDIA only) but also compiles for AMD. IDK whether there are plans to support Intel OneAPI, too.

What of the non-Arrow parts of other pandas-compatible and not pandas-compatible DataFrame libraries can be ported back to Pandas (and R)?

benrutter · on April 3, 2023

My experience is that Polars generally isn't supported by third party libraries, hopefully tjst'll change soon but in the meantime it has some pretty snappy to and from pandas functionality, so converting when you hit a library that needs pandas often still brings a good deal of speedups.

wenc · on April 3, 2023

To be honest I haven’t actually worked with any third party libraries that actually supported Pandas directly — most ML libs like Scikit require Numpy arrays (not Pandas dataframes) as inputs so I’ve always had to cast to Numpy from Pandas anyway.

But yes, Polars and DuckDB can easily cast to Pandas and also read Pandas dataframes in memory. I have some legacy data transformations that are mostly DuckDB but involve some intermediate steps in Pandas (because I didn’t want to rewrite them) and it’s all seamless (though not zero copy as it would be in a pure Arrow workflow).

And ironically DuckDB can query Pandas dataframes faster than Pandas itself due to its vectorized engine.

ddotta · on April 4, 2023

You want to follow all the news related to Polars, have a look and star the Awesome Polars List ! https://github.com/ddotta/awesome-polars

berkle4455 · on April 3, 2023

> It has enabled an entire generation of data scientists to do what they do

The SQL you're using finally in 2023 has enabled data scientists to do what they do for decades. Pandas was a massive derailment and distraction in what otherwise would have been called progress.

brahbrah · on April 5, 2023

To be fair pandas was never meant to be a replacement for sql, it was meant to be a replacement for excel in financial models. Which it still excels at (pun intended).

edublancas · on April 3, 2023

How are you running your SQL queries?

If you use notebooks: my team is working on JupySQL, a tool to improve the SQL experience in Jupyter. https://github.com/ploomber/jupysql

wenc · on April 3, 2023

I’m using DuckDB in Jupyter and Python. DuckDB is the SQLite equivalent for complex analytic queries on columnar data (Parquet, CSV etc.)

EricLeer · on April 3, 2023

As someone who mainly uses pandas, what is the benefit of using DuckDB to write your queries over using pandas (or polars) to operate on the data. Is it that you can already subset the data without loading it into memory?

wenc · on April 3, 2023

I use DuckDB because I can express complex analytic queries better using a SQL mental model. Most software people hate SQL because they can never remember its syntax, but for data scientists, SQL lets us express our thoughts more simply and precisely in a declarative fashion — which gets us query plan optimization as a plus.

People have been trying to get rid of SQL for years yet they only end up reinventing it badly.

I’ve written a lot of code and the two notations I always gravitate toward are the magrittr + dplyr pipeline notation and SQL.

The chained methods notation is a bit too unergonomic especially to express window functions and complex joins.

Spark started out with method chaining but eventually found that most people used Spark SQL.

phoobahr · on April 5, 2023

In addition to this here's one really specific case: ever had a pandas groupby().apply() that took forever often mostly re-aggregating after the apply?

With columnar data DuckDuckGo is somuchfaster at this.

For one of my projects I have what sounds like a dumb workflow: - JSON api fetches get cached in sqlite3 - Parsing the JSON gets done with sqlite3 JSON operators (Fast! Fault tolerant! Handles NULLs nicely! Fast!!). - Collating data later gets queried with duckdb - everything gets munged and aggregated into the shape I want it and is persisted in parquet files - When it's time to consume it duckdb queries my various sources, does my (used to be expensive) groupbys onthefly and spits out pandas data frames - Lastly those data frames are small-ish, tidy and flexible

So yeah, on paper it sounds like these 3 libraries overlap too much to be use at the same time but in practice they can each have their place and interact well.

ritchie46 · on April 3, 2023

Polars author here. I have run the TPC-H benchmark against polars and pandas 2.0 backed by arrow types.

https://github.com/pola-rs/tpch/pull/36

Pandas having arrow as backend is great and will make interop with the arrow community (and polars) much better.

However, if you need performance, polars remains orders of magnitudes faster on whole queries, changing to the arrow memory format does not change that.

red_hare · on April 3, 2023

Biggest performance games most people will see will be dealing with strings when using the pyarrow backend since those are now a native type and not wrapped in a python object.

But for people who are looking for the performance polars gives with all the nice APIs of pandas, the big news is, since polars and pandas will now both use arrow for the underlying data, you can convert between the two kinds of dataframes without copying the data itself.

    polars_df = polars.from_pandas(df)
    # ... do performance heavy stuff ...
    df = polars.to_pandas(polars_df)

There's a good article on it here: https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...

dr_kiszonka · on April 3, 2023

I find myself using Pandas .apply() and pd.read_csv() quite often which can get a little slow when dealing with lots of rows and files.

Would you happen to know if these two functions are faster in Polars?

wenc · on April 3, 2023

Read_csv - yes. I just tried loading a 5MM row csv. Pandas took 5.5 secs. Polars took 0.6 secs.

.apply - might be faster in Polars but will not be as fast as using native expressions. That’s because applying a Python function invokes the GIL which kills parallelization and this is an inherent limitation of Python.

That said, I try not to use Python functions these days. I write transformations in SQL (or native expressions in Polars) and these can be executed at full speed with complete vectorization and parallelization.

dr_kiszonka · on April 3, 2023

Thanks, wenc - I will give loading csvs with polars a try!

Moving to SQL sounds like a good idea. I just don't have the time to convert my codebase and configure everything correctly.

hn2017 · on April 3, 2023

would things like "to_sql" be faster in Polars too?

raihansaputra · on April 5, 2023

Another alternative is to try Modin. They wrap a scalable engine (doesn't have to be multiple machine, speedup happens locally from single -> multithread) with Pandas API so you just need to change your imports. .apply() is also automatically parallelized through Modin.

https://modin.readthedocs.io/en/stable/getting_started/quick...

dr_kiszonka · on April 8, 2023

This looks great! Thank you, raihansaputra!

MuffinFlavored · on April 3, 2023

What's the threshold for when you need to use something like pola.rs instead of just fitting the majority of your data set into memory? If your computer is 8GB and you have ~4-6GB of memory free, you need to be working with a data set that is at least 4GB+, right? Is that a good way to look at it?

Kind of like "do I really need k8s?", "does my workload dictate I need SQL OLAP (Online Analytical Processing) / data frame lazy loading data library?"

what's the general rule of thumb to know "you're missing out by not using existing library like pandas/polars" for somebody who is out of the loop on this kind of stuff

wenc · on April 3, 2023

I don’t know if it’s still true but Wes McKinney’s (Pandas author) rule of thumb for Pandas was “have 5 to 10 times as much as RAM as your dataset”. But he wrote this in 2017 so things may have changed.

That said, this is a 2023 comparison of Pandas and Polars memory usage.

https://pythonspeed.com/articles/polars-memory-pandas/

alfalfasprout · on April 3, 2023

A lot of libraries depend on numpy directly. Unless polars is a drop in replacement (which I could be wrong but it doesn't seem like it is) then ultimately there's no avoiding pandas in many cases.

infinite8s · on April 7, 2023

Numpy is mostly just a thin metadata wrapper around memory buffers. If pola.rs stores data in that format it should be able to give you zero-copy numpy objects pointing to the underlying pola.rs data.

nojito · on April 3, 2023

You can just call to_numpy() which is zero copy.

alphanullmeric · on April 3, 2023

Too bad polars like everything else written in rust has horrible ergonomics. I don’t know how anyone can look at the polars rust API and say in good faith that not having default arguments and named parameters was a good idea.

wenc · on April 3, 2023

Although Polars is written in Rust, most people will use Polars from Python since that what most data wranglers use. And Polars’ Python interface is excellent.

detrites · on April 3, 2023

Do you happen to know if this can be used as more or less a drop-in replace for pandas in Python? (Presuming, I suppose data conversion prior.)

wenc · on April 3, 2023

It depends on what you do with Pandas. The semantics are different so you’ll have to rewrite stuff — that said, for me it was worth it for the most part because I work with massive columnar data in Parquet.

So no, not a drop in replacement. But not a difficult transition either.

This page explains how Polars differs from Pandas.

https://pola-rs.github.io/polars-book/user-guide/coming_from...

valarauko · on April 3, 2023

As someone who loathes the Pandas syntax and lusts for the relatively cleaner tidyverse code of my colleagues, the Polar syntax just feels ... off (from the link):

df.with_columns( pl.when(pl.col("c") == 2) .then(pl.col("b")) .otherwise(pl.col("a")).alias("a") )

Seeing the multiple nested pl calls within a single expression just feels odd to me. It's definitely reminiscent of Dplyr but in a much less elegant way.

wenc · on April 3, 2023

You can use DuckDB (SQL syntax) then just convert to Polars (instantaneous).

The method chaining syntax is unwieldy in any language.

Magrittr + dplyr (tidyverse) pipeline syntax is beautiful syntax but there’s a lot of magic with NSE (nonstandard evaluation) that makes it really tricky when you need to pass a variable column name.

I’ve sort of converged on SQL as the best compromise.

valarauko · on April 3, 2023

I like the idea of Polars, and it's on my list of things to try. Unfortunately, my current codebase in heavily tied to Pandas (bioinformatics tools that use Pandas as a dependency). I also deal mostly in sparse matrices, which I'm unsure if duckdb handles.

Looking at the Polars documentation also makes me nervous, due to how much of my current Pandas-fu relies on indexing to work. I appreciate that indexes can be NSE but it's how a lot of the current tools in my field work (python and in R) with important data in the index, eg, genes or cellular barcodes, and relying on the index to merge datasets.

Another caveat for me is at multiple times in my workflow I drop into R for plots and rpy2 can convert pandas dataframes into equivalent R atomics. With Polars it would be just an additional step of converting to pandas df but just something I need to consider. That said, I've disliked the Pandas syntax for so long that the mental overhead might be well worth it.

dunefox · on April 3, 2023

Yeah, I tried to get into it but it seems very verbose.

detrites · on April 3, 2023

Thank you, that is definitely the page for me.

aguspdana · on April 3, 2023

I agree that the Rust API is still rough. But..

Polars has Python API. It's much nicer than the Rust API. Plus the documentation is more complete.

Laiho · on April 3, 2023

Couldn't agree more about the rust API. Python API seems OK. Found it very complicated to use and just clunky in general.

westurner · on April 3, 2023

One could run the benchmarks with the new version of the software under concern and report back

modriano · on April 3, 2023

Jeff Reback gave a presentation on the roadmap for Pandas at PyData NYC 2022 [0]. In it, he basically says that pandas is used so widely in industry that big breaking changes are a non-starter, there won't be any radical changes to the API, but more performant implementations can/will be built into the library (although not set as defaults, at least not for a long while). Not a revolutionary leap, but a move towards making Wes McKinney's Arrow work more accessible through pandas.

[0] https://www.youtube.com/watch?v=85XdsWz_Q_o

__mharrison__ · on April 3, 2023

This is why Modin and Ponder are game-changers for Pandas folks. You keep the API but get scaleout.

benrutter · on April 3, 2023

I know arrow support is only part way their with this release - but this is a huge deal for Pandas, for standardisation as whole, but also speed ups.

Benchmarking that was shared a while back here suggests 2x speed ups in some cases, 30x if you count strings since pandas uses python's in-built string data type[1]

[1]https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...

wenc · on April 3, 2023

Polars is already built on Arrow and the advantages are tremendous.

benrutter · on April 3, 2023

Yeah Polars is an awesome library! I don't know a whole bunch about its internals, but I think it implements a bunch of additional speed ups through memory allocation and lazy evaluation, so its unlikely Pandas is will get to a similar speed without some huge changes elsewhere

wenc · on April 3, 2023

Yup here’s a list of optimizations that Polars does to achieve the speed that it does

https://pola-rs.github.io/polars-book/user-guide/#current-st...

Polars’ lazy evaluation is a big deal — this lets it do query plan optimization.

Whereas in Pandas every step is eager so it can’t look ahead to eliminate redundant steps. You basically can’t do a lot of query optimization in a multi step transform.

0cf8612b2e1e · on April 3, 2023

Not just speed, but I thought one of the big Arrow wins relative to numpy processing was better memory usage.

mattrighetti · on April 3, 2023

What's new here [0], saved you a click.

[0]: https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v...

kinow · on April 3, 2023

Your link is broken for me, but going to their website and clicking on the 2.0 what's new link takes me to the same URL. They might be updating it... the closest I found was the Sphinx docs source for that: https://github.com/pandas-dev/pandas/blob/main/doc/source/wh...

huskyr · on April 3, 2023

Link works fine for me.

0cf8612b2e1e · on April 3, 2023

A quick skim shows a lot of quality of life improvements. Unless I am misreading, it looks like it is still a numpy backed by default. I thought one of the drivers for the 2.0 was to make Arrow the default.

kermatt · on April 3, 2023

Pre-release this was a reason [1]. Not sure if its still a compatibility reason:

> There is also an option to let pandas know we want Arrow backed types by default. The option at the time of writing this article is partially implemented and has a confusing API. In particular, it's not yet working when creating data with pandas.Series or pandas.DataFrame. And for loading data from files it will only work when the parameter use_nullable_dtypes is set to True. For example, to load a CSV file with PyArrow directly into PyArrow backed pandas Series, you can use the next code:

> pandas.options.mode.dtype_backend = 'pyarrow'

[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...

miohtama · on April 3, 2023

Is there any comparison, performance and feature wise, between different backends yet? Or is it too early?

wodenokoto · on April 3, 2023

strings as objects and integers turning into floats when NaNs are introduced have been a much bigger annoyance to me, than it ought to.

I'm excited to try out the new pyarrow dtypes, but it also sounds confusing that there are now 2 classes of types

agent281 · on April 3, 2023

Yeah, I stopped using pandas entirely for ETL for this exact reason. If you are trying to maintain the fidelity of the data while cleaning, automatic casting is awful. If the new backend prevents automatic casting, it might be worth reconsidering for me.

wodenokoto · on April 3, 2023

what do you use instead?

agent281 · on April 3, 2023

We're using AWS glue (basically pyspark) right now. I used standard python before.

I've implemented a function for schema based processing JSON documents for both vanilla python and pyspark that makes the process really easy. It'll take a schema and a document and product a list of flat dictionaries for python or a data frame for pyspark. Vanilla python is really streamable and keeps memory overhead low so it was actually faster than the pandas based workflows that it replaced.

hospitalJail · on April 3, 2023

>strings as objects and integers turning into floats when NaNs are introduced have been a much bigger annoyance to me, than it ought to.

Nah, you rightly are annoyed. When I am writing unit tests, it is especially annoying to fix the type.

henrydark · on April 3, 2023

Say what you will about sql, polars, pyspark, or whatever else. But nothing beats pandas' df[col].value_counts().value_counts()

humanistbot · on April 3, 2023

Can your entire dataframe fit in memory? Then you're good with pandas. If not, that's why those alternatives exist.

edublancas · on April 3, 2023

How are people managing the existence of data frame APIs like pandas/polars with SQL engines like BigQuery, Snowflake, and DuckDB?

Most of my notebooks are a mix of SQL and Python: SQL for most processing, dump the results as a pandas dataframe (via https://github.com/ploomber/jupysql) and then use Python for operations that are difficult to express with SQL (or that I don't know how to do it), so I end up with 80% SQL, 20% Python.

Unsure if this is the best workflow but it's the most efficient one I've come up with.

Disclaimer: my team develops JupySQL.

Helmut10001 · on April 3, 2023

I checked most solutions and I am sticking with plain old SQL in triple-qoutes in jupyter, e.g. [1]. I don't need auto-completion in SQL and I don't really need syntax highlighting. It's also very nice to have the combination of f-strings/variables substitution and SQL. Yet, my SQL needs are very basic.

[1]: https://kartographie.geo.tu-dresden.de/ad/wip/ephemeral_even...

__mharrison__ · on April 3, 2023

You should check out https://ponder.io .

It is created by the folks who made Mondin (a scale out version of Pandas with API compatibility as a goal). Can use dask or ray as a backend.

Ponder is the enterprise version that runs on Snowflake and BigQuery. Again, same goal, API compatibility with Pandas. You can scale out your Pandas workflow by changing the import and leaving the Pandas code.

(Full disclosure I'm an advisor.)

wswope · on April 3, 2023

As a fellow evangelist of the SQL + Pandas hybrid workflow, I’m a happy camper with Pandas’ built-in read_sql_query and to_sql.

Only big pain points are having to ship around boilerplate to construct SQLAlchemy create_engine URIs, and the performance limitations of SQLAlchemy’s inserts (if moving anything larger than a few gigs, it typically pays to ditch to_sql, and write a db-specific bulk insert process instead).

fauxpause_ · on April 3, 2023

> Accessing a single column of a DataFrame as a Series (e.g. df["col"]) now always returns a new object every time it is constructed when Copy-on-Write is enabled (not returning multiple times an identical, cached Series object). This ensures that those Series objects correctly follow the Copy-on-Write rules (GH49450)

Is this going to mean I can’t do df[‘a’] = 2 to set all values in column a to 2?

miohtama · on April 3, 2023

Copy-on-write is not enabled by default, but needs to be explicitly enabled with an option or a context manager.

fauxpause_ · on April 3, 2023

But if it were enabled, am I correct

cmcconomy · on April 3, 2023

I'd love to know when geopandas snaps into alignment

villgax · on April 3, 2023

How I wished they just changed the .apply for adding progress and parallelization by default instead of resorting to tqdm & swifter/dask or what have you

mongol · on April 3, 2023

Is it correct that I associate to a type of Chinese bears when I read about this project? Or is Pandas an acronym?

ghshephard · on April 3, 2023

"The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s"

https://www.tutorialspoint.com/python_pandas/python_pandas_p...

Guybrush_T · on April 3, 2023

Apparently it's taken from panel data.

bb1234 · on April 4, 2023

If interested in benchmarks comparing different dataframe implementations, here is one:

https://h2oai.github.io/db-benchmark/

HopenHeyHi · on April 3, 2023

You want to link to the actual release notes: https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v...

nonfamous · on April 3, 2023

Not a lot of people realize that Pandas was inspired by R, and in particular the Tidyverse model of handling rectangular data frames, created originally by Hadley Wickham. These days R is primarily used by data scientists in academia and certain niche industries like pharma, but its impact goes way beyond its core user base.

next_xibalba · on April 3, 2023

> and in particular the Tidyverse model

This isn't really true.

R took data frames from S, which was using the concept at least as early as 1991.

Pandas itself predates most if not all of the tidyverse. Pandas original release occurred in 2008, whereas the first release of dplyr (one of the original packages of the tidyverse), didn't come until 2014.

nonfamous · on April 3, 2023

It's definitely true that R's base data frame (a rectangular structure with columns of mixed types, which R in turn inherited from S) was the inspiration for Pandas. The concept of verbs operating on those structures IMO was inspired by plyr (the antecedent to dplyr, first published in 2008, which introduced composability for those verbs). data.table was also an inspiration, as another commenter points out.

kgwgk · on April 3, 2023

> The concept of verbs operating on those structures IMO was inspired by plyr

Was it?

(I have no idea. But R already had verbs operating on data frames.)

marshray · on April 3, 2023

As nonfamous puts it, the concept of 'a rectangular structure with columns of mixed types' as a programming language concept goes back at least to SAS (1976). Probably older.

In terms of that organization for persistent storage it certainly goes back to the earliest computers and even pre-computer punch card sorting systems.

kgwgk · on April 3, 2023

> R took data frames from S, which was using the concept at least as early as 1991.

And, according to its author, Pandas took data frames from R - where data frames had been present from at least as early as 1997. (That part at least was true.)

disgruntledphd2 · on April 3, 2023

But were inherited from S, which was first released in 1973.

kgwgk · on April 3, 2023

S was first released in 1973 but data frames were added almost two decades later, as mentioned in another comment. They were reimplemeted -a few years later- in R and served as inspiration -a decade later- for pandas.

disgruntledphd2 · on April 4, 2023

Yeah, I know. Apologies I was on mobile, and didn't want to look up which S Book data.frames were in.

milliams · on April 3, 2023

In fact I seem to recall that the Tidyverse paper references Pandas.

jsmith99 · on April 3, 2023

R also has the data.table package, possibly the fastest dataframe package in any language.

goatlover · on April 3, 2023

Wonder if it's faster than Julia's DataFrames.jl

disgruntledphd2 · on April 3, 2023

Polars is apparently faster, though I haven't tested it myself.

jsmith99 · on April 3, 2023

The original Polars launch blog was proud of coming a distance second to data.table but I see the latest benchmarks against the python port of data.table give Polars a tiny edge.

disgruntledphd2 · on April 4, 2023

Both polars and data.table should be proud!

Fast software only ever helps.

johnmyleswhite · on April 3, 2023

> Not a lot of people realize that Pandas was inspired by R, and in particular the Tidyverse model of handling rectangular data frames, created originally by Hadley Wickham.

Can you give a citation from this, preferably from Wes himself?

Tarq0n · on April 4, 2023

Dplyr API design is leages ahead of Pandas though. In part this is because Python doesn't have first-class Symbols or lazy evaluation, but also because the dplyr authors have a better feel for ergonomics.

Sadly this 2.0 release makes clear that Pandas is unlikely to evolve its API much further.

lvl102 · on April 3, 2023

I still prefer Stata if I am completely honest with myself.

humanistbot · on April 3, 2023

You're comparing apple pie and orange juice here (even worse than comparing apples and oranges). Pandas is an open-source library for people who know python programming. Stata is an expensive GUI-based suite for non-programmers. If you're not in university, government, or non-profit, Stata's single-CPU license is $840/year in the US. Even a single-CPU student license is $94/year.

lvl102 · on April 3, 2023

Have you ever used Stata? You sound like someone who never used Stata for serious data analysis.

crimsoneer · on April 3, 2023

The only people still using stata are 40+ year old economics professors and the poor lab students they force into obsolescence. Everybody else is using R (if you're into econometrics) or Python (if you're into ML).

0cf8612b2e1e · on April 3, 2023

Stata is still big in pharma. There are moves happening to R, but it is a conservative space where businesses do not want to rock the boat too much with the FDA. If nothing else as a Python user, I think the pinning/reproducibility story in R still has quite a ways to go.

lvl102 · on April 3, 2023

Oh wow. I stand corrected. You sound like someone with decades of experience.

crop_rotation · on April 3, 2023

Do you have any counter point to the above mentioned pricing? If not, that illustrates a huge difference.

postexitus · on April 3, 2023

you sound more like someone who doesn't know what they are talking about.

binarymax · on April 3, 2023

I mean that’s fine, but is there any reason? Just throwing your half second opinion out there on a product release announcement isn’t really helpful.

dr_kiszonka · on April 3, 2023

Stata has wonderful documentation and tons of useful tools and visualizations. I stopped using it about 7 years ago because of the price and I didn't like the scripting language. Still, if someone asked me to quickly run a non-trivial regression model and I had a copy of Stata, I would probably use it.

(I have just learned about PyStata, which is very interesting...)

spaniard89277 · on April 3, 2023