
A few months ago I tried migrating a large pandas codebase to polars. I'm not much of a fan of doing analytics/data pipelining in Python - a complex transformation takes me 2-5x as long in pandas compared to Julia or R (using dataframes.jl & dplyr).

Unfortunately polars was not it. Too many bugs in standard operations, unreliable interoperability with pandas (an issue since so many libraries require pandas dataframes as inputs), and the API is also very verbose for a modern dataframe library, though it's still better than pandas.

Hopefully these will get resolved over time, but for now I had the best luck using duckdb on top of pandas: it is as fast as polars but more stable, with better interoperability.
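A minimal sketch of that setup (the toy DataFrame is made up; duckdb can scan local pandas DataFrames by variable name, and .df() converts the result back to pandas):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"region": ["a", "a", "b"], "sales": [10, 20, 5]})

    # DuckDB picks up local pandas DataFrames by variable name;
    # .df() converts the result back into pandas.
    out = duckdb.sql(
        "SELECT region, SUM(sales) AS total FROM df GROUP BY region"
    ).df()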

Eventually I hope the Python dataframe ecosystem gets to the same point as R's, where you have an analytics-oriented dataframe library with an intuitive API (dplyr) that can be easily used alongside a high-performance dataframe library (data.table).




I got annoyed at the verbosity as well. Pandas is fairly verbose compared to e.g. data.table, but Polars really feels more like using "an API" than "a data manipulation tool".

I probably wouldn't use it for EDA or research, but I have started to use it in certain production scripts for the better performance.

R dplyr + data.table is still my favorite data manipulation experience. I just wish we had something like Matplotlib in R: ggplot is too high level, base graphics are too low level. Also Scikit-Learn is much more modular than Caret, which I don't really miss using.


Have you tried the "grid" graphics package in R? It's the basis for ggplot. It's a bit of an unsung hero; the documentation is not great, but I think it is a very solid library.


Is it usable on its own? I only ever interacted with it in trying to hack around something I didn't like in ggplot, and it didn't seem like something I could use "by hand". In hindsight it does sound a lot like what MPL does. I can take a look!


> I just wish we had something like Matplotlib in R

plotly could be worth a try; I use its Python bindings and much prefer it to matplotlib, but I don't know much about the quality of its R API.
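For anyone curious, a minimal example with the Python bindings (plotly.express, using one of its bundled sample datasets):

    import plotly.express as px

    df = px.data.iris()  # sample dataset bundled with plotly
    fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
    fig.show()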


yeah, I haven't used Polars but from skimming the docs it looks kind of enterprisey. I don't want to type `df.select(pl.col("a"))` instead of `df["a"]`.


Latter also works.
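Both spellings side by side; note they aren't strictly equivalent (select returns a one-column DataFrame, brackets return a Series):

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    df.select(pl.col("a"))  # expression API: one-column DataFrame
    df["a"]                 # bracket indexing: Series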


I am very curious to know how you feel about PRQL (prql-lang.org)? It aims to give you the ergonomics of dplyr wherever you use SQL (by compiling to SQL).

IMHO this gives you the DX of dplyr / Polars / Pandas combined with the power and universality of SQL, because you can still execute your queries on any SQL-compatible query execution engine of your choice, including Polars and Pandas but also DuckDB, ClickHouse, BigQuery, Redshift, Postgres, Trino/Presto, SQLite, ... to name just a few popular ones.
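For a taste, a sketch using the prql-python compiler package (assuming its compile() entry point; the table and column names are made up, and the query syntax follows the current PRQL spec):

    import prql_python as prql

    query = """
    from invoices
    filter total > 100
    select {customer_id, total}
    """

    sql = prql.compile(query)  # plain SQL, runnable on DuckDB, Postgres, etc.
    print(sql)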

I'd love to hear your thoughts, either in a Discussion on Github (https://github.com/PRQL/prql/discussions) or on our Discord (https://discord.com/invite/XWxbCrWr)!

Disclaimer: I'm a PRQL contributor.


Nice, you have experience with data frames in R, Python, and Julia! Which one of those do you like the most? I know the ecosystems aren't really comparable, but from your experience, which one is the best to work with for core operations, etc.?


Not OP but R data.table + dplyr is an unbeatable combo for data processing. I handily worked with 1bn record time series data on a 2015 MBP.

The rest of the tidyverse stuff is OK (like forcats), but the overall ecosystem is a little weird. The focus on "tidy" data itself is nice up to a point, but sometimes you want to just move data around in imperative style without trying to figure out which "tidy verb" to use, or trying to learn yet another symbol interpolation / macro / nonstandard eval system, because they seem to have a new one every time I look.

Pandas is a real workhorse overall. Data.table is like a fast sports car with a very complicated engine, and Pandas is like a work van. It's a little of everything and not particularly excellent at anything and that's ok. Also its index/multiindex system is unique and powerful. But data.table always smoked it for single-process in-memory performance.

Until DuckDB and Polars, there was no Python equivalent of data.table at all. They're great when you want high performance, native Arrow (read: Parquet) support, and/or an interface that feels more like a programming library than a data processing tool. If you're coming from a programming background, or if you need to do some data processing or analytics inside a production system, those might be good choices. The Polars API will also feel very familiar to users of Spark SQL.
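A sketch of the lazy Polars style (file and column names made up; the optimizer pushes the filter and projection down into the Parquet scan):

    import polars as pl

    out = (
        pl.scan_parquet("events.parquet")       # hypothetical file
          .filter(pl.col("status") == "ok")
          .group_by("user_id")                  # `groupby` in older releases
          .agg(pl.col("latency_ms").mean())
          .collect()                            # nothing is read until here
    )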

For geospatial data, Pandas is by far the best option due to GeoPandas and now SpatialPandas. There is an alpha-stage GeoPolars library, but I have no idea who's working on it or how productive they will be.

If you had to learn one and only one, Pandas might still be the best option. Python is a much better general-purpose language than R, as much as I love R. And Pandas is probably the most flexible option. Its index system is idiosyncratic among its peers, but it's quite powerful once you get used to using it, and it enables some interesting performance optimization opportunities that help it scale up to data sets it otherwise wouldn't be able to handle. Pandas also has pretty good support for time series data, e.g. aggregating on monthly intervals. Pandas also has the most extensibility/customizability, with support for things like custom array back ends and custom data types. And its plotting methods can help make Matplotlib less verbose.
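For example, the monthly aggregation bit with resample (toy series, made-up values):

    import pandas as pd

    idx = pd.date_range("2023-01-01", periods=90, freq="D")
    s = pd.Series(range(90), index=idx)

    monthly = s.resample("MS").sum()  # "MS" = calendar month, month-start labels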

I've never gotten past "hello world" with Julia, not for lack of interest, but mostly for lack of time and need. I would be interested to hear about that comparison as well.


At a previous job, I regularly worked with dfs of millions to hundreds of millions of rows, and many columns. It was not uncommon for the objects I was working with to use 100+ GB of RAM. I coded initially in Python, but moved to Julia when the performance issues became too painful (10+ minute operations in Python that took < 10s in Julia).

DataFrames.jl, DataFramesMeta.jl, and the rest of the ecosystem are outstanding. Very similar to pandas, and much ... much faster. If you are dealing with small (obviously subjective as to the definition of small) dfs of around 1000-10000 rows, sticking with pandas and python is fine. If you are dealing with large amounts of real world time series data, with missing values, with a need for data cleanup as well as analytics, it is very hard to beat Julia.

FWIW, I'm amazed by DuckDB, and have played with it. The DuckDB Julia connector gives you the best of both worlds. I don't need DuckDB at the moment (though I can see this changing), and use Julia for my large scale analytics. Python's regex support is fairly crappy, so my data extraction is done using Perl. Python is left for small scripts that don't need to process lots of information, and can fit within a single terminal window (due to its semantic space handicap).


That's a nice endorsement, I've always liked the idea of Julia as an R replacement. I'll definitely give that a shot when I have a chance.

Is there any kind of decent support for plotting with data frames? Or does Plots.jl work with it out of the box?


Plots.jl can work. You may also want to try https://makie.org/ and https://github.com/TidierOrg/TidierPlots.jl


Ha I like your description of pandas as a work van. I totally have that same feel for it. It's great because it works, not because it's great :)


I'm not the person you replied to, but I have experience with all of these. My background is computer science / software engineering, incorporating data analysis tools a few years into my career, rather than starting with a data analysis focus and figuring out tools to help me with that. In my experience, this seems to lead to different conclusions than the other way around.

tldr: Julia is my favorite.

I could never click with R. It is true that data.table and dplyr and ggplot are well done and I think we owe a debt of gratitude to the community that created them. But the language itself is ... not good. But that's just, like, my opinion!

Pandas I also have really never clicked with. But I like python a lot more than R, and pandas basically works. For what it's worth, the polars api style is more my thing. But most of the data scientists I work with prefer the pandas style, :shrug:.

But I really like the dataframe part of Julia. It feels more "native" to Julia than pandas does to Python. More like data.table in R, but embedded in an (IMO) even better language than Python. The only issue is that Julia itself remains immature in a number of ways, and who knows whether it will ever overcome that. But I hope it does!


I sympathize with anyone who doesn't like R. Even as a statistics/math DSL it's really wonky.

But it's a lot more fun when you realize that it's a homoiconic array language with true lazily-evaluated F-exprs (not Rebol/Tcl strings).


I realized that (not in so many words...) pretty quickly and do not like it at all :)


Maybe give ibis with the duckdb backend a try, though personally I quite like polars. The devs are pretty quick to respond to issues overall.
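A minimal sketch of that combo (assuming ibis's memtable helper; DuckDB is ibis's default backend):

    import ibis
    import pandas as pd

    df = pd.DataFrame({"region": ["a", "a", "b"], "sales": [10, 20, 5]})

    t = ibis.memtable(df)
    expr = t.group_by("region").aggregate(total=t.sales.sum())
    print(expr.execute())  # compiled to SQL, run on DuckDB, returned as pandas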



