How are people managing the existence of data frame APIs like pandas/polars with SQL engines like BigQuery, Snowflake, and DuckDB?
I checked most solutions and I am sticking with plain old SQL in triple-quotes in Jupyter, e.g. [1]. I don't need auto-completion in SQL and I don't really need syntax highlighting. It's also very nice to have the combination of f-string/variable substitution and SQL. That said, my SQL needs are very basic.
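Roughly, the pattern looks like this (a minimal sketch only; the DuckDB connection, table, and column names here are made-up placeholders, not from the thread):

    import duckdb

    min_date = "2024-01-01"   # plain Python variable, substituted into the query
    table = "events"          # hypothetical table name

    con = duckdb.connect("analytics.duckdb")

    # Triple-quoted f-string: readable SQL plus Python variable substitution.
    # (For untrusted inputs you'd want parameterized queries instead.)
    query = f"""
        SELECT user_id, COUNT(*) AS n_events
        FROM {table}
        WHERE event_date >= '{min_date}'
        GROUP BY user_id
        ORDER BY n_events DESC
    """

    df = con.execute(query).df()   # results land in a pandas DataFrame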
Ponder was created by the folks who made Modin (a scale-out version of pandas with API compatibility as a goal; it can use Dask or Ray as a backend). Ponder is the enterprise version that runs on Snowflake and BigQuery. Again, same goal: API compatibility with pandas. You can scale out your pandas workflow by changing the import and leaving the pandas code as-is.
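For illustration, the "change the import" idea looks roughly like this (the file and column names are placeholders):

    # import pandas as pd         # original, single-core pandas
    import modin.pandas as pd     # drop-in replacement; same API, distributed on Ray/Dask

    # The pandas code itself stays unchanged.
    df = pd.read_csv("large_file.csv")
    summary = df.groupby("category").agg({"amount": "sum"})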
As a fellow evangelist of the SQL + Pandas hybrid workflow, I’m a happy camper with Pandas’ built-in read_sql_query and to_sql.
The only big pain points are having to ship around boilerplate to construct SQLAlchemy create_engine URIs, and the performance limitations of SQLAlchemy's inserts (if you're moving anything larger than a few gigs, it typically pays to ditch to_sql and write a db-specific bulk insert process instead).
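Something like this, sketched out (the connection string, table, and column names are placeholders, not anything from the thread):

    import pandas as pd
    from sqlalchemy import create_engine

    # The boilerplate mentioned above: building the engine URI by hand.
    engine = create_engine("postgresql+psycopg2://user:password@host:5432/dbname")

    # SQL does the heavy lifting...
    df = pd.read_sql_query("SELECT id, amount FROM orders WHERE amount > 100", engine)

    # ...pandas handles the last-mile transformations.
    df["amount_with_tax"] = df["amount"] * 1.1

    # Fine for small/medium results; for multi-gigabyte loads a db-specific
    # bulk path (e.g. COPY) is usually much faster than to_sql.
    df.to_sql("orders_enriched", engine, if_exists="replace", index=False)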
Most of my notebooks are a mix of SQL and Python: SQL for most of the processing, dump the results into a pandas DataFrame (via https://github.com/ploomber/jupysql), and then use Python for operations that are difficult to express in SQL (or that I don't know how to do in SQL), so I end up with 80% SQL, 20% Python (rough sketch of the flow below).
Unsure if this is the best workflow, but it's the most efficient one I've come up with.
Disclaimer: my team develops JupySQL.
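For anyone curious, the flow looks roughly like this (a sketch assuming a DuckDB connection; the table and column names are invented, and each %%sql block would be its own notebook cell):

    %load_ext sql
    %sql duckdb://

    %%sql result <<
    SELECT user_id, SUM(amount) AS total
    FROM orders
    GROUP BY user_id

    # Back in Python: hand the result to pandas for the remaining 20%.
    df = result.DataFrame()
    df["total_zscore"] = (df["total"] - df["total"].mean()) / df["total"].std()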