
Yeah, I stopped using pandas entirely for ETL for this exact reason. If you are trying to maintain the fidelity of the data while cleaning, automatic casting is awful. If the new backend prevents automatic casting, it might be worth reconsidering for me.
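For concreteness, here is a minimal sketch of the casting problem and how the pyarrow-backed nullable dtypes avoid it, assuming that's the "new backend" in question (pandas >= 2.0 with the pyarrow package installed):

    import pandas as pd

    # Classic numpy-backed behaviour: a single missing value silently
    # upcasts the whole integer column to float64, so IDs become 1.0, 2.0, ...
    df = pd.DataFrame({"id": [1, 2, None]})
    print(df["id"].dtype)  # float64

    # Arrow-backed nullable dtypes keep the declared integer type and
    # store the missing value as <NA> instead of coercing to float.
    s = pd.Series([1, 2, None], dtype="int64[pyarrow]")
    print(s.dtype)  # int64[pyarrow]

    # read_csv can also be told to use the pyarrow backend up front:
    # df = pd.read_csv("orders.csv", dtype_backend="pyarrow")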



What do you use instead?


We're using AWS Glue (basically PySpark) right now. I used standard Python before.

I've implemented a function for schema-based processing of JSON documents, for both vanilla Python and PySpark, that makes the process really easy. It takes a schema and a document and produces a list of flat dictionaries in Python, or a DataFrame in PySpark. The vanilla Python version streams nicely and keeps memory overhead low, so it was actually faster than the pandas-based workflows it replaced.
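Roughly, a schema-driven flattener like that can be sketched as below. The schema format, function names, and the "explode" convention here are hypothetical, not the actual implementation; it just illustrates the shape of the approach in vanilla Python, with the generator keeping memory usage per document low:

    from typing import Any, Dict, Iterator, Optional

    def _get_path(doc: Any, path: str) -> Any:
        """Walk a dotted path like 'order.id'; return None if any hop is missing."""
        current = doc
        for key in path.split("."):
            if not isinstance(current, dict) or key not in current:
                return None
            current = current[key]
        return current

    def flatten(doc: Dict[str, Any],
                schema: Dict[str, str],
                explode: Optional[str] = None) -> Iterator[Dict[str, Any]]:
        """Yield one flat dict per output row; 'explode' fans a list field out into rows."""
        if explode is None:
            yield {col: _get_path(doc, path) for col, path in schema.items()}
            return
        for item in _get_path(doc, explode) or []:
            yield {
                # Paths under the exploded list resolve against each item,
                # everything else resolves against the whole document.
                col: (_get_path(item, path[len(explode) + 1:])
                      if path.startswith(explode + ".")
                      else _get_path(doc, path))
                for col, path in schema.items()
            }

    # Example usage:
    doc = {"order": {"id": 7},
           "lines": [{"sku": "a1", "qty": 2}, {"sku": "b2", "qty": 1}]}
    schema = {"order_id": "order.id", "sku": "lines.sku", "qty": "lines.qty"}
    rows = list(flatten(doc, schema, explode="lines"))
    # [{'order_id': 7, 'sku': 'a1', 'qty': 2}, {'order_id': 7, 'sku': 'b2', 'qty': 1}]

For the PySpark side, the same schema can drive a select/explode over a DataFrame, so both code paths share one source of truth for the output columns.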



