
Yeah, I stopped using pandas entirely for ETL for this exact reason. If you are trying to maintain the fidelity of the data while cleaning, automatic casting is awful. If the new backend prevents automatic casting, it might be worth reconsidering for me.
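For concreteness, here is a minimal sketch of the casting problem and how the pyarrow-backed nullable dtypes avoid it, assuming that's the "new backend" in question (pandas >= 2.0 with the pyarrow package installed):

    import pandas as pd

    # Classic numpy-backed behaviour: a single missing value silently
    # upcasts the whole integer column to float64, so IDs become 1.0, 2.0, ...
    df = pd.DataFrame({"id": [1, 2, None]})
    print(df["id"].dtype)  # float64

    # Arrow-backed nullable dtypes keep the declared integer type and
    # store the missing value as <NA> instead of coercing to float.
    s = pd.Series([1, 2, None], dtype="int64[pyarrow]")
    print(s.dtype)  # int64[pyarrow]

    # read_csv can also be told to use the pyarrow backend up front:
    # df = pd.read_csv("orders.csv", dtype_backend="pyarrow")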



What do you use instead?


We're using AWS Glue (basically PySpark) right now. I used standard Python before.

I've implemented a function for schema-based processing of JSON documents, for both vanilla Python and PySpark, that makes the process really easy. It takes a schema and a document and produces a list of flat dictionaries in Python, or a DataFrame in PySpark. The vanilla Python version streams nicely and keeps memory overhead low, so it was actually faster than the pandas-based workflows it replaced.
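Roughly, a schema-driven flattener like that can be sketched as below. The schema format, function names, and the "explode" convention here are hypothetical, not the actual implementation; it just illustrates the shape of the approach in vanilla Python, with the generator keeping memory usage per document low:

    from typing import Any, Dict, Iterator, Optional

    def _get_path(doc: Any, path: str) -> Any:
        """Walk a dotted path like 'order.id'; return None if any hop is missing."""
        current = doc
        for key in path.split("."):
            if not isinstance(current, dict) or key not in current:
                return None
            current = current[key]
        return current

    def flatten(doc: Dict[str, Any],
                schema: Dict[str, str],
                explode: Optional[str] = None) -> Iterator[Dict[str, Any]]:
        """Yield one flat dict per output row; 'explode' fans a list field out into rows."""
        if explode is None:
            yield {col: _get_path(doc, path) for col, path in schema.items()}
            return
        for item in _get_path(doc, explode) or []:
            yield {
                # Paths under the exploded list resolve against each item,
                # everything else resolves against the whole document.
                col: (_get_path(item, path[len(explode) + 1:])
                      if path.startswith(explode + ".")
                      else _get_path(doc, path))
                for col, path in schema.items()
            }

    # Example usage:
    doc = {"order": {"id": 7},
           "lines": [{"sku": "a1", "qty": 2}, {"sku": "b2", "qty": 1}]}
    schema = {"order_id": "order.id", "sku": "lines.sku", "qty": "lines.qty"}
    rows = list(flatten(doc, schema, explode="lines"))
    # [{'order_id': 7, 'sku': 'a1', 'qty': 2}, {'order_id': 7, 'sku': 'b2', 'qty': 1}]

For the PySpark side, the same schema can drive a select/explode over a DataFrame, so both code paths share one source of truth for the output columns.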



