strings as objects and integers turning into floats when NaNs are introduced hav...

agent281 · on April 3, 2023

Yeah, I stopped using pandas entirely for ETL for this exact reason. If you are trying to maintain the fidelity of the data while cleaning, automatic casting is awful. If the new backend prevents automatic casting, it might be worth reconsidering for me.

wodenokoto · on April 3, 2023

What do you use instead?

agent281 · on April 3, 2023

We're using AWS Glue (basically PySpark) right now. I used standard Python before.

I've implemented a function for schema-based processing of JSON documents, for both vanilla Python and PySpark, that makes the process really easy. It takes a schema and a document and produces a list of flat dictionaries for Python or a data frame for PySpark. Vanilla Python streams well and keeps memory overhead low, so it was actually faster than the pandas-based workflows it replaced.
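
For anyone wondering what that looks like, here is a minimal sketch of the vanilla Python side. This is not the actual function: the schema format (output column mapped to a dotted path and a caster), the `rows_path` argument, and the names `get_path` and `flatten` are all made up for illustration.

```python
import json
from typing import Any, Callable, Iterator

# Hypothetical schema format: output column -> (dotted path, caster).
Schema = dict[str, tuple[str, Callable[[Any], Any]]]


def get_path(node: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def flatten(schema: Schema, document: dict, rows_path: str) -> Iterator[dict]:
    """Yield one flat dict per item in the list found at rows_path."""
    for item in get_path(document, rows_path) or []:
        row = {}
        for column, (path, cast) in schema.items():
            value = get_path(item, path)
            # Missing values stay None; nothing is silently widened to float.
            row[column] = None if value is None else cast(value)
        yield row


schema = {
    "order_id": ("id", int),
    "city": ("customer.address.city", str),
    "total": ("total", float),
}
doc = json.loads(
    '{"orders": ['
    '{"id": 1, "customer": {"address": {"city": "Oslo"}}, "total": "9.50"},'
    '{"id": 2, "customer": {}}'
    ']}'
)
rows = list(flatten(schema, doc, "orders"))
```

Because `flatten` is a generator, rows can be written out one at a time instead of being held in memory, and an integer column with missing values stays a mix of `int` and `None`.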

hospitalJail · on April 3, 2023

>strings as objects and integers turning into floats when NaNs are introduced have been a much bigger annoyance to me, than it ought to.

Nah, you're right to be annoyed. When I'm writing unit tests, it is especially annoying to have to fix the types.
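
A small sketch of the behaviour being complained about, assuming pandas is installed. The variable names are illustrative, and the nullable `Int64` dtype shown here is the older extension dtype, not the new backend discussed in this thread.

```python
import pandas as pd

# A plain int64 column.
ids = pd.Series([1, 2, 3], dtype="int64")

# Reindexing adds a label with no value; pandas fills it with NaN
# and upcasts the whole column to float64.
widened = ids.reindex(range(4))

# The nullable extension dtype keeps integers and marks the gap as <NA>.
nullable = pd.Series([1, 2, 3], dtype="Int64").reindex(range(4))

print(widened.dtype)   # float64
print(nullable.dtype)  # Int64
```

In a unit test this means an expected frame built from plain Python ints no longer matches after a single missing value appears, unless the dtype is cast back or declared as nullable up front.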