
Strings as objects and integers turning into floats when NaNs are introduced have been a much bigger annoyance to me than they ought to be.

I'm excited to try out the new pyarrow dtypes, but it also sounds confusing that there are now 2 classes of types
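For anyone who hasn't hit this, here is a minimal sketch of the integer-to-float behavior, assuming a reasonably recent pandas. The variable names are just for illustration. It uses the nullable `Int64` extension dtype for contrast because that works without pyarrow installed; with pandas 2.0+ and pyarrow available, `int64[pyarrow]` is the Arrow-backed counterpart.

```python
import pandas as pd

# A plain NumPy-backed integer column.
ints = pd.Series([1, 2, 3], dtype="int64")

# Reindexing adds a row with no value. NumPy int64 cannot hold a missing
# value, so pandas silently upcasts the whole column to float64.
upcast = ints.reindex([0, 1, 2, 3])
print(upcast.dtype)  # float64

# The nullable extension dtype keeps integers and stores the gap as <NA>.
nullable = pd.Series([1, 2, 3], dtype="Int64").reindex([0, 1, 2, 3])
print(nullable.dtype)  # Int64
```

Having both the NumPy-backed and the nullable/Arrow-backed dtypes side by side is the "2 classes of types" situation.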




Yeah, I stopped using pandas entirely for ETL for this exact reason. If you are trying to maintain the fidelity of the data while cleaning, automatic casting is awful. If the new backend prevents automatic casting, it might be worth reconsidering for me.


What do you use instead?


We're using AWS Glue (basically PySpark) right now. I used standard Python before.

I've implemented a function for schema-based processing of JSON documents, for both vanilla Python and PySpark, that makes the process really easy. It takes a schema and a document and produces a list of flat dictionaries for Python or a data frame for PySpark. Vanilla Python is really streamable and keeps memory overhead low, so it was actually faster than the pandas-based workflows that it replaced.
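To give a rough idea of the vanilla Python side, here is a simplified sketch, not the actual implementation. The schema format (output column mapped to a dotted path and a caster), the names `get_path` and `flatten`, and the `explode` parameter are all assumptions made up for this example.

```python
from typing import Any, Callable, Dict, Iterator, Optional, Tuple

# Hypothetical schema format: output column -> (dotted path, caster).
Schema = Dict[str, Tuple[str, Callable[[Any], Any]]]


def get_path(doc: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; None if any key is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def flatten(doc: dict, schema: Schema, explode: Optional[str] = None) -> Iterator[dict]:
    """Yield flat dicts; if `explode` names a list field, one row per element."""
    items = get_path(doc, explode) if explode else None
    if not isinstance(items, list):
        items = [None]  # no list to explode: emit a single row
    for item in items:
        row = {}
        for column, (path, cast) in schema.items():
            if explode and path.startswith(explode + "."):
                # Path points inside the exploded list element.
                value = get_path(item, path[len(explode) + 1:])
            else:
                value = get_path(doc, path)
            # Missing values stay None; nothing is cast behind your back.
            row[column] = None if value is None else cast(value)
        yield row


doc = {"order": {"id": "17", "lines": [{"sku": "A", "qty": "2"},
                                       {"sku": "B", "qty": None}]}}
schema = {
    "order_id": ("order.id", int),
    "sku": ("order.lines.sku", str),
    "qty": ("order.lines.qty", int),
}
rows = list(flatten(doc, schema, explode="order.lines"))
print(rows)
```

Because `flatten` is a generator, rows can be written out one document at a time, which is where the low memory overhead comes from. A missing `qty` stays `None` next to integer values instead of turning the column into floats.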


>Strings as objects and integers turning into floats when NaNs are introduced have been a much bigger annoyance to me than they ought to be.

Nah, you are right to be annoyed. When I am writing unit tests, it is especially annoying to have to fix the types.



