
Would be interesting if you could expand. I've used R (data.table) extensively over the last few years for biostatistics at a research organization. I was able to get away with not learning the tidyverse and sticking to data.table. The main reason for choosing data.table was speed - I'm working with tens to hundreds of GB of data at once.


What's worked for me is reading Hadley Wickham's "Tidy Data" paper[0] and then applying the concepts with data.table. The speed is nice, but I really love what's possible with data.table syntax and how many packages work with it. That's as opposed to what many people have decided "tidy" means: non-standard evaluation and functions that take whole tables and symbols of column names instead of vectors.
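To make this concrete, here's a minimal sketch of applying the paper's core idea (one row per observation) with data.table's own reshaping verbs, using made-up data in the spirit of the paper's examples:

```r
library(data.table)

# Hypothetical wide table: one column per year, as in the "Tidy Data" examples
wide <- data.table(country = c("A", "B"),
                   `1999` = c(745, 37737),
                   `2000` = c(2666, 80488))

# melt() reshapes to tidy long form: one row per (country, year) observation
long <- melt(wide, id.vars = "country",
             variable.name = "year", value.name = "cases")

# dcast() goes back the other way
dcast(long, country ~ year, value.var = "cases")
```

Nothing tidyverse-specific is required; melt()/dcast() are data.table's fast implementations of the same reshape operations.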

[0]: https://vita.had.co.nz/papers/tidy-data.html


Compared to data.table, tidyverse offers significantly better readability and ergonomics in exchange for worse computational and memory efficiency, with the size of the performance gap ranging from negligible to catastrophic depending on the operation and your data volume. At that data volume, you're probably doing some things that would OOM or hang for days if you translated your data.table code to the corresponding tidyverse code.
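A big part of that gap is copy semantics. A rough sketch of the difference, with a toy table:

```r
library(data.table)

DT <- data.table(id = rep(1:3, each = 2), x = rnorm(6))

# := adds the column by reference - no copy of DT is made.
# dplyr::mutate(), by contrast, returns a modified copy of the table,
# which is what hurts at tens-to-hundreds-of-GB scale.
DT[, y := x * 2]

# Grouped aggregation; with a key set, grouping is essentially a
# contiguous scan rather than a hash/sort per call
setkey(DT, id)
DT[, .(mean_x = mean(x)), by = id]
```

On small data both approaches are fine; the by-reference update is what keeps memory flat when the table barely fits in RAM.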


dtplyr is an option as well; it lets you use tidyverse syntax with a data.table backend. Speed and syntax.
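A minimal sketch of how dtplyr is used, with a toy table (dtplyr translates dplyr verbs into a data.table expression and evaluates lazily):

```r
library(data.table)
library(dtplyr)
library(dplyr)

DT <- data.table(g = c("a", "a", "b"), x = 1:3)

# lazy_dt() wraps the data.table; the dplyr verbs build up a
# translation instead of executing immediately
result <- lazy_dt(DT) |>
  group_by(g) |>
  summarise(total = sum(x))

show_query(result)      # prints the generated data.table call
as.data.table(result)   # forces evaluation
```

Note the translation covers most common verbs, but hand-written data.table (especially := by reference) can still be faster for some operations.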

If you've learned data.table, however, it's better to just stay in data.table. Nothing in the tidyverse can touch the efficiency of data.table.




