
I have been using AWK at work and found it cumbersome for my use-case (run-once analysis/manipulation of 200GB csv datasets). I found out about Miller[1] a year or so back and have been using that instead. I don't know how it stacks up in terms of performance, but for my money, named fields and one-shot statistics are all I need.

For example, analysing the number of people per year over multiple differently formatted files is as easy as `mlr uniq -f pid,year then count -g year`. It has filtering, very extensive manipulation capabilities, and nice documentation.
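To make that concrete, here's a minimal sketch of what a full invocation looks like (the file name people.csv and its column layout are hypothetical, not from the original comment):

    # people.csv is assumed to have pid and year columns (among others)
    # --icsv: read CSV input; --opretty: pretty-print the output table
    mlr --icsv --opretty uniq -f pid,year then count -g year people.csv

The `uniq -f pid,year` step deduplicates on the (pid, year) pair so each person is counted at most once per year, and `count -g year` then tallies the remaining rows per year.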

John has also responded very quickly to feature requests I've made (so quickly that I haven't yet gotten around to using the new features).

[1]: https://miller.readthedocs.io/en/latest/#overview




+1 for Miller. It's a really slick tool for exploring large csv datasets. I've typically used it to do some prototyping and exploration of 200-500GB csv datasets before doing more hefty work in Java+PigLatin (our use-case is longer-term than a single analysis run, so that's the reason for moving beyond just Miller). It's great for getting a feel for the data and running some initial analysis before diving into the larger, more cumbersome system and tooling.
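As a rough sketch of that kind of initial exploration step, Miller's one-shot statistics verbs work well (the column name amount and the file name data.csv are made up for illustration):

    # quick per-year distribution summary of a numeric column
    # amount and data.csv are hypothetical placeholders
    mlr --icsv --opretty stats1 -a min,mean,max -f amount -g year data.csv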


This seems very interesting.

Making this kind of work reproducible and usable by multiple people has always been an annoyance in semi-ad-hoc data collection.



