Recently I was sorting a 10-million-line CSV by the second field, which was numeric. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.
One awk command later, my file was flipped. I ran the same exact sort command on it, but without specifying a field. It completed in 12 seconds.
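Roughly what that looked like (file names are stand-ins, and this assumes a plain comma-separated file with no quoted fields):

# before: key extraction on field 2 for every comparison (slow)
sort -t, -k2,2n data.csv > sorted.csv
# flip: swap fields 1 and 2 so the numeric key leads each line
awk -F, -v OFS=, '{t=$1; $1=$2; $2=t; print}' data.csv > flipped.csv
# after: whole-line numeric sort, no per-comparison key extraction (fast)
sort -n flipped.csv > sorted.csv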
Morals:
1. Small changes can have a 3+ orders of magnitude effect on performance
2. Use the Google; it's easier than understanding every tool on a deep enough level to figure this out yourself ;)
CSV files are extremely easy to import into Postgres, and 10M rows (assuming they're not very wide) isn't much to compute, even on a 6 or 7 year old laptop. Keep it in mind if you've got something slightly more complicated to analyse.
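Something like this with psql (table and column names here are made up; assumes the CSV has a header row and you have a local Postgres to write to):

# made-up schema; adjust to match the CSV's actual columns
psql -c "CREATE TABLE nycdata (event text, count_participants integer);"
# \copy streams the file from the client side, no server file access needed
psql -c "\copy nycdata FROM 'rows.csv' WITH (FORMAT csv, HEADER true)"
psql -c "SELECT * FROM nycdata ORDER BY count_participants LIMIT 10;"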
Oh nice, I hadn't seen this before; a similar query would be a bit shorter!
wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv"
csvsql --tables newyorkdata --query 'SELECT * FROM newyorkdata ORDER BY "COUNT PARTICIPANTS"' rows.csv > new.csv
(single quotes keep the shell from running the backticks as a command substitution, and --tables makes the table name match the one in the query)
This is really interesting to me - I can't seem to reproduce it on a couple of different platforms. In the past (and just now) I've definitely seen 100x differences with LC_ALL=C, but I assume you used that on both runs.
How large was the 10-million-line file, and do you know whether sort was paging to disk?
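For anyone else poking at this, these are the two knobs I'd compare (GNU sort; the buffer size is just a guess at what avoids temp-file spills):

# byte-wise collation, often a big win over locale-aware comparison
LC_ALL=C sort -t, -k2,2n data.csv > sorted.csv
# bigger in-memory buffer and an explicit temp dir, to check whether it's spilling
LC_ALL=C sort -S 2G -T /tmp -t, -k2,2n data.csv > sorted.csv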