
I was recently sorting a 10-million-line CSV by the second field, which was numerical. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.

One awk command later, my file was flipped. I ran the same exact sort command on it, but without specifying a field. It completed in 12 seconds.
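(For a plain two-field, comma-separated file with no quoted commas, the flip is a one-liner along these lines; input.csv and flipped.csv are placeholder names:

  # swap the two comma-separated fields (placeholder filenames)
  awk -F, 'BEGIN { OFS = "," } { print $2, $1 }' input.csv > flipped.csv

)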

Morals:

1. Small changes can have a 3+ orders of magnitude effect on performance

2. Use the Google; it's easier than understanding every tool on a deep enough level to figure this out yourself ;)



CSV files are extremely easy to import into Postgres, and 10M rows (assuming they're not very large) isn't much to compute, even on a 6- or 7-year-old laptop. Keep it in mind if you've got something slightly more complicated to analyse.
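(A minimal sketch from psql; the table name, column names, file name, and the assumption of a two-column CSV with a header row are all made up for illustration:

  -- hypothetical table matching a two-column CSV (names made up)
  CREATE TABLE mydata (label text, val numeric);
  -- \copy runs client-side, so the path is local to your machine
  \copy mydata FROM 'rows.csv' WITH (FORMAT csv, HEADER)
  -- sorting on a properly typed numeric column is then trivial
  SELECT * FROM mydata ORDER BY val;

)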


If SQL is your game but you don't want to get PG set up, try SQLite:

  wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv" rows.csv
  sqlite3
  .mode csv
  .import ./rows.csv newyorkdata
  SELECT *
  FROM newyorkdata
  ORDER BY `COUNT PARTICIPANTS`;


Nice! Talk about being a productive engineer! Kudos!


Or use csvkit ...


Oh nice, I hadn't seen this before, so a similar query would be a bit shorter!

  wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv" 
  csvsql --query "SELECT * FROM newyorkdata ORDER BY `COUNT PARTICIPANTS`" rows.csv > new.csv


What did you use to sort these 10 million lines?


sort in bash.

Specific command:

  LC_ALL=C sort -n -r filename > output


This is really interesting to me - I can't seem to reproduce this on a couple of different platforms. In the past (and just now) I've definitely seen 100x differences with LC_ALL=C, but I'm sure you used that on both.

How large was the 10-million-line file, and do you know whether sort was paging to disk?


About 300MB. I just tried to replicate it using random data and failed; I'll look more later. My data is power-law distributed, which might make a difference.

Used these commands:

  cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 10000000 > text

  cat /dev/urandom | od -DAn | tr -s ' ' '\n' | head -10000000 | sed -e 's/$/\/9.0/' > num

  bc -l num < /dev/null > num2

(that's to create decimal numbers; the < /dev/null keeps bc from waiting on stdin after it finishes the file)

  paste -d, text num2 > textfirst

  paste -d, num2 text > numfirst

  time LC_ALL=C sort -n -r numfirst > output

  real    0m10.712s
  user    0m21.959s
  sys     0m1.644s

  time LC_ALL=C sort --field-separator=',' -k2 -n -r textfirst > output

  real    0m9.638s
  user    0m21.940s
  sys     0m1.293s


But what did you use to sort by the second field, in your original setup?


  LC_ALL=C sort --field-separator=',' -k2 -n


This uses field 2 through the end of the line as the sort key; would

  -k2,2

have been faster, since it limits the key to just the second field?


There were only two fields, so I expect not.
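(For reference, the restricted-key variant of the earlier test would look something like this, with the same filenames as above; with only two fields, the -k2 key already ends at the second field, so both forms should compare the same keys:

  # bound the sort key to field 2 only, instead of field 2 through end of line
  LC_ALL=C sort --field-separator=',' -k2,2 -n -r textfirst > output

)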



