
I was recently sorting a 10-million-line CSV by the second field, which was numerical. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.

One awk command later, my file was flipped. I ran the same exact sort command on it, but without specifying a field. It completed in 12 seconds.
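(For a plain two-field, comma-separated file with no quoted commas, the flip is a one-liner along these lines; input.csv and flipped.csv are placeholder names:

  # swap the two comma-separated fields (placeholder filenames)
  awk -F, 'BEGIN { OFS = "," } { print $2, $1 }' input.csv > flipped.csv

)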

Morals:

1. Small changes can have a 3+ orders of magnitude effect on performance

2. Use the Google; it's easier than understanding every tool on a deep enough level to figure this out yourself ;)



CSV files are extremely easy to import into Postgres, and 10M rows (assuming they're not very large) isn't much to compute, even on a 6- or 7-year-old laptop. Keep it in mind if you've got something slightly more complicated to analyse.
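(A minimal sketch from psql; the table name, column names, file name, and the assumption of a two-column CSV with a header row are all made up for illustration:

  -- hypothetical table matching a two-column CSV (names made up)
  CREATE TABLE mydata (label text, val numeric);
  -- \copy runs client-side, so the path is local to your machine
  \copy mydata FROM 'rows.csv' WITH (FORMAT csv, HEADER)
  -- sorting on a properly typed numeric column is then trivial
  SELECT * FROM mydata ORDER BY val;

)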


If SQL is your game but you don't want to get PG set up, try SQLite:

  wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv" rows.csv
  sqlite3
  .mode csv
  .import ./rows.csv newyorkdata
  SELECT *
  FROM newyorkdata
  ORDER BY `COUNT PARTICIPANTS`;


Nice! Talk about being a productive engineer! Kudos!


Or use csvkit ...


Oh nice, I hadn't seen this before, so a similar query would be a bit shorter!

  wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv" 
  csvsql --query "SELECT * FROM newyorkdata ORDER BY `COUNT PARTICIPANTS`" rows.csv > new.csv


What did you use to sort these 10 million lines?


sort in bash.

Specific command:

  LC_ALL=C sort -n -r filename > output


This is really interesting to me - I can't seem to reproduce this on a couple of different platforms. In the past (and just now) I've definitely seen 100x differences with LC_ALL=C, but I'm sure you used that on both.

How large was the 10-million-line file, and do you know whether sort was paging to disk?


About 300MB. I just tried to replicate it using random data and failed; I'll look more later. My data is power-law distributed, which might make a difference.

Used these commands:

  cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 10000000 > text

  cat /dev/urandom | od -DAn | tr -s ' ' '\n' | head -10000000 | sed -e 's/$/\/9.0/' > num

  bc -l num < /dev/null > num2

(that's to create decimal numbers; the < /dev/null keeps bc from waiting on stdin after it finishes the file)

  paste -d, text num2 > textfirst

  paste -d, num2 text > numfirst

  time LC_ALL=C sort -n -r numfirst > output

  real    0m10.712s
  user    0m21.959s
  sys     0m1.644s

  time LC_ALL=C sort --field-separator=',' -k2 -n -r textfirst > output

  real    0m9.638s
  user    0m21.940s
  sys     0m1.293s


But what did you use to sort by the second field, in your original setup?


  LC_ALL=C sort --field-separator=',' -k2 -n


This uses field 2 through the end of the line as the sort key; would

  -k2,2

have been faster, since it limits the key to just the second field?


There were only two fields, so I expect not.
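(For reference, the restricted-key variant of the earlier test would look something like this, with the same filenames as above; with only two fields, the -k2 key already ends at the second field, so both forms should compare the same keys:

  # bound the sort key to field 2 only, instead of field 2 through end of line
  LC_ALL=C sort --field-separator=',' -k2,2 -n -r textfirst > output

)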



