
Some more tips from someone who does this every day.

1) Be careful with CSV files and UNIX tools - most big CSV files with text fields have some subset of fields that are text quoted and character-escaped. This means that you might have "," in the middle of a string. Anything (like cut or awk) that depends on comma as a delimiter will not handle this situation well.

2) "cut" has shorter, easier to remember syntax than awk for selecting fields from a delimited file.

3) Did you know that you can do a database-style join directly in UNIX with common command line tools? See "join" - assumes your input files are sorted by join key.

4) As others have said - you almost inevitably want to run sort before you run uniq, since uniq only works on adjacent records.

5) sed doesn't get enough love: sed '1d' to delete the first line of a file. Useful for removing those pesky headers that interfere with later steps. Not to mention regex replacing, etc.

6) By the time you're doing most of this, you should probably be using python or R.
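
To make points 1 and 6 concrete, here is a minimal Python sketch. The sample data and the function names are made up for illustration. It shows a naive comma split (roughly what cut -d, -f2 does) breaking on a quoted field, while the standard csv module parses it correctly and skips the header along the way.

```python
import csv
import io

# Hypothetical sample data: the name field in the first record contains a quoted comma.
SAMPLE = 'id,name,city\n1,"Smith, John",Boston\n2,Jane Doe,Chicago\n'

def naive_second_field(text):
    """Split on every comma, ignoring quoting (what a plain delimiter split does)."""
    return [line.split(",")[1] for line in text.splitlines()[1:]]

def csv_second_field(text):
    """Parse with the csv module, which understands quoted fields."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row, similar to sed '1d'
    return [row[1] for row in reader]

if __name__ == "__main__":
    print(naive_second_field(SAMPLE))  # ['"Smith', 'Jane Doe']
    print(csv_second_field(SAMPLE))    # ['Smith, John', 'Jane Doe']
```

The naive version silently returns half of the quoted name, which is the kind of error that is easy to miss in a long pipeline.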




Actually I would say Perl is more appropriate. I went back to Perl after 4 years for this sort of task, as it has so many features built into the syntax. Plus it can be run as a one-liner.


I'm reminded of the old joke, "python is executable pseudocode, while perl is executable line noise."

But seriously, I've got some battle scars from the perl days, and hope not to revisit them. Honestly, there's very little I find I can do with perl and not python, and it's just as easy to express (if not quite as concise) and much simpler to maintain.

But, use the tool that works for you!


I use Python and Django most of the time, and it's true, you can do pretty much the same thing in each language. But for quick hacky stuff manipulating the filesystem a lot, Perl has many more features built into the language. Things like regex syntax, globbing directories, backticks to execute Unix commands, and the fact you can use it directly from the command line as a one-liner. You can do all these (except the last one?) in Python, but Perl is quicker.


>But for quick hacky stuff manipulating the filesystem a lot, Perl has many more features built into the language. Things like regex syntax, globbing directories, backticks to execute Unix commands

All good points.

>you can use it directly from the command line as a one-liner. You can do all these (except the last one?) in Python

You can use Python from the command line too, but Perl has more features for doing that, like the -n and -p flags. Then again, Python has the fileinput module. Here's an example:

http://jugad2.blogspot.in/2013/05/convert-multiple-text-file...
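
This is not the code from the linked post, just a minimal sketch of what the fileinput module gives you: one loop over several files, with the current file name and per-file line number available, a bit like perl -n. The function names, file names, and sample contents are hypothetical.

```python
import fileinput
import os
import tempfile

def grep_with_location(paths, needle):
    """Return 'file:lineno:text' for each line containing needle."""
    hits = []
    # fileinput.input() chains the given files into a single line stream.
    with fileinput.input(files=paths) as stream:
        for line in stream:
            if needle in line:
                name = os.path.basename(stream.filename())
                hits.append(f"{name}:{stream.filelineno()}:{line.rstrip()}")
    return hits

def demo():
    """Write two small throwaway files and search them."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.txt")
        b = os.path.join(tmp, "b.txt")
        with open(a, "w") as f:
            f.write("foo\nbar\n")
        with open(b, "w") as f:
            f.write("baz\nfoo bar\n")
        return grep_with_location([a, b], "bar")

if __name__ == "__main__":
    print("\n".join(demo()))
```

If you call fileinput.input() with no arguments, it reads the files named in sys.argv[1:], or stdin when there are none, which is what makes it handy for quick filter scripts.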


> you almost inevitably want to run sort before you run uniq

And then you don't actually want uniq anyway since sort has a -u switch that removes duplicate lines.


What if you want uniq -c? Is there any simple way to replicate that functionality that's better than sort | uniq -c?


Then you run uniq -c (which I do all the time).

But for the examples in the main article sort -u would be fine.
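
For anyone who has already moved to Python for this, a rough sketch of both variants follows. The function names and sample data are made up, and Python sorts strings by code point, so the ordering can differ from sort under some locales.

```python
from collections import Counter

def sort_uniq(lines):
    """Roughly sort -u: the distinct lines, sorted."""
    return sorted(set(lines))

def sort_uniq_c(lines):
    """Roughly sort | uniq -c: (count, line) pairs ordered by line."""
    counts = Counter(lines)
    return [(counts[line], line) for line in sorted(counts)]

if __name__ == "__main__":
    data = ["b", "a", "b", "c", "a", "b"]  # hypothetical input lines
    print(sort_uniq(data))    # ['a', 'b', 'c']
    print(sort_uniq_c(data))  # [(2, 'a'), (3, 'b'), (1, 'c')]
```

Counter does not need sorted input, so the sort here only affects the order of the output, not the correctness of the counts.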



