"While dealing with big genetic data sets I often got stuck with limitation of programming languages in terms of reading big files."
Hate to sound like Steve-Jobs here, but: "You're using it wrong."
Let me elaborate. If you're coming across limitations of "too-big" or "too-long" in your language of choice: Then you're just a few searches away from both being enlightened on how to solve your task at hand and on how your language works. Both things that will prevent you from being hindered next time around when you have to do a similar big-data job.
Perhaps you are more comfortable using pre-defined lego-blocks to build your logic. Perhaps you understand the unix commands better than you do your chosen language. But understand that programming is the same, just in a different conceptual/knowledge space. And remember, always use the right tool for the job!
(I use Unix commands daily as they're quick/dirty in a jiffy, but for complex tasks I am more productive solving the problem in a language I am comfortable in instead of searching through man pages for obscure flags/functionality)
Yeah, I think calling him the "unenlightened" one is pretty off base here.
For performing the tasks outlined by his examples, Unix utilities are easier for the user and run faster than code you'd write yourself in a general-purpose programming language, unless you put in the time to tune your implementation.
One could rebuild AWK in C and get similar performance, but why not just use some extremely simple AWK? And is anybody going to replicate the amazing speed of grep without a huge time investment? [1]
There is a right tool for the job, and having seen dozens of programmers be exposed to new big data sets, I can tell you that the ones who become productive quickly and stay more productive are the ones who adopt whatever tool is best for the job, not the ones that stick to their one favorite programming language. In fact, a good sign of somebody who will quickly fail is someone who says "forget those Unix tools, I'm just going to write this in X".
This is one area where I wish the Unix philosophy (reuse of tools) was taken a bit further. To me, every command should be callable as a C library function. That way you wouldn't have to parse the human-readable output through a pipe. Beyond that, every command should offer both human-readable and machine-readable output. For example, I would love to be able to call "ps" from another script and easily select specific columns from XML or JSON output.
Shell scripts can be pretty powerful if you know what you're doing, but I do agree that sometimes the shell script paradigm can be more of a hurdle than a help.
However, your point about every command being callable as a C library is kind of possible already. Some commands do have native language libraries (eg libcurl), but you could also fork out to those ELFs if you're feeling really brave (though in all practicality it's little worse than writing a shell script to begin with). In fact there's times I've been known to cheat with Perl and run (for example):
(my $hostname = `hostname`) =~ s/\n//g;
because it's quicker and easier to throw together than using the proper Perl libraries (yeah, it's pretty nasty from an academic perspective, but the additional footprint is minimal while the development time saved is significant).
Of course, any such code that's used regularly and/or depended on will be cleaned up as and when I have the time.
As for your XML or JSON parsing; the same theory as above could be applied:
use JSON::Parse 'parse_json';
my $json = `curl --silent http://birthdays.com/myfriends.json`;
my $bdays = parse_json($json);
print "derekp7's birthday is $bdays{derekp7}";
Obviously these aren't best practices, but if it's only running locally (i.e. this isn't part of a CGI script or similar that's web accessible) and gets the job done in a hurry, then I can't see why you shouldn't use it for ad hoc reporting.
Most scripting languages aren't multithreaded, and some aren't pipeline oriented by default.
For example, working with file lines naively in Ruby means reading the whole lot into a giant array and doing transformations an array at a time, rather than in a streaming fashion.
The shell gives you fairly safe concurrency and streaming for free.
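For example, in a pipeline like the one below (entirely made up), every stage runs as its own process, and the kernel streams the data between them with back-pressure, so nothing has to hold the whole file in memory:

# each stage is a separate process; data flows through the pipes as it is produced
zcat big.log.gz | grep ERROR | awk '{print $5}' | sort | uniq -c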
Personally, if it's a complex task, I generally write a tool such that it can be put into a shell pipeline.
Knowing the command line well - so that you don't often have to look up man pages for obscure flags / functionality - has its own rewards, as these commands turn into something you use all the time in the terminal. Rather than spending a few minutes developing a script in an editor, you can incrementally build a pipeline over a few seconds. Doing your script in a REPL is a better approximation, but it's a bit less immediate.
You don't have to read all of a file into memory in Ruby. There are a number of facilities for reading only a portion of a file, readpartial[1] for example. Additionally, you have access to all of the native pipe[2] functionality as well. There are plenty of reasons to favor shell tools over Ruby, but those aren't some of them.
the problem with processing big files with ruby (in my humble experience) is usually that it's still slow enough that "preprocessing with grep&uniq" is worthwhile.
Standard grep is much faster on multi-gigabyte files than anything you can figure out how to do in your pet language. By the time you get close to matching grep, you would have reimplemented most of grep, in half-assed fashion at that.
Your delusion is assuming standard command line tools are simple in function because they have a simple interface that Average Joe can use.
ag is often faster when you're using it interactively, replacing "grep -r" (in particular in version controlled dirs). It's also faster in the sense that for interactive use it will often DWYM.
But it has too many weird quirks for it to replace grep for data munging. E.g.
$ ag verb fullform_nn.txt >/tmp/verbs
ERR: Too many matches in fullform_nn.txt. Skipping the rest of this file.
Man ag says there's a --max-count option. Let's try that.
$ grep -c verb fullform_nn.txt
206077
$ ag --max-count 206077 verb fullform_nn.txt >/tmp/verbs
ERR: Too many matches in fullform_nn.txt. Skipping the rest of this file.
Wtf? (and running those two commands with "time" gave ag user 0m0.770s while grep had user 0m0.057s)
I have never used ag, but in most instances where people think they've made a faster grep, it's because it doesn't handle multibyte encodings correctly.
I would allow that for some definition of "too-big".
I wrote a distributed grep impl a few years back to grep my logs and collect output to a central machine (a vague "how many machines had this error" job).
The central orchestration was easy in python, but implementing
zgrep | awk | sort | uniq -c | wc -l
takes way more code in Python, and runs way slower, than just doing it with the shell (zgrep is awesome for .gz logs).
On the other hand, the shell-based coordinator was so much harder to get right with pdsh that I reverted to using paramiko and Python threadpools.
Unix tools are extremely composable and present in nearly every machine with the standard behaviour.
> Hate to sound like Steve-Jobs here, but: "You're using it wrong."
I don't quite agree — say this individual needs to sort a file by two columns. Should they really load everything into memory to call Python's sorted()? With large genomics datasets this isn't possible. Trying to reimplement sort's on-disk merge sort would be unnecessary and treacherous.
It's easy to forget how much engineering went into these core utilities, and that engineering is exactly what you're leaning on when working with big files.
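For the two-column case, a single sort invocation already handles the on-disk merge for you; a quick sketch (the delimiter and column choices are just illustrative):

# external merge sort by column 1 (text), then column 2 (numeric); spills to temp files as needed
sort -t$'\t' -k1,1 -k2,2n big_table.tsv > big_table.sorted.tsv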
It's not hard to write an on-disk merge sort using Python... it just may not be that fast.
But really, as I'm sure you know, for genome-scale datasets, the key word is streaming. Disk IO is a major bottleneck. If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest. Or, you'd be calculating some other kind of non-trivial summary statistics. In both of these cases, you'd need to use some kind of custom program. But you should still be operating on the stream, not the entire dataset.
(Off the top of my head I can think of only a few instances where you'd need to operate on a column as opposed to a row in genome data - multiple testing correction being the main one)
If you need to sort by two columns, yes, by all means use "sort". It's about as fast as you are going to get. But for "exploratory analysis" on genomic data, you'd better have a really good reason (or small dataset) to use these tools.
> If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest.
For repeated queries, this isn't efficient. This is why we have indexed, sorted BAM files compressed with BGZF (and tabix, which uses the same ideas). Many queries in genomics are with respect to position, and this is O(1) with a sorted, indexed file and O(n) with streaming. Streaming also involves reading and uncompressing the entire file from disk, whereas accessing entries from an indexed, sorted file involves a seek() to the correct offset and decompressing only that particular block, which is far more efficient. I definitely agree that streaming is great, but sorting data is an essential trick in working with large genomics data sets.
I'm well aware of BAM and BGZF. I've even written a parser or two (yay double encoded binary formats). I really like the BGZF format and think it doesn't get used enough outside of sequencing. It's basically gzip frames with enough information to allow random access within the compressed stream. And tabix is a great way to efficiently index otherwise unindexable text files.
However, these are all binary formats. I specifically said that you shouldn't sort genome files in text format. Because while text is easy to parse, binary is faster for this. You aren't going to use any of the standard unix tools once you've crossed over into binary formats. And so you are stuck using domain specific tools. Stuff like http://ngsutils.org (shameless plug).
I have seen people write tools for parsing SAM files using unix core utils, but they are always orders of magnitude slower than a good BAM-specific tool (in almost any language).
Just to point out, "big genetic datasets" can be hundreds of gigabytes to terabytes, sizes which are not possible to process easily in any language. A highly optimized C program like uniq might be the fastest option short of a distributed system.
When you have to deal with (genetic data) files of a few GB on a daily basis, I don't think using Python, R or databases is a good idea for basic data exploration.
As someone who deals with large datasets on a database + Python daily, I'm not quite sure what you mean. You'll have to explain to me what "not a good idea" means here, or what counts as "basic data exploring".
Consider I get 10 files of 3 GB each every week, which I am supposed to filter on a certain column using a reference index and forward to my colleague. Before filtering I also want to check what the file looks like: column names, first few records etc.
I can use something like the following to explore a few rows and columns.
$ awk '{print $1,$3,$5}' file | head -10
And then I can use something like sed with the reference index to filter the file. Since I plan to repeat this with different files, databases would be time consuming (even if I automate loading every file and querying). Due to the file size, options like R or Python would be slower than unix commands. I can also save the set of commands as a script and share/run it whenever I need it.
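For the filtering step, the general idea is to read the reference index into a set, then stream the big file against it. A rough awk sketch (the column and file names are just placeholders for whatever your data uses):

# first file builds the set of reference IDs (column 1); rows of the second file are kept if column 1 matches
awk 'NR==FNR {keep[$1]; next} $1 in keep' reference_ids.txt data.tsv > filtered.tsv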
If there is a better way I would be happy to learn.
I think the gain you're seeing there is because it's quicker for you to do quick, dirty ad hoc work with the shell than it is to write custom python for each file. Which totally makes sense, the work's ad hoc so use an ad hoc tool. Python being slow and grep being a marvel of optimization doesn't really matter, here, compared to the dev time you're saving.
I have been doing Python for the last few years, but went back to Perl for this sort of thing recently. You can start with a one liner, and if it gets complicated, just turn it into a proper script, alongside the Unix commands mentioned. It's just faster when you don't know what you are dealing with yet.
Some more tips from someone who does this every day.
1) Be careful with CSV files and UNIX tools - most big CSV files with text fields have some subset of fields that are text quoted and character-escaped. This means that you might have "," in the middle of a string. Anything (like cut or awk) that depends on comma as a delimiter will not handle this situation well.
2) "cut" has shorter, easier to remember syntax than awk for selecting fields from a delimited file.
3) Did you know that you can do a database-style join directly in UNIX with common command line tools? See "join" - it assumes your input files are sorted by the join key (quick sketch after this list).
4) As others have said - you almost inevitably want to run sort before you run uniq, since uniq only works on adjacent records.
5) sed doesn't get enough love: sed '1d' to delete the first line of a file. Useful for removing those pesky headers that interfere with later steps. Not to mention regex replacing, etc.
6) By the time you're doing most of this, you should probably be using python or R.
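A minimal sketch for point 3, with made-up file names and a comma delimiter:

# inner join two CSVs on their first column; join requires both inputs sorted on the key
join -t, -1 1 -2 1 <(sort -t, -k1,1 orders.csv) <(sort -t, -k1,1 customers.csv)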
Actually I would say Perl is more appropriate. I went back to Perl after 4 years for this sort of task, as it has so many features built into the syntax. Plus it can be run as a one liner.
I'm reminded of the old joke, "python is executable pseudocode, while perl is executable line noise."
But seriously, I've got some battle scars from the perl days, and hope not to revisit them. Honestly, there's very little I find I can do with perl and not python, and it's just as easy to express (if not quite as concise) and much simpler to maintain.
I use Python and Django most of the time, and it's true, you can do pretty much the same thing in each language. But for quick hacky stuff manipulating the filesystem a lot, Perl has many more features built into the language. Things like regex syntax, globbing directories, back ticks to execute Unix commands, and the fact you can use it directly from the command line as a one liner. You can do all these (except the last one?) in Python, but Perl is quicker.
>But for quick hacky stuff manipulating the filesystem a lot, Perl has many more features built into the language. Things like regex syntax, globbing directories, back ticks to execute Unix commands
All good points.
>you can use it directly from the command line as a one liner. You can do all these (except the last one?) in Python
You can use Python from the command line too, but Perl has more features for doing that, like the -n and -p flags. Then again, Python has the fileinput module. Here's an example:
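(A rough sketch; the filter and file name are made up, just to show fileinput reading whatever files are named on the command line, or stdin if none are given:)

python3 -c '
import fileinput
for line in fileinput.input():   # files from argv, or stdin if none given
    if "ERROR" in line:          # roughly perl -ne "print if /ERROR/"
        print(line, end="")
' server.log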
>> If we don't want new file we can redirect the output to same file which will overwrite original file
You need to be a little careful with that. If you do:
uniq -u movies.csv > movies.csv
The shell will first open movies.csv for writing (the redirect part) then launch the uniq command connecting stdout to the now emptied movies.csv.
Of course when uniq opens movies.csv for consumption, it'll already be empty. There will be no work to do.
There's a couple of options to deal with this, but the temporary intermediate file is my preference provided there's sufficient space - it's easily understood; if someone else comes across the construct in your script, they'll grok it.
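i.e. something like (file names just illustrative):

uniq -u movies.csv > movies.csv.tmp && mv movies.csv.tmp movies.csv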
sponge is cool. But on debian/ubuntu, it's packaged up in moreutils, which includes a few helpful tools. However a programme called parallel is in moreutils, and that's not as powerful as GNU's parallel. So I often end up uninstalling sponge/moreutils. :(
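For anyone who hasn't seen it, the sponge version of the overwrite looks like this (sponge soaks up all of its stdin before it touches the output file):

uniq -u movies.csv | sponge movies.csv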
The classic book "The UNIX Programming Environment" by Kernighan and Pike has a tool in it, called 'overwrite', that does this - letting you safely overwrite a file with the result of a command or pipeline, IIRC.
I came here for that. I learned long ago, the hard way, to never ever use the same file for writing as reading. I was wondering if that rule had changed on me.
The rename is atomic; anyone opening "filename" will get either the old version, or the new version. (Although it breaks one of my other favorite idioms for monitoring log files, "tail -f filename", because the old inode will never be updated.)
You could directly write to uniqMovie.csv in your example. I would do it like below, but ONLY once I am certain it is exactly what I want. Usually I just make one clearly named result file per operation without touching the original.
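Something along these lines, i.e. the result goes into its own clearly named file first, and the original only gets replaced once the output is verified (names are just illustrative):

uniq -u movies.csv > uniqMovie.csv    # result in its own file; original untouched
mv uniqMovie.csv movies.csv           # only once you're certain it's exactly what you want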
I think there's something like that in the Kernighan and Pike book I referred to elsewhere in this thread, and also, that code looks similar to this technique:
I've just started using it, and the only limitation I've so far encountered has been that there's no equivalent to awk (i.e. I want a way to evaluate a python expression on every line as part of a pipeline).
Sorry, I meant: remove the characters "$" and "," from the 3rd column of a CSV file. Obviously the CSV file is quoted, since it has commas in the 3rd column, and so awk is no longer an acceptable solution.
Not to sound too much like an Amazon product page, but if you like this, you'll probably quite like "Unix for Poets" - http://www.lsi.upc.edu/~padro/Unixforpoets.pdf . It's my favourite 'intro' to text/data mangling using unix utils.
I'd like to repeat peterwwillis in saying that there are very Unixy tools that are designed for this, and update his link to my favorite, csvfix: http://neilb.bitbucket.org/csvfix/
Neat selling points: csvfix eval and csvfix exec
also: the last commit to csvfix was 6 days ago; it's active, mature, and the developer is very responsive. If you can think of a capability that it doesn't have yet, tell him and you'll have it in no time :)
If you're on Windows, you owe it to yourself to check out a little known Microsoft utility called logparser: http://mlichtenberg.wordpress.com/2011/02/03/log-parser-rock... It effectively lets you query a CSV (or many other log file formats/sources) with a SQL-like language. Very useful tool that I wish was available on Linux systems.
LogParser is one of the few things I really miss from Windows. I think there are unix equivalents, but I haven't had the time to invest in learning them. Pretty much every example in this article boiled down to 'take this CSV and run a simple SQL query on it'. Yes, you can do that by piping through various unix utilities, or you could just use a tool meant specifically for the task. I'd like to see the article explore some more advanced cases, like rolling up a column. I actually had to do this yesterday and ended up opening my data in OpenOffice and using a pivot table.
I just bought the early release of that exact book for $13.60, which was 60% off, because you get 60% off if you order $100 worth of prediscount ebooks.
When the book is finished you get the final version. It's mostly already finished.
"With Early Release ebooks, you get books in their earliest form — the author's raw and unedited content as he or she writes — so you can take advantage of these technologies long before the official release of these titles. You'll also receive updates when significant changes are made, new chapters as they're written, and the final ebook bundle."
You can also find tools designed for your dataset, like csvkit[1] , csvfix[2] , and other tools[3] (I even wrote my own CSV munging Unix tools in Perl back in the day)
If I'm working with a datafile where I expect the delimiter to be in one of the fields, there is something wrong.
This is one reason why I always work with tab delimited files. Having an actual tab character isn't very common in free-text fields, at least in the data that I work with. Commas on the other hand, are quite common. Why one would select a field separator that was common in your data is beyond me (I know it's historical).
Your data files might be different, in which case, maybe you should select a different field separator.
Otherwise, no, there is no work around. If you have to quote fields, then you can't use the normal unix command line tools that tokenize fields.
csvquote allows UNIX tools to work properly with quoted fields that contain delimiters inside the data. It is a simple translation tool that temporarily replaces the special characters occurring inside quotes with harmless non-printing characters. You do it as a first step in the pipeline, then do the regular operations using UNIX tools, and the last step of of the pipeline restores those troublesome characters back inside the data fields.
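A typical pipeline looks like this (the awk stage is just a stand-in for whatever you'd normally do):

csvquote movies.csv | awk -F, '{print $3}' | csvquote -u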
csvfix is probably the best tool to deal with it. Csvfix, awk, sed are probably my "first line of data-attack". After that usually I can get to analysing, plotting or whatever I need to do.
Practically all Unix tools consider the comma-separated-but-you-can-use-quotes-to-override CSV file to be an abomination. [1] You have to have a crazy regexp to get around it.
I love Unix pipelines, but chances are your data is structured in such a way that using regex based tools will break that structure unless you're very careful.
You know that thing about not making HTML with regexes? The same rule applies to CSV, TSV, and XLSX. All of these can be created, manipulated and read using Python, which is probably already on your system.
uniq -u movies.csv > temp.csv
mv temp.csv movie.csv
**Important thing to note here: uniq won't work if duplicate records are not adjacent. [Addition based on HN inputs]
Would the fix here be to sort the lines first using the `sort` command, then `uniq`?
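i.e. something along the lines of:

sort movies.csv | uniq > uniqMovies.csv
sort -u movies.csv > uniqMovies.csv     # or let sort deduplicate on its own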
To run Unix commands on terabytes of data, check out https://cloudbash.sh/. In addition to the standard Unix commands, their join and group-by operations are amazing.
We guys are evaluating replacing our entire ETL with cloudbash!
I use this command very frequently to check how often an event occurs in a log file over time (specifically in 10-minute buckets), assuming the file is formatted like "INFO - [2014-08-27 16:16:29,578] Something something something"
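Roughly, the shape of it is something like this (the grep pattern and file name are stand-ins; the sed keeps the date, hour and tens-of-minutes digit as the bucket key):

grep 'Something something' app.log \
  | sed -E 's/.*\[([0-9-]+ [0-9]{2}:[0-9]).*/\1/' \
  | sort | uniq -c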
"Don't pipe a cat" is how I'm used to describing what you're talking about -- it may have been a performance issue in days past, but these days I think it's simply a matter of style. Not that style is not important.
This was drilled into me back in the usenet days. If you see a cat command with a single argument it's almost always replaceable by a shell redirection, or in this case just by passing the filename as an argument to grep. If you're processing lots of data like in the article there's no point in passing it through a separate command and pipe first.
I think people like reading cat thefile | grepsedawk -opts 'prog' from left to right, and that they think the only alternative is grepsedawk -opts 'prog' thefile.
But there's grep <thefile -opts 're'. I like that one best; it reads the same way you'd tend to think it.
But when you're interactively creating your pipeline, having the cat at the very beginning can save you some shuffling around. For example, while your second command is grep and you're passing your file directly to grep, you might then realize you need another command before grep. So you'll have to move that argument to the other command. With a useless cat in front, you just insert the new command between cat and grep. It doesn't really cause harm.
That's usually faster where possible, but it may cause problems on large data sets, since it loads the entire set of unique strings (and their counts) into an in-memory hash table.
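(For context, the one-pass awk version of sort | uniq -c, which keeps every distinct line in a hash, looks roughly like this:)

awk '{ count[$0]++ } END { for (k in count) print count[k], k }' file.txt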
Immediately ^'d this post for its usefulness, but I put this in my .bashrc instead:
function body_alias() {
    sed -n "${1},${2}p" "$3"
}
alias body=body_alias
I used to have little scripts like body in my own ~/bin or /usr/local/bin but I've been slowly moving those to my .bashrc which I can copy to new systems I log on to.
Glad you liked it. Your alias technique is good too. Plus it may save a small amount of time since the body script does not have to be loaded from its file (as in my case) - unless your *nix version caches it in memory after the first time.
If you were piping into that bracketed expression (instead of using a real file), you'd need "line", "9 read", "sh -c 'read ln; echo $ln'", or "bash -c 'read; echo $REPLY'" in place of the head since head, sed, or anything else, might use buffered I/O and bite off more than it chews. (and then a plain cat in place of the tail)
"line" will compile anywhere but I only know it to be generally available on Linux. I think it's crazy that such a pipe-friendly way to extract a number of lines, and no more than that, isn't part of some standard.
I may as well plug my little program, which takes numbers read line-by-line in standard input and outputs a live-updating histogram (and some summary statistics) in the console!
"rs" for "reshape array". Found only on FreeBSD systems (yes, we are better... smile)
For example, transpose a text file:
~/ (j=0,r=1)$ cat foo.txt
a b c
d e f
~/ (j=0,r=0)$ cat foo.txt | rs -T
a d
b e
c f
Honestly I have never used in production, but I still think it is way cool.
Also, being forced to work in a non-Unix environment, I am always reminded how much I wish everything were either text files, zipped text files, or a SQL database. I know for really big data (bigger than our typical 10^7 row dataset, like imagery or genetics), you have to expand into things like HDF5, but part of my first data cleaning sequence is often to take something out of Excel or whatever and make a text file from it and apply unix tools.
It's good to note that `uniq -u` does remove duplicates, but it doesn't output any instances of a line which has been duplicated. This is probably not clear to a lot of people reading this.
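A quick illustration of the difference:

$ printf 'a\na\nb\n' | uniq
a
b
$ printf 'a\na\nb\n' | uniq -u
b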
Just one other thing I'd like to mention before everyone moves on to another topic. Not all of the unix commands are equal, and some have features that others don't.
E.g. I mainly work on AIX, and a lot of the commands are simply not the same as what they are on more standard linux flavors. From what I've heard, this applies between different distros as well.
Not so much the case with standard programming languages that are portable, e.g. Python (unless you take into account Jython, etc.).
Certain people might be missing the point of why to use the command line.
1) I use this before using R or Python, and ONLY when it's something I consistently need done. It makes my R scripts shorter.
2) Some things just need a simple fix, and these commands are great for that.
Learn awk and sed and your data-munging toolkit gets much larger.
Exactly! I had a longish period when I wanted to do everything with the same tool. Now, I try to pick the most efficient (for me, not the machine) to do it. Csvfix, awk, sed, jq and several other command line goodies make my life easier, the heavy lifting goes to R, gephi, or some ad-hoc Python, go or C
awk / gawk is super useful. For C/C++ programmers the language is very easy to learn. Try running "info gawk" for a very good guide.
I've used gawk for many things, ranging from data analysis to generating linker/loader code in an embedded build environment for a custom processor/sequencer.
(You can even find a version to run from the Windows command prompt if you don't have Cygwin.)
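For a flavour of how C-like the syntax is (the column and file name are just illustrative):

# sum the 2nd whitespace-separated column and print the mean
awk '{ sum += $2; n++ } END { if (n) printf "mean = %.2f\n", sum / n }' data.txt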
Really, HN? If you find yourself depending heavily on the recommendations in this article, you are doing data analysis wrong. Shell-fu is relevant to data analysis only as much as regex is. In the same light, depending on these methods too much digs a deep knowledge ditch that in the end is going to limit and hinder you far more than the initial ingress time required to learn more capable data analytics frameworks, or at least a scripting language.
Still, on International Man Page Appreciation Day this is a great reference. The only thing it is missing is gnuplot ASCII graphs.
I think that he actually is a biologist. He refers to movies as a parallel universe. In which case, these tools are probably not all that helpful. Biological data is usually in the scale of either "Excel can handle it" (shudder) or "ginormous".
In the latter case, none of these would be all that useful, and CSV is not the standard format for most of the biological data that I see.
Databases are less helpful than you'd imagine for this type of data as the schemas are not well defined. I am curious to know how JSON records would work for these data, because I could see something like that working for processing biological data files.