"While dealing with big genetic data sets I often got stuck with limitation of programming languages in terms of reading big files."
Hate to sound like Steve Jobs here, but: "You're using it wrong."
Let me elaborate. If you're running into "too-big" or "too-long" limitations in your language of choice, then you're just a few searches away from learning both how to solve the task at hand and how your language actually works. Both of those will keep you from being stuck the next time a similar big-data job comes along.
Perhaps you are more comfortable using pre-defined lego-blocks to build your logic. Perhaps you understand the unix commands better than you do your chosen language. But understand that programming is the same, just in a different conceptual/knowledge space. And remember, always use the right tool for the job!
(I use Unix commands daily as they're quick/dirty in a jiffy, but for complex tasks I am more productive solving the problem in a language I am comfortable in instead of searching through man pages for obscure flags/functionality)
Yeah, I think calling him the "unenlightened" one is pretty off base here.
For performing the tasks outlined in his examples, the Unix utilities are easier for the user and execute faster than code you write yourself in a general-purpose programming language, unless you put in the time to tune the implementation.
One could rebuild AWK in C and get similar performance, but why not just use some extremely simple AWK? And is anybody going to replicate the amazing speed of grep without a huge time investment? [1]
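To make "extremely simple AWK" concrete, a hypothetical one-liner (the log file and field number are invented here) that counts how often each value of the ninth field shows up, e.g. status codes in a web log:
$ awk '{count[$9]++} END {for (c in count) print c, count[c]}' access.log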
There is a right tool for the job, and having seen dozens of programmers be exposed to new big data sets, I can tell you that the ones who become productive quickly and stay productive are the ones who adopt whatever tool is best for the job, not the ones who stick to their one favorite programming language. In fact, a good sign that somebody will quickly fail is hearing them say "forget those Unix tools, I'm just going to write this in X".
This is one area where I wish the Unix philosophy (reuse of tools) was taken a bit further. To me, every command should be callable as a C library function, so you wouldn't have to parse the human-readable output through a pipe. Beyond that, every command should offer both human-readable and machine-readable output. For example, I would love to be able to call "ps" from another script and easily select specific columns from XML or JSON output.
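For what it's worth, ps can already do its own column selection, though the output is still plain text rather than XML or JSON. A rough illustration using procps-style ps on Linux (the chosen columns and sort key are just examples; BSD ps flags differ):
$ ps -eo pid,comm,rss --sort=-rss | head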
Shell scripts can be pretty powerful if you know what you're doing, but I do agree that sometimes the shell script paradigm can be more of a hurdle than a help.
However, your point about every command being callable as a C library is kind of possible already. Some commands do have native language libraries (eg libcurl), but you could also fork out to those ELFs if you're feeling really brave (though in all practicality it's little worse than writing a shell script to begin with). In fact there are times I've been known to cheat with Perl and run (for example):
(my $hostname = `hostname`) =~ s/\n//g;
because it's quicker and easier to throw together than using the proper Perl libraries (yeah, it's pretty nasty from an academic perspective, but the additional footprint is minimal while the development time saved is significant).
Of course, any such code that's used regularly and/or depended on will be cleaned up as and when I have the time.
As for your XML or JSON parsing; the same theory as above could be applied:
use JSON::Parse 'parse_json';
my $json = `curl --silent http://birthdays.com/myfriends.json`;
my $bdays = parse_json($json);
print "derekp7's birthday is $bdays{derekp7}";
Obviously these aren't best practices, but if it's only running locally (i.e. this isn't part of a CGI or similar script that's web accessible) and it gets the job done in a hurry, then I can't see why you shouldn't use it for ad hoc reporting.
Most scripting languages aren't multithreaded, and some aren't pipeline oriented by default.
For example, working with file lines naively in Ruby means reading the whole lot into a giant array and doing transformations an array at a time, rather than in a streaming fashion.
The shell gives you fairly safe concurrency and streaming for free.
Personally, if it's a complex task, I generally write a tool such that it can be put into a shell pipeline.
Knowing the command line well - so that you don't often have to look up man pages for obscure flags / functionality - has its own rewards, as these commands turn into something you use all the time in the terminal. Rather than spending a few minutes developing a script in an editor, you can incrementally build a pipeline over a few seconds. Doing your script in a REPL is a better approximation, but it's a bit less immediate.
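As a sketch of that incremental style (the log file, pattern, and field position are made up): start with a cheap filter, eyeball the output, then bolt on more stages:
$ grep ERROR app.log | head
$ grep ERROR app.log | awk '{print $5}' | sort | uniq -c | sort -rn | head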
You don't have to read all of a file into memory in Ruby. There are a number of facilities for reading only a portion of a file, readpartial[1] for example. Additionally, you have access to all of the native pipe[2] functionality as well. There are plenty of reasons to favor shell tools over Ruby, but those aren't some of them.
The problem with processing big files in Ruby (in my humble experience) is usually that it's still slow enough that "preprocessing with grep & uniq" is worthwhile.
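That is, something along these lines, where the pattern, file names, and the Ruby script are all hypothetical: shrink the input with the shell first, then hand the remainder to Ruby.
$ grep 'pattern' huge.txt | sort | uniq > smaller.txt
$ ruby analyze.rb smaller.txt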
Standard grep is much faster on multi-gigabyte files than anything you can figure out how to do in your pet language. By the time you get close to matching grep, you would have reimplemented most of grep, in half-assed fashion at that.
Your delusion is assuming standard command line tools are simple in function because they have a simple interface that Average Joe can use.
ag is often faster when you're using it interactively, replacing "grep -r" (in particular in version controlled dirs). It's also faster in the sense that for interactive use it will often DWYM.
But it has too many weird quirks to replace grep for data munging. E.g.
$ ag verb fullform_nn.txt >/tmp/verbs
ERR: Too many matches in fullform_nn.txt. Skipping the rest of this file.
The ag man page says there's a --max-count option. Let's try that.
$ grep -c verb fullform_nn.txt
206077
$ ag --max-count 206077 verb fullform_nn.txt >/tmp/verbs
ERR: Too many matches in fullform_nn.txt. Skipping the rest of this file.
Wtf? (and running those two commands with "time" gave ag user 0m0.770s while grep had user 0m0.057s)
I have never used ag, but in most instances where people thought they made a faster grep, it is because it doesn't handle multibyte encodings correctly.
I would allow that for some definition of "too-big".
I wrote a distributed grep impl a few years back to grep my logs and collect output to a central machine (a vague "how many machines had this error" job).
The central orchestration was easy in Python, but implementing
zgrep | awk | sort | uniq -c | wc -l
is way more code, and way slower, in Python than doing it with the shell (zgrep is awesome for .gz logs).
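For concreteness, on a single machine that pipeline might look something like this, counting how many distinct values of the first field matched a given error (the error string, field, and file names are invented):
$ zgrep 'ERROR 502' app.log.*.gz | awk '{print $1}' | sort | uniq -c | wc -l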
On the other hand, the shell coordinator was so much harder using pdsh that I reverted to using paramiko and Python threadpools.
Unix tools are extremely composable and present in nearly every machine with the standard behaviour.
> Hate to sound like Steve-Jobs here, but: "You're using it wrong."
I don't quite agree — say this individual needs to sort a file by two columns. Should they really load everything into memory to call Python's sorted()? With large genomics datasets this isn't possible. Trying to reimplement sort's on-disk merge sort would be unnecessary and treacherous.
It's easy to forget how much engineering went into these core utilities, which can be very useful when working with big files.
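For reference, a two-column sort with GNU sort is a one-liner, and sort spills to temporary files on its own when the input doesn't fit in memory. A rough sketch with invented file and column choices (sort by column 1, then numerically by column 2):
$ sort -k1,1 -k2,2n big.tsv > big.sorted.tsv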
It's not hard to write an on-disk merge sort using Python... it just may not be that fast.
But really, as I'm sure you know, for genome-scale datasets, the key word is streaming. Disk IO is a major bottleneck. If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest. Or, you'd be calculating some other kind of non-trivial summary statistics. In both of these cases, you'd need to use some kind of custom program. But you still should be operating on the stream, not the entire dataset.
(Off the top of my head I can think of only a few instances where you'd need to operate on a column as opposed to a row in genome data - multiple testing correction being the main one)
If you need to sort by two columns, yes, by all means use "sort". It's about as fast as you are going to get. But for "exploratory analysis" on genomic data, you'd better have a really good reason (or small dataset) to use these tools.
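A streaming row filter of that kind can often stay in awk and never hold the file in memory; a sketch with a made-up column and threshold (keep rows whose 7th column, say a p-value, is below 0.05):
$ awk '$7 < 0.05' results.txt > significant.txt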
> If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest.
For repeated queries, this isn't efficient. This is why we have indexed, sorted BAM files compressed with BGZF (and tabix, which uses the same ideas). Many queries in genomics are with respect to position, and this is O(1) with a sorted, indexed file and O(n) with streaming. Streaming also involves reading and uncompressing the entire file from disk, whereas accessing entries from an indexed, sorted file involves a seek() to the correct offset and decompressing only that particular block, which is far more efficient. I definitely agree that streaming is great, but sorting data is an essential trick in working with large genomics data sets.
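In practice that is the usual bgzip/tabix workflow; for example, for a position-sorted VCF (the file name and region are invented):
$ bgzip variants.vcf
$ tabix -p vcf variants.vcf.gz
$ tabix variants.vcf.gz chr1:100000-200000
The first command produces the BGZF-compressed variants.vcf.gz, the second builds the .tbi index, and the third jumps straight to the requested region without scanning the whole file.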
I'm well aware of BAM and BGZF. I've even written a parser or two (yay double encoded binary formats). I really like the BGZF format and think it doesn't get used enough outside of sequencing. It's basically gzip frames with enough information to allow random access within the compressed stream. And tabix is a great way to efficiently index otherwise unindexable text files.
However, these are all binary formats. I specifically said that you shouldn't sort genome files in text format. Because while text is easy to parse, binary is faster for this. You aren't going to use any of the standard unix tools once you've crossed over into binary formats. And so you are stuck using domain specific tools. Stuff like http://ngsutils.org (shameless plug).
I have seen people write tools for parsing SAM files using unix core utils, but they are always orders of magnitude slower than a good BAM-specific tool (in almost any language).
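For example, pulling a region out of a coordinate-sorted BAM with samtools is both shorter and far faster than any text-parsing approach (file name and region invented):
$ samtools index aln.bam
$ samtools view -b aln.bam chr2:1000000-2000000 > region.bam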
Just to point out, "big genetic datasets" can be hundreds of gigabytes to terabytes, sizes which are not possible to process easily in any language. A highly optimized C program like uniq might be the fastest option short of a distributed system.
When you have to deal with (genetic data) files of a few GB on a daily basis, I don't think using Python, R, or databases is a good idea for basic data exploration.
As someone who deals with large datasets using a database + Python daily, I'm not quite sure what you mean. You'll have to explain to me what "not a good idea" is, or "basic data exploring".
Consider I get 10 files of size 3 GB every week, which I am supposed to filter on a certain column using a reference index and forward to my colleague. Before filtering I also want to check what the file looks like: column names, the first few records, etc.
I can use something like the following to explore a few rows and columns.
$ awk '{print $1,$3,$5}' file | head -10
And then I can use something like sed with a reference index to filter the file. Since I plan to repeat this with different files, databases would be time consuming (even if I automated loading every file and querying it). Due to the file size, options like R or Python would be slower than Unix commands. I can also save the set of commands as a script and share/run it whenever I need it.
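For the reference-index filter itself, a two-file awk pass does it in a single stream; roughly (file names invented, keeping rows of data.tsv whose first column appears in ids.txt):
$ awk 'NR==FNR {keep[$1]; next} $1 in keep' ids.txt data.tsv > filtered.tsv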
If there is a better way I would be happy to learn.
I think the gain you're seeing there is because it's quicker for you to do quick, dirty ad hoc work with the shell than it is to write custom python for each file. Which totally makes sense, the work's ad hoc so use an ad hoc tool. Python being slow and grep being a marvel of optimization doesn't really matter, here, compared to the dev time you're saving.
I have been doing Python the last few years, but went back to Perl for this sort of thing recently. You can start with a one-liner, and if it gets complicated, just turn it into a proper script. The same goes for the Unix commands mentioned. It's just faster when you don't know what you are dealing with yet.
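For instance, a throwaway one-liner like the following (the field positions and threshold are invented) can later be pasted into a file and grown into a real script:
$ perl -lane 'print "$F[0]\t$F[2]" if $F[4] > 100' data.tsv | head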