"While dealing with big genetic data sets I often got stuck with limitation of programming languages in terms of reading big files."
Hate to sound like Steve Jobs here, but: "You're using it wrong."
Let me elaborate. If you're running into "too-big" or "too-long" limitations in your language of choice, then you're just a few searches away from learning both how to solve the task at hand and how your language actually works. Both of those will keep you from being stuck the next time a similar big-data job comes along.
Perhaps you are more comfortable using pre-defined lego-blocks to build your logic. Perhaps you understand the unix commands better than you do your chosen language. But understand that programming is the same, just in a different conceptual/knowledge space. And remember, always use the right tool for the job!
(I use Unix commands daily as they're quick/dirty in a jiffy, but for complex tasks I am more productive solving the problem in a language I am comfortable in instead of searching through man pages for obscure flags/functionality)
Yeah, I think calling him the "unenlightened" one is pretty off base here.
For performing the tasks outlined in his examples, the Unix utilities are easier for the user and execute faster than code you write yourself in a general-purpose programming language, unless you put in the time to tune the implementation.
One could rebuild AWK in C and get similar performance, but why not just use some extremely simple AWK? And is anybody going to replicate the amazing speed of grep without a huge time investment? [1]
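To make "extremely simple AWK" concrete, a hypothetical one-liner (the log file and field number are invented here) that counts how often each value of the ninth field shows up, e.g. status codes in a web log:
$ awk '{count[$9]++} END {for (c in count) print c, count[c]}' access.log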
There is a right tool for the job, and having seen dozens of programmers be exposed to new big data sets, I can tell you that the ones who become productive quickly and stay productive are the ones who adopt whatever tool is best for the job, not the ones who stick to their one favorite programming language. In fact, a good sign that somebody will quickly fail is hearing them say "forget those Unix tools, I'm just going to write this in X".
This is one area where I wish the Unix philosophy (reuse of tools) was taken a bit further. To me, every command should be callable as a C library function, so you wouldn't have to parse the human-readable output through a pipe. Beyond that, every command should offer both human-readable and machine-readable output. For example, I would love to be able to call "ps" from another script and easily select specific columns from XML or JSON output.
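For what it's worth, ps can already do its own column selection, though the output is still plain text rather than XML or JSON. A rough illustration using procps-style ps on Linux (the chosen columns and sort key are just examples; BSD ps flags differ):
$ ps -eo pid,comm,rss --sort=-rss | head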
Shell scripts can be pretty powerful if you know what you're doing, but I do agree that sometimes the shell script paradigm can be more of a hurdle than a help.
However, your point about every command being callable as a C library is kind of possible already. Some commands do have native language libraries (eg libcurl), but you could also fork out to those ELFs if you're feeling really brave (though in all practicality it's little worse than writing a shell script to begin with). In fact there are times I've been known to cheat with Perl and run (for example):
(my $hostname = `hostname`) =~ s/\n//g;
because it's quicker and easier to throw together than using the proper Perl libraries (yeah, it's pretty nasty from an academic perspective, but the additional footprint is minimal while the development time saved is significant).
Of course, any such code that's used regularly and/or depended on will be cleaned up as and when I have the time.
As for your XML or JSON parsing; the same theory as above could be applied:
use JSON::Parse 'parse_json';
my $json = `curl --silent http://birthdays.com/myfriends.json`;
my $bdays = parse_json($json);
print "derekp7's birthday is $bdays{derekp7}";
Obviously these aren't best practices, but if it's only running locally (i.e. this isn't part of a CGI or similar script that's web accessible) and it gets the job done in a hurry, then I can't see why you shouldn't use it for ad hoc reporting.
Most scripting languages aren't multithreaded, and some aren't pipeline oriented by default.
For example, working with file lines naively in Ruby means reading the whole lot into a giant array and doing transformations an array at a time, rather than in a streaming fashion.
The shell gives you fairly safe concurrency and streaming for free.
Personally, if it's a complex task, I generally write a tool such that it can be put into a shell pipeline.
Knowing the command line well - so that you don't often have to look up man pages for obscure flags / functionality - has its own rewards, as these commands turn into something you use all the time in the terminal. Rather than spending a few minutes developing a script in an editor, you can incrementally build a pipeline over a few seconds. Doing your script in a REPL is a better approximation, but it's a bit less immediate.
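As a sketch of that incremental style (the log file, pattern, and field position are made up): start with a cheap filter, eyeball the output, then bolt on more stages:
$ grep ERROR app.log | head
$ grep ERROR app.log | awk '{print $5}' | sort | uniq -c | sort -rn | head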
You don't have to read all of a file into memory in Ruby. There are a number of facilities for reading only a portion of a file, readpartial[1] for example. Additionally, you have access to all of the native pipe[2] functionality as well. There are plenty of reasons to favor shell tools over Ruby, but those aren't some of them.
The problem with processing big files in Ruby (in my humble experience) is usually that it's still slow enough that "preprocessing with grep & uniq" is worthwhile.
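That is, something along these lines, where the pattern, file names, and the Ruby script are all hypothetical: shrink the input with the shell first, then hand the remainder to Ruby.
$ grep 'pattern' huge.txt | sort | uniq > smaller.txt
$ ruby analyze.rb smaller.txt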
Standard grep is much faster on multi-gigabyte files than anything you can figure out how to do in your pet language. By the time you get close to matching grep, you would have reimplemented most of grep, in half-assed fashion at that.
Your delusion is assuming standard command line tools are simple in function because they have a simple interface that Average Joe can use.
ag is often faster when you're using it interactively, replacing "grep -r" (in particular in version controlled dirs). It's also faster in the sense that for interactive use it will often DWYM.
But it has too many weird quirks to replace grep for data munging. E.g.
$ ag verb fullform_nn.txt >/tmp/verbs
ERR: Too many matches in fullform_nn.txt. Skipping the rest of this file.
The ag man page says there's a --max-count option. Let's try that.
$ grep -c verb fullform_nn.txt
206077
$ ag --max-count 206077 verb fullform_nn.txt >/tmp/verbs
ERR: Too many matches in fullform_nn.txt. Skipping the rest of this file.
Wtf? (and running those two commands with "time" gave ag user 0m0.770s while grep had user 0m0.057s)
I have never used ag, but in most instances where people thought they made a faster grep, it is because it doesn't handle multibyte encodings correctly.
I would allow that for some definition of "too-big".
I wrote a distributed grep impl a few years back to grep my logs and collect output to a central machine (a vague "how many machines had this error" job).
The central orchestration was easy in Python, but implementing
zgrep | awk | sort | uniq -c | wc -l
is way more code, and way slower, in Python than doing it with the shell (zgrep is awesome for .gz logs).
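For concreteness, on a single machine that pipeline might look something like this, counting how many distinct values of the first field matched a given error (the error string, field, and file names are invented):
$ zgrep 'ERROR 502' app.log.*.gz | awk '{print $1}' | sort | uniq -c | wc -l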
On the other hand, the shell coordinator was so much harder using pdsh that I reverted to using paramiko and Python threadpools.
Unix tools are extremely composable and present in nearly every machine with the standard behaviour.
> Hate to sound like Steve-Jobs here, but: "You're using it wrong."
I don't quite agree — say this individual needs to sort a file by two columns. Should they really load everything into memory to call Python's sorted()? With large genomics datasets this isn't possible. Trying to reimplement sort's on-disk merge sort would be unnecessary and treacherous.
It's easy to forget how much engineering went into these core utilities, which can be very useful when working with big files.
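For reference, a two-column sort with GNU sort is a one-liner, and sort spills to temporary files on its own when the input doesn't fit in memory. A rough sketch with invented file and column choices (sort by column 1, then numerically by column 2):
$ sort -k1,1 -k2,2n big.tsv > big.sorted.tsv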
It's not hard to write an on-disk merge sort using Python... it just may not be that fast.
But really, as I'm sure you know, for genome-scale datasets, the key word is streaming. Disk IO is a major bottleneck. If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest. Or, you'd be calculating some other kind of non-trivial summary statistics. In both of these cases, you'd need to use some kind of custom program. But you still should be operating on the stream, not the entire dataset.
(Off the top of my head I can think of only a few instances where you'd need to operate on a column as opposed to a row in genome data - multiple testing correction being the main one)
If you need to sort by two columns, yes, by all means use "sort". It's about as fast as you are going to get. But for "exploratory analysis" on genomic data, you'd better have a really good reason (or small dataset) to use these tools.
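A streaming row filter of that kind can often stay in awk and never hold the file in memory; a sketch with a made-up column and threshold (keep rows whose 7th column, say a p-value, is below 0.05):
$ awk '$7 < 0.05' results.txt > significant.txt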
> If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest.
For repeated queries, this isn't efficient. This is why we have indexed, sorted BAM files compressed with BGZF (and tabix, which uses the same ideas). Many queries in genomics are with respect to position, and this is O(1) with a sorted, indexed file and O(n) with streaming. Streaming also involves reading and uncompressing the entire file from disk, whereas accessing entries from an indexed, sorted file involves a seek() to the correct offset and decompressing only that particular block, which is far more efficient. I definitely agree that streaming is great, but sorting data is an essential trick in working with large genomics data sets.
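In practice that is the usual bgzip/tabix workflow; for example, for a position-sorted VCF (the file name and region are invented):
$ bgzip variants.vcf
$ tabix -p vcf variants.vcf.gz
$ tabix variants.vcf.gz chr1:100000-200000
The first command produces the BGZF-compressed variants.vcf.gz, the second builds the .tbi index, and the third jumps straight to the requested region without scanning the whole file.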
I'm well aware of BAM and BGZF. I've even written a parser or two (yay double encoded binary formats). I really like the BGZF format and think it doesn't get used enough outside of sequencing. It's basically gzip frames with enough information to allow random access within the compressed stream. And tabix is a great way to efficiently index otherwise unindexable text files.
However, these are all binary formats. I specifically said that you shouldn't sort genome files in text format. Because while text is easy to parse, binary is faster for this. You aren't going to use any of the standard unix tools once you've crossed over into binary formats. And so you are stuck using domain specific tools. Stuff like http://ngsutils.org (shameless plug).
I have seen people write tools for parsing SAM files using unix core utils, but they are always orders of magnitude slower than a good BAM-specific tool (in almost any language).
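For example, pulling a region out of a coordinate-sorted BAM with samtools is both shorter and far faster than any text-parsing approach (file name and region invented):
$ samtools index aln.bam
$ samtools view -b aln.bam chr2:1000000-2000000 > region.bam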
Just to point out, "big genetic datasets" can be hundreds of gigabytes to terabytes, sizes which are not possible to process easily in any language. A highly optimized C program like uniq might be the fastest option short of a distributed system.
When you have to deal with (genetic data) files of a few GB on a daily basis, I don't think using Python, R, or databases is a good idea for basic data exploration.
As someone who deals with large datasets using a database + Python daily, I'm not quite sure what you mean. You'll have to explain to me what "not a good idea" is, or "basic data exploring".
Consider I get 10 files of size 3 GB every week, which I am supposed to filter on a certain column using a reference index and forward to my colleague. Before filtering I also want to check what the file looks like: column names, the first few records, etc.
I can use something like the following to explore a few rows and columns.
$ awk '{print $1,$3,$5}' file | head -10
And then I can use something like sed with a reference index to filter the file. Since I plan to repeat this with different files, databases would be time consuming (even if I automated loading every file and querying it). Due to the file size, options like R or Python would be slower than Unix commands. I can also save the set of commands as a script and share/run it whenever I need it.
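For the reference-index filter itself, a two-file awk pass does it in a single stream; roughly (file names invented, keeping rows of data.tsv whose first column appears in ids.txt):
$ awk 'NR==FNR {keep[$1]; next} $1 in keep' ids.txt data.tsv > filtered.tsv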
If there is a better way I would be happy to learn.
I think the gain you're seeing there is because it's quicker for you to do quick, dirty ad hoc work with the shell than it is to write custom python for each file. Which totally makes sense, the work's ad hoc so use an ad hoc tool. Python being slow and grep being a marvel of optimization doesn't really matter, here, compared to the dev time you're saving.
I have been doing Python the last few years, but went back to Perl for this sort of thing recently. You can start with a one-liner, and if it gets complicated, just turn it into a proper script. The same goes for the Unix commands mentioned. It's just faster when you don't know what you are dealing with yet.
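For instance, a throwaway one-liner like the following (the field positions and threshold are invented) can later be pasted into a file and grown into a real script:
$ perl -lane 'print "$F[0]\t$F[2]" if $F[4] > 100' data.tsv | head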