> If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest.
For repeated queries, this isn't efficient. This is why we have indexed, sorted BAM files compressed with BGZF (and tabix, which uses the same ideas). Many queries in genomics are with respect to position, and this is effectively O(1) with a sorted, indexed file and O(n) with streaming. Streaming also involves reading and decompressing the entire file from disk, whereas accessing entries from an indexed, sorted file involves a seek() to the correct offset and decompressing only that particular block, which is far more efficient. I definitely agree that streaming is great, but sorting data is an essential trick in working with large genomics data sets.
I'm well aware of BAM and BGZF. I've even written a parser or two (yay double-encoded binary formats). I really like the BGZF format and think it doesn't get used enough outside of sequencing. It's basically a series of gzip frames with enough information to allow random access within the compressed stream. And tabix is a great way to efficiently index otherwise unindexable text files.
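To make the "seek to an offset and decompress one block" idea concrete, here is a minimal sketch in Python. It is not real BGZF: actual BGZF stores each block's size in a gzip extra field and addresses records with virtual offsets, and the real index formats are more involved. The function names (`write_blocks`, `fetch`), the tab-separated record layout, and the block size are all made up for illustration. The sketch assumes the records are already sorted by position, which is exactly why sorting matters here.

```python
import bisect
import gzip
import io


def write_blocks(records, block_size=3):
    """Compress sorted (position, name) records as independent gzip members.

    Returns (data, index), where index holds one
    (first_position, byte_offset, byte_length) tuple per block.
    Illustrative layout only, not the real BGZF or tabix format.
    """
    buf = io.BytesIO()
    index = []
    for i in range(0, len(records), block_size):
        chunk = records[i:i + block_size]
        payload = "".join(f"{pos}\t{name}\n" for pos, name in chunk).encode()
        member = gzip.compress(payload)  # each block is a complete gzip member
        index.append((chunk[0][0], buf.tell(), len(member)))
        buf.write(member)
    return buf.getvalue(), index


def fetch(data, index, pos):
    """Look up one position, decompressing only the block that could hold it."""
    starts = [first for first, _, _ in index]
    i = bisect.bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    _, offset, length = index[i]
    stream = io.BytesIO(data)  # stands in for an open file on disk
    stream.seek(offset)        # jump straight to the block
    block = gzip.decompress(stream.read(length)).decode()
    for line in block.splitlines():
        p, name = line.split("\t")
        if int(p) == pos:
            return name
    return None


records = [(100, "a"), (250, "b"), (300, "c"), (4000, "d"),
           (4100, "e"), (9000, "f"), (9500, "g")]
data, index = write_blocks(records)
print(fetch(data, index, 4100))
```

Because every block is a complete gzip member, the concatenated file is still a valid gzip stream that ordinary tools can read from start to finish, while the index lets a reader skip straight to one block. The lookup in this sketch is a binary search over block start positions; the work per query depends on the block size, not on the size of the whole file.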
However, these are all binary formats. I specifically said that you shouldn't sort genome files in text format, because while text is easy to parse, binary is faster for this. You aren't going to use any of the standard Unix tools once you've crossed over into binary formats, so you are stuck using domain-specific tools. Stuff like http://ngsutils.org (shameless plug).
I have seen people write tools for parsing SAM files using Unix core utils, but they are always orders of magnitude slower than a good BAM-specific tool (in almost any language).