> If you're using a large genomic dataset, you shouldn't be sorting your results...

mbreese · on Aug 28, 2014

I'm well aware of BAM and BGZF. I've even written a parser or two (yay double-encoded binary formats). I really like the BGZF format and think it doesn't get used enough outside of sequencing. It's basically a series of gzip frames, each carrying enough information to allow random access within the compressed stream. And tabix is a great way to efficiently index otherwise unindexable text files.
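As a rough illustration of that random-access idea, here's a minimal Python sketch. It isn't taken from any real BGZF library, and the helper names (`make_block`, `block_offsets`, `read_at`) are made up for the example. It assumes the `BC` extra subfield is the first and only one in each gzip header, and it skips the empty EOF block that real BGZF files end with.

```python
import gzip
import struct
import zlib

def make_block(data: bytes) -> bytes:
    """Wrap data in one BGZF-style gzip member (hypothetical helper)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib header
    cdata = comp.compress(data) + comp.flush()
    # 12-byte gzip header + 6-byte 'BC' subfield + payload + 8-byte trailer.
    # BSIZE stores the total block size minus one.
    bsize = 12 + 6 + len(cdata) + 8 - 1
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + cdata + trailer

def block_offsets(buf: bytes) -> list[int]:
    """Walk the block chain using only the headers; nothing is decompressed."""
    offsets, pos = [], 0
    while pos < len(buf):
        (bsize,) = struct.unpack_from("<H", buf, pos + 16)  # assumes BC comes first
        offsets.append(pos)
        pos += bsize + 1
    return offsets

def read_at(buf: bytes, voffset: int, n: int) -> bytes:
    """Read n bytes at a virtual offset: (block start << 16) | offset in block."""
    coffset, uoffset = voffset >> 16, voffset & 0xFFFF
    (bsize,) = struct.unpack_from("<H", buf, coffset + 16)
    cdata = buf[coffset + 18 : coffset + bsize + 1 - 8]
    return zlib.decompress(cdata, -15)[uoffset : uoffset + n]

# Two blocks of made-up tab-separated records.
records = [b"chr1\t100\n" * 3, b"chr2\t200\n"]
buf = b"".join(make_block(r) for r in records)
offsets = block_offsets(buf)

# Jump straight into the second block without touching the first.
voffset = (offsets[1] << 16) | 5
print(read_at(buf, voffset, 3))

# The same bytes are still an ordinary multi-member gzip stream.
print(gzip.decompress(buf) == b"".join(records))
```

The point of the layout is that an index only needs to store virtual offsets; a reader seeks to the block start, inflates that one block, and skips ahead within it.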

However, these are all binary formats. I specifically said that you shouldn't sort genome files in text format: text is easy to parse, but binary is faster for this. Once you've crossed over into binary formats, you aren't going to use any of the standard Unix tools, so you're stuck using domain-specific tools. Stuff like http://ngsutils.org (shameless plug).

I have seen people write tools for parsing SAM files using Unix coreutils, but they are always orders of magnitude slower than a good BAM-specific tool (in almost any language).