03-06-2012, 08:57 AM   #15
Super Moderator
Location: US

Join Date: Nov 2009
Posts: 437

I think it is worth figuring out the best way to compress/decompress. Our nodes have 64 cores so I will do some tests and see how BZIP2 and GZIP scale. I'll post what I find on this thread.

In the meantime a quick internet search turned up this:

Originally Posted by lh3
Firstly I greatly appreciate and strongly support Nils' effort in multithreading samtools. The change is likely to be merged to samtools.

Re sorting algorithm: samtools sort does stable sorting (i.e. preserving the relative order of records having the same coordinate). In some rare/non-typical use cases, this feature is useful. Merge sort is stable. Introsort is not.
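
To illustrate the stability point, here is a minimal Python sketch of a merge sort keyed on coordinate. It is not the samtools implementation; the `merge_sort` helper and the `(coordinate, read name)` records are made up for the example. The only thing it is meant to show is that taking from the left half on ties keeps records with the same coordinate in their input order.

```python
def merge_sort(records, key):
    """Stable merge sort: on ties, the element from the left half wins."""
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    left = merge_sort(records[:mid], key)
    right = merge_sort(records[mid:], key)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Using "<=" rather than "<" is what makes the merge stable.
        if key(left[i]) <= key(right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Hypothetical (coordinate, read name) records; three share coordinate 100.
reads = [(200, "r1"), (100, "r2"), (100, "r3"), (50, "r4"), (100, "r5")]
by_coord = merge_sort(reads, key=lambda r: r[0])
print(by_coord)
# r2, r3, r5 stay in their original relative order at coordinate 100.
```

An unstable sort is free to emit r2, r3 and r5 in any order, which is why swapping in introsort would change behaviour for those rare use cases.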

Re pigz: someone told me on biostar that pigz is not very scalable with many cores. If this is true (I have not tried), this must be because the gzip format has long range dependencies. bzip2 and bgzip are much easier to parallelize and probably more scalable. In addition, bzip2 has a parallel version pbzip2 which the same person told me scales very well with the number of CPU cores.
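
The reason block-based formats parallelize easily can be sketched in a few lines of Python. This is only an illustration of the general idea (independent blocks compressed separately, then concatenated), not how pbzip2 or bgzip are actually written; `compress_blocks`, `BLOCK_SIZE` and the worker count are assumptions picked for the example. It relies on the standard `bz2` module accepting concatenated bzip2 streams on decompression.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1 << 20  # 1 MiB per block; an arbitrary choice for this sketch


def compress_blocks(data, workers=4):
    """Compress fixed-size blocks independently and concatenate the streams."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so blocks stay in sequence.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


payload = b"ACGTTGCA" * 500_000  # about 4 MB of made-up data
packed = compress_blocks(payload)
restored = bz2.decompress(packed)  # handles multiple concatenated streams
print(len(payload), len(packed), restored == payload)
```

Because no block depends on any other, the work splits cleanly across cores. The cost is a slightly worse compression ratio than a single stream, since each block starts with no history.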

Re bzip2: I have argued a couple of times here (years ago) and also on the samtools list that the key reason samtools uses gzip instead of bzip2 is that gzip is 5-10X faster on decompression. With bzip2, most samtools commands will be 2-10 times slower. I think for huge data sets that need to be read frequently, gzip is always preferred over bzip2.
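
A rough way to check the decompression-speed claim on your own machine is a small timing script like the one below. It is a sketch only: it uses Python's standard `gzip` and `bz2` modules rather than samtools itself, the random ACGT input is made up, and `time_decompress` is a helper invented for the example. The actual ratio will depend on the data, the compression level and the hardware.

```python
import bz2
import gzip
import random
import time


def time_decompress(decompress, blob, repeats=3):
    """Return the best wall-clock time (seconds) over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best


random.seed(0)
# 2 MB of random nucleotides; real sequencing data will compress differently.
data = "".join(random.choices("ACGT", k=2_000_000)).encode()

gz_blob = gzip.compress(data)
bz_blob = bz2.compress(data)

gz_time = time_decompress(gzip.decompress, gz_blob)
bz_time = time_decompress(bz2.decompress, bz_blob)

print(f"gzip : {len(gz_blob):>8} bytes, decompress {gz_time:.4f} s")
print(f"bzip2: {len(bz_blob):>8} bytes, decompress {bz_time:.4f} s")
```

For a real comparison you would want to run this on an actual BAM-sized file and with the command-line tools, but it gives a quick first impression of the gap.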
adaptivegenome