View Single Post
Old 03-06-2012, 08:49 AM   #14
Senior Member
Location: Boston

Join Date: Feb 2008
Posts: 693

Firstly I greatly appreciate and strongly support Nils effort in multithreading samtools. The change is likely to be merged to samtools.

Re sorting algorithm: samtools sort does stable sorting (i.e. preserving the relative order of records having the same coordinate). In some rare/non-typical use cases, this feature is useful. Merge sort is stable. Introsort is not.

Re pigz: someone told me on biostar that pigz is not very scalable with many cores. If this is true (I have not tried), this must be because the gzip format has long range dependencies. bzip2 and bgzip are much easier to parallelize and probably more scalable. In addition, bzip2 has a parallel version pbzip2 which the same person told me scales very well with the number of CPU cores.

Re bzip2: I have argued a couple times here (years ago) and also on the samtools list that the key reason samtools uses gzip instead of bzip2 is because gzip is 5-10X faster on decompression. With bzip2, most samtools command will be 2-10 times slower. I think for huge data sets that need to be read frequently, gzip is always preferred over bzip2.

Last edited by lh3; 03-06-2012 at 08:53 AM.
lh3 is offline   Reply With Quote