03-04-2012, 06:47 AM   #1
nilshomer (Nils Homer), Boston, MA, USA

Multi-threaded (faster) SAMtools

I have been working on speeding up reading and writing within SAMtools by creating a multi-threaded block gzip reader and writer. I have an alpha version working. I would appreciate some feedback and testing; just don't use it for production systems yet. Thank you!

http://github.com/nh13/samtools/tree/pbgzip

NB: I would be happy to describe the implementation and to collaborate on getting this into Picard too.
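
For anyone curious about the general shape of the approach, here is a minimal sketch (not the pbgzip code itself; block size, compression level, and thread count are placeholder choices): split the input into fixed-size blocks, compress the blocks in a pool of workers, and write the resulting gzip members out in their original order. Concatenated gzip members still form a valid gzip stream, which is the property BGZF builds on; real BGZF additionally records each compressed block's size in an extra gzip header field so the file can be seeked, which this sketch omits.

Code:
# Minimal sketch of a parallel block-gzip writer (not the pbgzip code itself).
# Each block is compressed as an independent gzip member by a worker process,
# and the members are written out in their original order.
import gzip
import sys
from concurrent.futures import ProcessPoolExecutor

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block; BGZF uses ~64 KB


def compress_block(block: bytes) -> bytes:
    """Compress one block as a standalone gzip member."""
    return gzip.compress(block, compresslevel=6)


def parallel_block_gzip(src_path: str, dst_path: str, threads: int = 8) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst, \
            ProcessPoolExecutor(max_workers=threads) as pool:
        blocks = iter(lambda: src.read(BLOCK_SIZE), b"")
        # pool.map yields results in input order, so the output stream stays
        # ordered even though blocks may finish compressing out of order.
        for member in pool.map(compress_block, blocks, chunksize=4):
            dst.write(member)


if __name__ == "__main__":
    parallel_block_gzip(sys.argv[1], sys.argv[2])

Standard gzip/zcat will decompress the concatenated members without complaint, so the output remains an ordinary .gz file.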

03-04-2012, 08:03 AM   #2
Richard Finney, Bethesda

Any benchmarks?

03-04-2012, 08:38 AM   #3
nilshomer (Nils Homer), Boston, MA, USA

Copied here from http://sourceforge.net/mailarchive/m...sg_id=28915492

I am working on benchmarking the samtools commands today, and will post back.

Quote:
A 4GB SAM file was used on a dual-hex-core (12 cores) computer. I
benchmarked compression then decompression, making sure the resulting files
were the same. Decompression seems to be limited by IO.

Name            Compression time    Decompression time
bgzip                     485.64                 39.93
pbgzip -n 1               481.57                 40.02
pbgzip -n 2               240.85                 41.03
pbgzip -n 4               122.05                 41.79
pbgzip -n 8                63.17                 41.17
pbgzip -n 12               43.12                 41.65
pbgzip -n 16               39.59                 41.48
pbgzip -n 20               37.03                 42.41
pbgzip -n 24               34.90                 47.24
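
For anyone who wants to collect numbers like these on their own hardware, a rough harness along the following lines works. It is only a sketch: the file name is a placeholder, and it assumes pbgzip mimics bgzip's command-line behaviour (compressing FILE to FILE.gz and restoring it with -d), so check pbgzip's usage message before trusting the assumed flags.

Code:
# Rough timing harness (a sketch, not the original benchmark script).
# Assumption: pbgzip behaves like bgzip, i.e. "pbgzip file" writes file.gz
# and "pbgzip -d file.gz" restores file.
import hashlib
import shutil
import subprocess
import time


def md5sum(path: str) -> str:
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def timed(cmd: list) -> float:
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start


SAM = "test.sam"  # placeholder for the benchmark SAM file
original = md5sum(SAM)

for threads in (1, 2, 4, 8, 12):
    shutil.copyfile(SAM, "copy.sam")
    compress = timed(["pbgzip", "-n", str(threads), "copy.sam"])
    decompress = timed(["pbgzip", "-d", "-n", str(threads), "copy.sam.gz"])
    ok = "identical" if md5sum("copy.sam") == original else "MISMATCH"
    print(f"-n {threads:>2}: compress {compress:7.2f}s, "
          f"decompress {decompress:7.2f}s, result {ok}")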

03-04-2012, 09:17 PM   #4
nilshomer (Nils Homer), Boston, MA, USA

Updated numbers for a few commands:

Command         samtools   psamtools
view BAM           29.45       19.20
view -b BAM       207.51       19.36
view -S SAM        44.89       44.43
view -Sb SAM      222.64       32.62
sort              206.32       25.17
mpileup          6574.20     7252.08
depth              17.64        7.47
index              11.96        1.93
flagstat           11.73        1.73
calmd -b          209.25       22.86
rmdup -s          154.88       22.08
reheader            0.76        0.74
cat                 1.54        1.37

03-05-2012, 09:53 AM   #5
Richard Finney, Bethesda

Looks good!

Question: why is mpileup slower?

03-05-2012, 12:11 PM   #6
nilshomer (Nils Homer), Boston, MA, USA

Working on it. I am doing this in my free time, so having one command perform worse isn't that bad so far.

03-05-2012, 01:03 PM   #7
krobison, Boston area

Really cool!!

Do you have benchmarks for retrieving specific reads for a region? For mpileup of a specific region or a list of targets?

Any idea whether this will work with the Bio::DB::Sam perl module (which must be linked against samtools)?

What are the prospects for merging this with the main samtools development?

03-05-2012, 01:08 PM   #8
nilshomer (Nils Homer), Boston, MA, USA

The seeks are just as fast, so there is no speedup or slowdown on seeking, but there should be a speedup when reading from that point on, assuming there are at least a minimal number of reads in the region (otherwise there is no work to be done). For mpileup, it does not process the regions in parallel, if that is what you were implying.

I posted to the samtools list without a response, so I have no idea whether this will be included (of course it needs more testing first). It is generally difficult to get things included there. I have more hope for Picard.

Pysam and the SAM perl module should not notice any difference in the API, though there is no good mechanism yet for setting the number of threads to use (it currently autodetects the number of cores).
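
In the meantime, per-region parallelism can be had externally by fanning out one mpileup process per region; a minimal sketch (the region list, reference, and BAM file names are placeholders, and the BAM must be indexed for -r to work):

Code:
# Sketch of a per-region mpileup fan-out (a workaround, not part of the
# patch). Each worker thread just waits on its own samtools process, so a
# thread pool is enough; the region list and file names are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

BAM = "sample.bam"      # must be indexed (samtools index) for -r to work
REF = "ref.fa"          # must be faidx-indexed for mpileup -f
REGIONS = ["chr1", "chr2", "chr3"]


def pileup_region(region: str) -> str:
    out_path = f"{region}.pileup"
    with open(out_path, "w") as out:
        subprocess.run(
            ["samtools", "mpileup", "-f", REF, "-r", region, BAM],
            stdout=out, check=True)
    return out_path


with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(pileup_region, REGIONS))

print("wrote:", ", ".join(outputs))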

03-05-2012, 01:12 PM   #9
krobison, Boston area

I also see the sort command now gives an option to pick an algorithm. What a blast from the past!

Any heuristics on which algorithm might perform better in which setting?

And why no bubble sort option :-)

03-05-2012, 02:18 PM   #10
nilshomer (Nils Homer), Boston, MA, USA

Quote:
Originally Posted by krobison
I also see the sort command now gives an option to pick an algorithm. What a blast from the past!

Any heuristics on which algorithm might perform better in which setting?

And why no bubble sort option :-)

I just used Heng's ksort.h library. I like introsort, but mergesort is the default in the original samtools.

I have also been toying with a multi-threaded sort, which sort of works in the new version, except that I have not taken the time to do a proper multi-way merge (one implementation requires calculating evenly spaced pivots). Maybe wait a few more weekends.
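
For the curious, the overall shape is the usual one: sort chunks in parallel workers, then merge the sorted runs. A rough sketch of that shape (this is not the samtools code, and the serial heap-based k-way merge here stands in for the parallel merge that evenly spaced pivots would allow):

Code:
# Sketch of the sort-chunks-then-merge shape (not the samtools implementation).
# The final merge is a serial k-way heap merge; computing evenly spaced
# pivots is what would let the merge itself run in parallel too.
import heapq
import random
from concurrent.futures import ProcessPoolExecutor


def sort_chunk(chunk: list) -> list:
    return sorted(chunk)


def parallel_sort(records: list, workers: int = 4) -> list:
    chunk_size = max(1, len(records) // workers)
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sort_chunk, chunks))
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    data = [random.randint(0, 1_000_000) for _ in range(100_000)]
    assert parallel_sort(data) == sorted(data)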

03-05-2012, 04:45 PM   #11
adaptivegenome, US

Quote:
Originally Posted by nilshomer
I have been working on speeding up reading and writing within SAMtools by creating a multi-threaded block gzip reader and writer. [...]

So are you saying you made a parallelized version of BZIP2? We have also been playing around with this. We parallelized the compression and decompression steps in the read/write functions of samtools for a local realignment tool we built.

I would love to learn more about what you are doing, as I would hate to duplicate anything you are already planning to do!

03-06-2012, 06:25 AM   #12
colindaven, Germany

As far as parallel (g)zip goes, pigz works wonders: http://zlib.net/pigz/

03-06-2012, 06:42 AM   #13
adaptivegenome, US

PIGZ is very, very fast; however, it produces files that are much larger than BZIP2's. Is this your experience as well?

It would be really nice to be able to simply parallelize BZIP2. We have tried to do this a little bit but certainly don't have a completed product yet.
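
A quick way to gauge the size gap on your own data is to compress the same buffer with both libraries; a toy sketch (the file name is a placeholder, and only a slice of the file is read to keep memory bounded):

Code:
# Toy comparison of gzip (zlib) vs bzip2 compressed sizes on the same data.
# Point it at a real SAM file to see the gap; the file name is a placeholder.
import bz2
import zlib

with open("test.sam", "rb") as fh:
    data = fh.read(64 * 1024 * 1024)  # sample the first 64 MB

gz_size = len(zlib.compress(data, 6))   # deflate, as used by gzip
bz_size = len(bz2.compress(data, 9))    # bzip2 at maximum block size

print(f"input {len(data)} bytes, gzip {gz_size} bytes, bzip2 {bz_size} bytes")
print(f"bzip2 / gzip size ratio: {bz_size / gz_size:.2f}")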

03-06-2012, 08:49 AM   #14
lh3, Boston

Firstly, I greatly appreciate and strongly support Nils' effort in multithreading samtools. The change is likely to be merged into samtools.

Re sorting algorithm: samtools sort does a stable sort (i.e. it preserves the relative order of records that have the same coordinate). In some rare, non-typical use cases this feature is useful. Merge sort is stable; introsort is not.
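
To make the stability point concrete, a toy example with made-up read names and positions:

Code:
# A stable sort keeps reads that share a coordinate in their original
# relative order; an unstable sort such as introsort is free to reorder them.
records = [("readA", 100), ("readB", 100), ("readC", 50), ("readD", 100)]

by_position = sorted(records, key=lambda rec: rec[1])  # Python's sort is stable
print(by_position)
# [('readC', 50), ('readA', 100), ('readB', 100), ('readD', 100)]
# readA, readB and readD keep their input order at position 100.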

Re pigz: someone told me on Biostar that pigz does not scale well to many cores. If this is true (I have not tried it), it must be because the gzip format has long-range dependencies. bzip2 and bgzip are much easier to parallelize and probably more scalable. In addition, bzip2 has a parallel version, pbzip2, which the same person told me scales very well with the number of CPU cores.

Re bzip2: I have argued a couple of times here (years ago), and also on the samtools list, that the key reason samtools uses gzip instead of bzip2 is that gzip is 5-10X faster on decompression. With bzip2, most samtools commands would be 2-10 times slower. I think for huge data sets that need to be read frequently, gzip is always preferred over bzip2.

Last edited by lh3; 03-06-2012 at 08:53 AM.

03-06-2012, 08:57 AM   #15
adaptivegenome, US

I think it is worth figuring out the best way to compress/decompress. Our nodes have 64 cores, so I will do some tests and see how BZIP2 and GZIP scale. I'll post what I find in this thread.

In the meantime, a quick internet search turned up this:
http://nerdbynature.de/s9y/?251


Quote:
Originally Posted by lh3
Firstly, I greatly appreciate and strongly support Nils' effort in multithreading samtools. [...]

03-06-2012, 10:33 AM   #16
lh3, Boston

I guess that benchmark is not typical; it is rare to find a file that can be compressed from 15GB down to 600MB. Nonetheless, it does indicate that pigz is not scalable. Nils' pbgzip should be much better. Also, if you want to do a comparison, there is another, more modern variant of bzip2 that is both much faster and achieves a better compression ratio. I forget its name; James Bonfield would know.

03-06-2012, 10:39 AM   #17
adaptivegenome, US

Quote:
Originally Posted by lh3
I guess that benchmark is not typical. [...]

Heng,

You are right. I will give this a try using a SAM file. I wonder if the 15GB file was made by duplicating some content over and over. This would explain the compression.

03-06-2012, 03:09 PM   #18
adaptivegenome, US

Guys,

Below are compression times for a 6.8GB SAM file, tested on Ubuntu 11.10 with the latest versions of all software. We took the latest source for each tool and compiled it on our node, which has 128GB of RAM and four AMD Opteron(TM) processors (64 cores in total).


Cores   pigz      pbzip2    gzip      bzip2
1       xx        xx        21m18s    19m16s
2       10m32s    9m52s     xx        xx
16      1m25s     1m36s     xx        xx
64      1m6s      0m34s     xx        xx

The pbzip2 file was 1.7GB and the pigz file was 2GB, so not as big a difference as I thought.

03-06-2012, 04:21 PM   #19
nilshomer (Nils Homer), Boston, MA, USA

It should not be too hard to make a bz2 BAM file using the bzip2 library's BZ2_bzBuffToBuffCompress and BZ2_bzBuffToBuffDecompress. Of course, there are better methods than just using those two functions (see pbzip2).

I am not sure how necessary all the signalling is in the current implementation, but debugging race conditions is a pain.
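
As a rough illustration of the block-wise idea in a higher-level language (Python's bz2 module wraps the same libbz2 that exposes BZ2_bzBuffToBuffCompress/Decompress): compress each block independently and length-prefix it so a reader can step from block to block. A real bz2-backed BAM container would also need BGZF-style virtual-offset bookkeeping for random access, which this sketch leaves out.

Code:
# Block-wise bzip2 compression sketch. Each block is compressed on its own
# and written with a 4-byte length prefix so it can be located (and, in
# principle, decompressed in parallel) without touching the other blocks.
import bz2

BLOCK_SIZE = 64 * 1024


def write_bz2_blocks(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for block in iter(lambda: src.read(BLOCK_SIZE), b""):
            compressed = bz2.compress(block, 9)
            dst.write(len(compressed).to_bytes(4, "little"))
            dst.write(compressed)


def read_bz2_blocks(path: str) -> bytes:
    data = bytearray()
    with open(path, "rb") as fh:
        while header := fh.read(4):
            size = int.from_bytes(header, "little")
            data += bz2.decompress(fh.read(size))
    return bytes(data)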

03-06-2012, 04:26 PM   #20
adaptivegenome, US

But is it worth it? BZIP2 wins on parallelization with lots of cores, but is this useful? I thought samtools reads and writes in small blocks that are separately compressed and decompressed, so it seems you can just parallelize that, right? Do you really benefit from using BZIP2 over GZIP?