SEQanswers

Old 03-06-2012, 04:59 PM   #21
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

BZIP2 compresses in blocks, so it actually fits the BAM model quite well. The default block size in BAM is 65536 bytes, so upping that to 100K wouldn't be too hard. If it saves 30%, it could be an alternative to CRAM (i.e. "get rid of all things").
Old 03-21-2012, 08:58 PM   #22
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by genericforms View Post
But is it worth it? BZIP wins on parallelization with lots of cores, but is this useful? I thought samtools reads and writes in small blocks that are compressed and decompressed separately, so it seems you can just parallelize that, right? Do you really benefit from using BZIP over GZIP?
Well the best way to answer the question is to do it (on 24 threads).

command           | compress (s) | decompress (s) | size (MB)
pbgzip -t 0 (gz)  | 17.29        | 21.24          | 698
pbgzip -t 1 (bz2) | 18.43        | 21.13          | 804
pbzip2            | 21.36        | 21.23          | 640


Since BAM uses such small block sizes (63488 bytes), the BZ2 compression is not as good as when using larger block sizes, like in pbzip2. While pbzip2 file size is 80% of pbgzip (gz), the file size of pbgzip (bz2) is a respectable 86%. Compression and decompression times were not too different either.
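The block-size effect itself is easy to demonstrate with Python's zlib and bz2 modules (a rough sketch on synthetic SAM-like text, not a reproduction of the benchmark above; the exact sizes depend entirely on the input data and the chosen compression levels):

```python
import bz2
import zlib

# Synthetic, highly repetitive SAM-like lines as a stand-in for real reads.
data = (b"r0001\t0\tchr1\t10468\t60\t100M\t*\t0\t0\t" + b"ACGT" * 25 + b"\n") * 20000

def chunked_size(buf, compress, block=63488):
    # Compress each block independently, as BGZF-style formats must,
    # so the compressor cannot share history across block boundaries.
    return sum(len(compress(buf[i:i + block])) for i in range(0, len(buf), block))

whole_gz  = len(zlib.compress(data, 6))
whole_bz2 = len(bz2.compress(data, 9))
block_gz  = chunked_size(data, lambda b: zlib.compress(b, 6))
block_bz2 = chunked_size(data, lambda b: bz2.compress(b, 9))

print("gzip  whole vs 62K-blocked:", whole_gz, block_gz)
print("bzip2 whole vs 62K-blocked:", whole_bz2, block_bz2)
```

Independent small blocks always cost something relative to one big stream (per-block headers plus re-learning the data's redundancy in every block), and bzip2 is hit harder because its native block is 100K-900K rather than gzip's 32 KB window.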

Last edited by nilshomer; 03-21-2012 at 09:01 PM.
Old 03-21-2012, 11:53 PM   #23
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156

I guess most interested people are on the samtools-devel/help lists; however, for those who aren't, Heng just announced a multi-threaded samtools sort/merge/view:

http://sourceforge.net/mailarchive/m...sg_id=29019265

It would be interesting to merge his approach with Nils's, if feasible... or did you already re-introduce the multi-threaded sort, Nils?
Old 03-22-2012, 07:24 PM   #24
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by arvid View Post
I guess most interested people are on the samtools-devel/help lists; however, for those who aren't, Heng just announced a multi-threaded samtools sort/merge/view:

http://sourceforge.net/mailarchive/m...sg_id=29019265

It would be interesting to merge his approach with Nils's, if feasible... or did you already re-introduce the multi-threaded sort, Nils?
The way Heng implemented it was to multi-thread the in-memory sort and, when merging multiple BAM files, to multi-thread the compression. The new bgzf.c also supports multi-threaded writing, which is used in the merging above; the multi-threaded writing is not used elsewhere.

I think the part I would integrate is Heng's multi-threaded sort routine, while the rest is already there.
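The writer side of this scheme, handing each block to a worker and emitting compressed blocks in their original order, can be sketched in Python (a toy illustration of the idea, not the actual bgzf.c code; real BGZF additionally stores a BSIZE field in each gzip header):

```python
import gzip
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 63488  # BGZF-sized uncompressed blocks

def compress_block(chunk):
    # Each block becomes an independent gzip member, so blocks can be
    # compressed on worker threads and still concatenate into a valid stream.
    co = zlib.compressobj(6, zlib.DEFLATED, 31)  # wbits=31 -> gzip framing
    return co.compress(chunk) + co.flush()

def parallel_compress(data, threads=4):
    chunks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so blocks are written back in order
        # even if workers finish out of order.
        return b"".join(pool.map(compress_block, chunks))

data = b"ACGTNACGTN" * 200_000  # ~2 MB of toy sequence data
compressed = parallel_compress(data)
assert gzip.decompress(compressed) == data  # concatenated members round-trip
```

Because every block is self-contained, the reader can likewise hand blocks to decompression threads independently.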
Old 03-23-2012, 12:09 AM   #25
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156

Quote:
Originally Posted by nilshomer View Post
The way Heng implemented it was to multi-thread the in-memory sort and, when merging multiple BAM files, to multi-thread the compression. The new bgzf.c also supports multi-threaded writing, which is used in the merging above; the multi-threaded writing is not used elsewhere.

I think the part I would integrate is Heng's multi-threaded sort routine, while the rest is already there.
Great, keep up the good work!
Old 04-11-2012, 02:09 AM   #26
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,542

Quote:
Originally Posted by nilshomer View Post
BZIP2 compresses in blocks, so it actually fits the BAM model quite well. The default block size in BAM is 65536 bytes, so upping that to 100K wouldn't be too hard. If it saves 30%, it could be an alternative to CRAM (i.e. "get rid of all things").
You can use 100K, 200K, ..., 900K blocks in BZIP2 - the larger the block size, the better the compression rate, of course. This would require rejigging the BGZF virtual offset... the current 64-bit trick won't work.

You'd also need to solve the non-byte aligned block issue, perhaps extending or working around the C library's API: http://blastedbio.blogspot.co.uk/201...-to-bzip2.html

This is assuming that using multiple cores would overcome the inherently higher CPU load of BZIP2 vs GZIP - which sounds viable in principle.
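The "64-bit trick" is BGZF's virtual file offset: the compressed offset of a block's start in the upper 48 bits and the uncompressed offset within the block in the lower 16 bits. A small sketch shows both the packing and why blocks larger than 64 KiB (e.g. bzip2's 100K-900K blocks) break it:

```python
def make_voffset(coffset, uoffset):
    # BGZF virtual offset: (compressed offset of block start) << 16,
    # OR'd with the uncompressed offset inside that block.
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    return (coffset << 16) | uoffset

def split_voffset(voffset):
    return voffset >> 16, voffset & 0xFFFF

v = make_voffset(123456789, 4242)
assert split_voffset(v) == (123456789, 4242)

# A 900K bzip2 block can place a record ~900,000 bytes into the block,
# which the 16-bit within-block field cannot represent:
try:
    make_voffset(0, 900_000)
except ValueError:
    print("900K blocks overflow the 16-bit within-block offset")
```

So larger blocks would need either a wider within-block field or a different index encoding, which is the rejigging maubp refers to.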
Old 04-11-2012, 12:20 PM   #27
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Quote:
Originally Posted by maubp View Post
You can use 100K, 200K, ..., 900K blocks in BZIP2 - the larger the block size, the better the compression rate, of course. This would require rejigging the BGZF virtual offset... the current 64-bit trick won't work.

You'd also need to solve the non-byte aligned block issue, perhaps extending or working around the C library's API: http://blastedbio.blogspot.co.uk/201...-to-bzip2.html

This is assuming that using multiple cores would overcome the inherently higher CPU load of BZIP2 vs GZIP - which sounds viable in principle.
I wonder how important it is to speed up compression this much. I think for really big files, I/O probably becomes less of a problem? We have seen this at least for ~20-50 GB human BAMs, whereas for 1-3 GB fly BAMs, I/O is more of a bottleneck for us.
Old 04-11-2012, 12:28 PM   #28
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,542

That's the bonus of using compressed files - they are faster to read off disk (as long as the CPU overhead doesn't cost you too much), i.e. using more compression can save I/O.
Old 04-11-2012, 12:39 PM   #29
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Quote:
Originally Posted by maubp View Post
That's the bonus of using compressed files - they are faster to read off disk (as long as the CPU overhead doesn't cost you too much), i.e. using more compression can save I/O.
Yes, I agree. I was just suggesting that perhaps I/O might not be the limiting factor for really big files, or at least that it might not be worth spending too much time trying to speed up compression beyond simply multithreading the existing block method...
Old 04-12-2012, 03:03 AM   #30
StaciaWyman
Junior Member
 
Location: Cambridge, MA

Join Date: Jun 2010
Posts: 1

Quote:
Originally Posted by nilshomer View Post
I have been working on speeding up reading and writing within SAMtools by creating a multi-threaded block-gzip reader and writer. I have an alpha version working. I would appreciate some feedback and testing; just don't use it for production systems yet. Thank you!

http://github.com/nh13/samtools/tree/pbgzip

NB: I would be happy to describe the implementation, and to collaborate to get this into Picard too.
Good morning--I get a page-not-found error when I go to the above link--is there an updated one? Thanks!
Stacia
Old 04-12-2012, 04:51 PM   #31
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by StaciaWyman View Post
Good morning--I get a page-not-found error when I go to the above link--is there an updated one? Thanks!
Stacia
Check out the different branches available, as well as the commits:
https://github.com/nh13/samtools

If you are brave, I have also been working on getting this into Picard:
https://github.com/nh13/picard

The Picard developers are more receptive to a patch than the samtools developers.
Old 04-12-2012, 05:33 PM   #32
kenietz
Member
 
Location: Singapore

Join Date: Nov 2011
Posts: 85

Hi Nils, why don't you use pigz/unpigz, which are parallelized gzip/gunzip? They take all the same arguments as normal gzip, and with -p one can specify the number of threads.
Old 04-12-2012, 06:26 PM   #33
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Do you mean I should use the pigz API? Of course I could compress a SAM file with pigz, but the advantage of the BAM file (which is block-gzip compressed) is the ability to index the file and then do random retrieval based on genomic coordinates.

Can you give an example of what you mean?
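The indexing advantage can be shown with a toy version of block-compressed random access (a hypothetical sketch, not the actual BAM/BAI index format): compress records in independent gzip members, remember each member's byte offset, and decompress only the one block that holds the wanted record.

```python
import gzip

records = [f"read{i}\tchr1\t{100 + i}\n".encode() for i in range(1000)]

# Compress 100 records per block; record where each block starts in the file.
blob, offsets = b"", []
for start in range(0, len(records), 100):
    offsets.append(len(blob))
    blob += gzip.compress(b"".join(records[start:start + 100]))
offsets.append(len(blob))  # sentinel: end of file

# Random retrieval: jump straight to the block holding record 537 and
# decompress only that block, not the whole file.
bi = 537 // 100
block = gzip.decompress(blob[offsets[bi]:offsets[bi + 1]])
assert block.splitlines(keepends=True)[537 % 100] == b"read537\tchr1\t637\n"
```

A single whole-file pigz stream offers no such entry points, which is why BAM keeps its 64 KB gzip members even though larger blocks would compress better.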
Old 05-02-2012, 03:48 AM   #34
ersgupta
Member
 
Location: India

Join Date: Jun 2011
Posts: 26

Any update on the mpileup?
Old 05-04-2012, 12:04 PM   #35
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by ersgupta View Post
Any update on the mpileup?
No, the individual tools were not multi-threaded, just the reading/writing of the SAM/BAM files, which can be a bottleneck.
Old 06-04-2012, 09:13 PM   #36
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Guys, we have a working version of a faster mergesort for BAMs:

https://github.com/adaptivegenome/openge/downloads

Source is also there, but if you want to test speed you can grab the binary to make things easy. We implemented SAM-to-BAM conversion, mergesort, mark duplicates, and some other routines.

Would love feedback on whether it is faster or not than what others are doing...
Old 06-04-2012, 09:34 PM   #37
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535

For a non-computer person like myself: should I update beyond samtools 0.1.18 to the new multi-threaded versions? My confusion mainly stems from not knowing whether they work and how to download them. Is there a web page anywhere that documents the changes being made, says when they are considered working and safe to use, and says where to download them from?
Old 06-13-2012, 02:51 PM   #38
thetaomega3
Junior Member
 
Location: California, US

Join Date: Jun 2012
Posts: 1

Hi Nils,

I gave it a try (0.1.18-r572) and had mixed results.

Success: going from SAM to BAM (samtools import) on a 102 GB SAM file results in a ~10x speedup on a 24-core (HT) machine with 192 GB RAM, and the output BAM file (27 GB) matches one generated by the general non-MT release (0.1.18 r982:295) (verified with diff).

Failure: sort fails with the error "failed to create threads" when it attempts to merge all the intermediate sorted BAM files. Running samtools merge on the same set of files also fails with the same error. I tried -n 6, 12 and 24 with no success. The general non-MT release completes the sort and merge successfully.

Suggestions?
Old 07-23-2012, 12:33 PM   #39
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

https://github.com/nh13/samtools ?

How is this project going?
Old 07-23-2012, 04:52 PM   #40
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by Richard Finney View Post
https://github.com/nh13/samtools ?

How is this project going?
I haven't taken a look at the sort problem yet, and not much else has happened.
Tags
sam, samtools
