SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BWA sam and Samtools sam->bam conversion problem maasha Bioinformatics 6 06-05-2013 08:39 AM
BAM -> SAM conversion error jwhite Bioinformatics 16 02-13-2013 06:32 AM
samtools: parse error in SAM to BAM conversion chrisW Bioinformatics 12 01-14-2013 07:16 PM
Issue with Sam-Bam conversion samtools - how to remove last line of Sam file? TabeaK Bioinformatics 3 11-19-2012 11:05 AM
tophat - cannot view sam output in samtools tview lmilne Bioinformatics 2 12-01-2009 01:13 PM

Reply
 
Thread Tools
Old 08-21-2014, 02:10 PM   #1
wdemos
Member
 
Location: Wisconsin

Join Date: Jun 2012
Posts: 31
Default Threaded Sam to Bam conversion with Samtools View

I am attempting to speed up the sam to bam conversion of a whole genome (paired end) alignment with Samtools0-1.19. Alignment was done with bwa-0.7.7 mem algorithm
Commands:
samtools view -bS -o .bam pe.sam (default)
samtools view -bS -@ 10 -m 2G -o .bam pe.sam (threaded)

Comparing the output .bam files there is a 0.4G difference in file size.
I ran samtools flagstat on both bam files.
Differences:
6,026,490 QC passed reads
6,026,490 paired in sequencing
779,134 read 1
5,247,356 read 2
all other metrics are identical

Can anyone explain why threading would give less reads on the same input .sam file? I assume it may have to do with the merging of thread data?

Is there a way to correct this issue without losing the speed increase provided by threading? (Currently without threading conversion takes 6 hours. With threading Conversion takes 1 hour 15 minutes)
wdemos is offline   Reply With Quote
Old 08-21-2014, 02:57 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

That really shouldn't happen. It makes sense that you could get different sized files, but they should contain identical number of reads. Does the same thing happen if you use the new version?
dpryan is offline   Reply With Quote
Old 08-21-2014, 03:03 PM   #3
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

MAYBE ...

The gzip compression used by samtools binary BAM version will be using different input streams depending on threading or non-threading.

The non-threaded version will discover a different optimization that the threaded version.

Can both BAMS versions be re-transformed back to identical SAM files?

dpryan is right about you should have the same read counts.

Last edited by Richard Finney; 08-21-2014 at 03:06 PM.
Richard Finney is offline   Reply With Quote
Old 08-22-2014, 12:01 PM   #4
wdemos
Member
 
Location: Wisconsin

Join Date: Jun 2012
Posts: 31
Default

Thanks. I am currently testing a comparison run with the latest version of samtools. Hope to have an update soon.
wdemos is offline   Reply With Quote
Old 08-26-2014, 12:59 PM   #5
wdemos
Member
 
Location: Wisconsin

Join Date: Jun 2012
Posts: 31
Default

I completed the same test with the latest version of samtools. Again there is a 0.4G difference in file size:
10 Threads 2G:
1367501828 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1367501828 + 0 mapped (100.00%:nan%)
1367501828 + 0 paired in sequencing
685896751 + 0 read1
681605077 + 0 read2
1343716094 + 0 properly paired (98.26%:nan%)
1362706162 + 0 with itself and mate mapped
4795666 + 0 singletons (0.35%:nan%)
11194510 + 0 with mate mapped to a different chr
5019248 + 0 with mate mapped to a different chr (mapQ>=5)

Single Thread, default settings:
1373528318 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1367501828 + 0 mapped (99.56%:nan%)
1373528318 + 0 paired in sequencing
686675885 + 0 read1
686852433 + 0 read2
1343716094 + 0 properly paired (97.83%:nan%)
1362706162 + 0 with itself and mate mapped
4795666 + 0 singletons (0.35%:nan%)
11194510 + 0 with mate mapped to a different chr
5019248 + 0 with mate mapped to a different chr (mapQ>=5)

Differences:
6,026,490 QC passed reads
6,026,490 paired in sequencing
779,134 read 1
5,247,356 read 2

Any ideas? Is it just not possible to use multiple threads for this without losing data?
wdemos is offline   Reply With Quote
Old 08-26-2014, 01:03 PM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

This is troubling. Can you post the exact commands you used?

Would it be possible for you to gzip the original SAM file and post is somewhere (google drive, dropbox, etc.) so we can track down exactly why this is happening?
dpryan is offline   Reply With Quote
Old 08-26-2014, 01:07 PM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I should note that it's interesting that the differences are due to unmapped reads.
dpryan is offline   Reply With Quote
Old 08-26-2014, 01:46 PM   #8
wdemos
Member
 
Location: Wisconsin

Join Date: Jun 2012
Posts: 31
Default

Below are the commands that I used to run the sort. I am unable to provide the sam file as it is human data that is not involved in a public study.

threaded
Run on our cluster, all threads on one server
samtools view -bS -@ 10 -m 2G -o {sample}.bam {sample}_pe.sam

Run on our cluster, same server as above
samtools view -bS -o {sample}.bam {sample}_pe.sam
wdemos is offline   Reply With Quote
Old 08-26-2014, 01:53 PM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Fair enough (can't argue with HIPAA), I'll have to see if I can reproduce this with a dataset locally (since I only keep BAM files around, I'll have to realign something). There's certainly no reason those commands should produce this behavior.
dpryan is offline   Reply With Quote
Old 08-26-2014, 01:56 PM   #10
wdemos
Member
 
Location: Wisconsin

Join Date: Jun 2012
Posts: 31
Default

thank you for looking into this. Greatly appreciated.
wdemos is offline   Reply With Quote
Old 08-26-2014, 02:05 PM   #11
wdemos
Member
 
Location: Wisconsin

Join Date: Jun 2012
Posts: 31
Default

If it is of any use the sam file was generated as follows:
bwa mem -M -t 8 -R ''$RG'' $bwaREF $read1 $read2 > ${sample}_pe.sam
wdemos is offline   Reply With Quote
Old 08-27-2014, 01:17 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You read my mind, thanks
dpryan is offline   Reply With Quote
Old 08-27-2014, 04:59 AM   #13
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I should have looked more carefully at the command you were using. You're mixing up the samtools view and samtools sort options. It looks like you're trying to allow each compression thread to have up to 2 gigs of memory (-m 2G), but that's not what the -m option does in samtools view. What that does is in samtools view is filter by query read length according to the CIGAR string. Since unmapped reads have no CIGAR string (they have an * there), you end up filtering them out. This is the cause of the difference you're seeing.

BTW, there's no way to tell the compressor threads how much memory to use when doing the conversion.
dpryan is offline   Reply With Quote
Old 08-27-2014, 08:32 AM   #14
wdemos
Member
 
Location: Wisconsin

Join Date: Jun 2012
Posts: 31
Default

Ah, thank you!
wdemos is offline   Reply With Quote
Reply

Tags
sam to bam conversion, samtools, thread

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:02 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO