SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
inconsistent between samtools mpileup output and bam tview Rainbird Bioinformatics 0 05-31-2013 11:08 PM
bam file statistics adrian Bioinformatics 3 10-09-2012 02:02 AM
convert bam file to bedgraph format with inconsistent results zhiqiang Bioinformatics 0 07-27-2012 12:48 AM
How to get statistics data from sam/bam file generated by trinity packages taoxiang180 Bioinformatics 1 07-05-2012 01:46 AM
BAM with inconsistent flag and mapq dariober Bioinformatics 2 07-18-2011 12:27 AM

Reply
 
Thread Tools
Old 11-08-2013, 07:42 AM   #1
capricy
Senior Member
 
Location: 63130

Join Date: Apr 2012
Posts: 125
Default inconsistent .bam file statistics

I used tophat2 to map my RNAseq reads to the reference genome. When I tried to get the statistics for the resultant mapped bam file, I found the inconsistency between the different tools.

'samstat' gave me a total number of 33.7M aligned including all MAPQ scores,

however 'samtools flagstat' gave me 37.3M aligned reads.

Here I see over 10% off. Could anybody explain why and what should I follow?
capricy is offline   Reply With Quote
Old 11-08-2013, 08:15 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I would guess that samstat doesn't count secondary alignments (samtools flagstat does).
dpryan is offline   Reply With Quote
Old 11-08-2013, 09:26 AM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

I noticed something like that. I had the same fastq aligned with top hat and bwa. samtools flagstat on the bwa file gave me exactly as many lines as were in the fastq (I counted the fastq to be sure). samtools flagstat on the tophat file gave me more mapped reads than there were fastq entries!

I think the answer is that bowtie is reporting more than one line for some reads, and bwa isn't. So maybe samstat is accounting for that, and flagstat isn't.
swbarnes2 is offline   Reply With Quote
Old 11-08-2013, 10:47 AM   #4
capricy
Senior Member
 
Location: 63130

Join Date: Apr 2012
Posts: 125
Default

My situation is: the same aligned bam file and different stat results.
capricy is offline   Reply With Quote
Old 12-04-2013, 11:43 PM   #5
Brajbio
Member
 
Location: India

Join Date: Jun 2010
Posts: 20
Default flagstat and rmdup issues

Quote:
Originally Posted by swbarnes2 View Post
I noticed something like that. I had the same fastq aligned with top hat and bwa. samtools flagstat on the bwa file gave me exactly as many lines as were in the fastq (I counted the fastq to be sure). samtools flagstat on the tophat file gave me more mapped reads than there were fastq entries!

I think the answer is that bowtie is reporting more than one line for some reads, and bwa isn't. So maybe samstat is accounting for that, and flagstat isn't.
I too have observed the similar issue when I run 'flagstat' on the alignment files (.bam) resulted from the bowtie1 (with -I as 100 and -X as 300) and Tophat 2 (with -r as 200) respectively. My PE Illumina reads are 50 bases which I had trimmed for better quality and now the reads length are of 37 bases.

Here is the flagstat output on sample reads aligned using bowtie 1
---------------------------------------------------------------

73321912 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
49773724 + 0 mapped (67.88%:nan%)
73321912 + 0 paired in sequencing
36660956 + 0 read1
36660956 + 0 read2
49773724 + 0 properly paired (67.88%:nan%)
49773724 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


Here is the flagstat output on the same sample reads aligned using Tophat2
------------------------------------------------------------------------

82281727 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
82281727 + 0 mapped (100.00%:nan%)
82281727 + 0 paired in sequencing
41392697 + 0 read1
40889030 + 0 read2
51432316 + 0 properly paired (62.51%:nan%)
80601394 + 0 with itself and mate mapped
1680333 + 0 singletons (2.04%:nan%)
7719052 + 0 with mate mapped to a different chr
591122 + 0 with mate mapped to a different chr (mapQ>=5)



Observations:

1. Difference in total input reads

2. 'rmdup' command is working well on the sorted bowtie1 generated bam whereas on tophat2 generated bam (already coordinate sorted) I am getting error/warnings like..

-------------
[bam_rmdup_core] inconsistent BAM file for pair 'HWI-1KL155:43:H088BADXX:2:2108:10331:62994'. Continue anyway.
[bam_rmdup_core] inconsistent BAM file for pair 'HWI-1KL155:43:H088BADXX:1:2102:8980:94393'. Continue anyway.

-----------------------------



Could someone share their experience or suggest whats going wrong and how to handle this disparity.

Regards

Last edited by Brajbio; 12-05-2013 at 12:04 AM.
Brajbio is offline   Reply With Quote
Old 12-04-2013, 11:55 PM   #6
sBeier
Member
 
Location: Germany

Join Date: Jan 2013
Posts: 42
Default

The difference in total input reads should come from the fact that you probably used accepted_hits.bam as TopHat output file. It does not include the unmapped reads, while the Bowtie output does. You can see that the Tophat results have 100% mapped reads.

You might have run bowtie with the option that only accepts concordant pairs. This means, there will be no reads mapped as singletons or with the mate mapped to a different chromosome, because then the mapped read would count as unmapped, as it is not part of a concordant pair.

This also already explained the difference in mapping percentage.

Considering that you mapped RNAseq reads, your problem with rmdup is more that duplicate removal does not do good for RNAseq reads. Okay, I think this is an ongoing debate. But it will definitely mess up your quantitative results. But there is also another thread about this problem on here that might help you. http://seqanswers.com/forums/showthread.php?t=19856
sBeier is offline   Reply With Quote
Old 12-05-2013, 12:38 AM   #7
Brajbio
Member
 
Location: India

Join Date: Jun 2010
Posts: 20
Default

Thanks sBeier, for the response.

No, I have not applied the options to accept concordant reads in bowtie1 run (is this option available in bowtie-0.12.3? Not sure if it is taken care by default on using -X option), but yes I have used only the "protein coding genes" isoforms sequences as my reference while alignment with bowtie 1 whereas with the tophat2 I have used the hg19 complete seq with gtf including all 'NM' and 'NR' entries.

Difference in the references (with and without NR sequences) might be one reason I can think of and as you said unmapped reads counts are included in the bowtie1 alignment stats could led to difference in mapping %. I admit

But still the input reads files used is same in both the cases but I can see the difference in their count too.


73321912 + 0 in total (QC-passed reads + QC-failed reads)
-------------------------------------
82281727 + 0 in total (QC-passed reads + QC-failed reads)


Regarding rmdup don't see much convincing answers in the blogs I think i should explore the code here http://samtools.sourcearchive.com/do...8c-source.html
Brajbio is offline   Reply With Quote
Old 08-29-2014, 11:41 PM   #8
enelkinsan
Member
 
Location: NZ

Join Date: May 2012
Posts: 14
Default

Quote:
Originally Posted by sBeier View Post
The difference in total input reads should come from the fact that you probably used accepted_hits.bam as TopHat output file. It does not include the unmapped reads, while the Bowtie output does. You can see that the Tophat results have 100% mapped reads.

http://seqanswers.com/forums/showthread.php?t=19856
That's all right, but how the number of mapped reads (bowtie2, accepted_hits.bam) can be bigger than number of reads in the original fastq file? I got this result once and I'm slightly confused how mapped reads are accounted for in bowtie2 algorithm.
enelkinsan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:12 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO