Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • inconsistent .bam file statistics

    I used tophat2 to map my RNAseq reads to the reference genome. When I tried to get the statistics for the resultant mapped bam file, I found the inconsistency between the different tools.

    'samstat' gave me a total number of 33.7M aligned including all MAPQ scores,

    however 'samtools flagstat' gave me 37.3M aligned reads.

    Here I see over 10% off. Could anybody explain why and what should I follow?

  • #2
    I would guess that samstat doesn't count secondary alignments (samtools flagstat does).

    Comment


    • #3
      I noticed something like that. I had the same fastq aligned with top hat and bwa. samtools flagstat on the bwa file gave me exactly as many lines as were in the fastq (I counted the fastq to be sure). samtools flagstat on the tophat file gave me more mapped reads than there were fastq entries!

      I think the answer is that bowtie is reporting more than one line for some reads, and bwa isn't. So maybe samstat is accounting for that, and flagstat isn't.

      Comment


      • #4
        My situation is: the same aligned bam file and different stat results.

        Comment


        • #5
          flagstat and rmdup issues

          Originally posted by swbarnes2 View Post
          I noticed something like that. I had the same fastq aligned with top hat and bwa. samtools flagstat on the bwa file gave me exactly as many lines as were in the fastq (I counted the fastq to be sure). samtools flagstat on the tophat file gave me more mapped reads than there were fastq entries!

          I think the answer is that bowtie is reporting more than one line for some reads, and bwa isn't. So maybe samstat is accounting for that, and flagstat isn't.
          I too have observed the similar issue when I run 'flagstat' on the alignment files (.bam) resulted from the bowtie1 (with -I as 100 and -X as 300) and Tophat 2 (with -r as 200) respectively. My PE Illumina reads are 50 bases which I had trimmed for better quality and now the reads length are of 37 bases.

          Here is the flagstat output on sample reads aligned using bowtie 1
          ---------------------------------------------------------------

          73321912 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          49773724 + 0 mapped (67.88%:nan%)
          73321912 + 0 paired in sequencing
          36660956 + 0 read1
          36660956 + 0 read2
          49773724 + 0 properly paired (67.88%:nan%)
          49773724 + 0 with itself and mate mapped
          0 + 0 singletons (0.00%:nan%)
          0 + 0 with mate mapped to a different chr
          0 + 0 with mate mapped to a different chr (mapQ>=5)


          Here is the flagstat output on the same sample reads aligned using Tophat2
          ------------------------------------------------------------------------

          82281727 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          82281727 + 0 mapped (100.00%:nan%)
          82281727 + 0 paired in sequencing
          41392697 + 0 read1
          40889030 + 0 read2
          51432316 + 0 properly paired (62.51%:nan%)
          80601394 + 0 with itself and mate mapped
          1680333 + 0 singletons (2.04%:nan%)
          7719052 + 0 with mate mapped to a different chr
          591122 + 0 with mate mapped to a different chr (mapQ>=5)



          Observations:

          1. Difference in total input reads

          2. 'rmdup' command is working well on the sorted bowtie1 generated bam whereas on tophat2 generated bam (already coordinate sorted) I am getting error/warnings like..

          -------------
          [bam_rmdup_core] inconsistent BAM file for pair 'HWI-1KL155:43:H088BADXX:2:2108:10331:62994'. Continue anyway.
          [bam_rmdup_core] inconsistent BAM file for pair 'HWI-1KL155:43:H088BADXX:1:2102:8980:94393'. Continue anyway.

          -----------------------------



          Could someone share their experience or suggest whats going wrong and how to handle this disparity.

          Regards
          Last edited by Brajbio; 12-05-2013, 12:04 AM.

          Comment


          • #6
            The difference in total input reads should come from the fact that you probably used accepted_hits.bam as TopHat output file. It does not include the unmapped reads, while the Bowtie output does. You can see that the Tophat results have 100% mapped reads.

            You might have run bowtie with the option that only accepts concordant pairs. This means, there will be no reads mapped as singletons or with the mate mapped to a different chromosome, because then the mapped read would count as unmapped, as it is not part of a concordant pair.

            This also already explained the difference in mapping percentage.

            Considering that you mapped RNAseq reads, your problem with rmdup is more that duplicate removal does not do good for RNAseq reads. Okay, I think this is an ongoing debate. But it will definitely mess up your quantitative results. But there is also another thread about this problem on here that might help you. http://seqanswers.com/forums/showthread.php?t=19856

            Comment


            • #7
              Thanks sBeier, for the response.

              No, I have not applied the options to accept concordant reads in bowtie1 run (is this option available in bowtie-0.12.3? Not sure if it is taken care by default on using -X option), but yes I have used only the "protein coding genes" isoforms sequences as my reference while alignment with bowtie 1 whereas with the tophat2 I have used the hg19 complete seq with gtf including all 'NM' and 'NR' entries.

              Difference in the references (with and without NR sequences) might be one reason I can think of and as you said unmapped reads counts are included in the bowtie1 alignment stats could led to difference in mapping %. I admit

              But still the input reads files used is same in both the cases but I can see the difference in their count too.


              73321912 + 0 in total (QC-passed reads + QC-failed reads)
              -------------------------------------
              82281727 + 0 in total (QC-passed reads + QC-failed reads)


              Regarding rmdup don't see much convincing answers in the blogs I think i should explore the code here http://samtools.sourcearchive.com/do...8c-source.html

              Comment


              • #8
                Originally posted by sBeier View Post
                The difference in total input reads should come from the fact that you probably used accepted_hits.bam as TopHat output file. It does not include the unmapped reads, while the Bowtie output does. You can see that the Tophat results have 100% mapped reads.

                http://seqanswers.com/forums/showthread.php?t=19856
                That's all right, but how the number of mapped reads (bowtie2, accepted_hits.bam) can be bigger than number of reads in the original fastq file? I got this result once and I'm slightly confused how mapped reads are accounted for in bowtie2 algorithm.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Advancing Precision Medicine for Rare Diseases in Children
                  by seqadmin




                  Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                  12-16-2024, 07:57 AM
                • seqadmin
                  Recent Advances in Sequencing Technologies
                  by seqadmin



                  Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                  Long-Read Sequencing
                  Long-read sequencing has seen remarkable advancements,...
                  12-02-2024, 01:49 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 12-17-2024, 10:28 AM
                0 responses
                30 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-13-2024, 08:24 AM
                0 responses
                45 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-12-2024, 07:41 AM
                0 responses
                33 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-11-2024, 07:45 AM
                0 responses
                45 views
                0 likes
                Last Post seqadmin  
                Working...
                X