Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • TopHat2 align_summary.txt vs samtools flagstat

    Hi,

    There are very large differences between the concordant pair alignment rate calculated by TopHat2, and the properly paired percentage calculated by samtools flagstat.
    In the following example of paired-end 50 bases RNA-Seq, TopHat calculates a concordant pair alignment rate of 95.1%, while samtools flagstat calculates that 46.13% of the reads are properly paired.
    These large differences are present in all my analyses, not just this one.

    I trust the TopHat2 numbers more, but I need to be able to explain why samtools flagstat should not be used as a quality control.
    I assume TopHat2 just didn't set the flags correctly, causing problems to samtools flagstat, but I would like to confirm this.

    TopHat v2.0.10
    samtools v0.1.19
    RNA-Seq: 50 bases paired-end (library-type fr-firststrand)
    Average fragment size: 150 bases

    Code:
    [blancha@lg-1r17-n02 tophat]$ more tophat_R2.sh
    tophat \
    --no-novel-juncs -p 3 \
    --library-type fr-firststrand \
    -G genomes/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/Mus_musculus.GRCm38.75.gtf \
    -o ../results/tophat/R2 \
    genomes//Mus_musculus/Ensembl/GRCm38/Sequence/Bowtie2Index/genome \
    ../data/FASTQ_files/untrimmed/HI.1774.007.Index_4.MAG_R2_R1.fastq.gz \
    ../data/FASTQ_files/untrimmed/HI.1774.007.Index_4.MAG_R2_R2.fastq.gz 
    
    [blancha@lg-1r17-n02 tophat]$ more ../results/tophat/R2/align_summary.txt
    Left reads:
              Input     :  39626765
               Mapped   :  39103386 (98.7% of input)
                of these:   2042138 ( 5.2%) have multiple alignments (117551 have >20)
    Right reads:
              Input     :  39626765
               Mapped   :  38922972 (98.2% of input)
                of these:   2035395 ( 5.2%) have multiple alignments (117438 have >20)
    98.5% overall read mapping rate.
    
    Aligned pairs:  38557982
         of these:   2008047 ( 5.2%) have multiple alignments
                      856546 ( 2.2%) are discordant alignments
    95.1% concordant pair alignment rate.
    
    [blancha@lg-1r17-n02 tophat]$ samtools flagstat ../results/tophat/R2/accepted_hits.bam 
    93634895 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    93634895 + 0 mapped (100.00%:-nan%)
    93634895 + 0 paired in sequencing
    46920556 + 0 read1
    46714339 + 0 read2
    43190782 + 0 properly paired (46.13%:-nan%)
    92474116 + 0 with itself and mate mapped
    1160779 + 0 singletons (1.24%:-nan%)
    9792980 + 0 with mate mapped to a different chr
    237120 + 0 with mate mapped to a different chr (mapQ>=5)
    Thank you for your help.
    Last edited by blancha; 06-23-2014, 05:19 PM.

  • #2
    same problem. Is there anyone to explain these differences??

    Comment


    • #3
      I think we need to trust the Tophat align_summary file to determine the number of reads since samtools flagstat reporting all multiple reads.

      Comment


      • #4
        Discrepancy between Tophat align_summary and samtools flagstat paired end

        I also used TopHat2 to align 50bp paired-end reads. I used the --no-mixed option in TopHat and then compared align_summary.txt to samtools flagstat. The numbers are not identical but they are not as off as the ones you report. (Note: In my data, the number of mapped reads is so low because I was mapping to a bacterial genome to check for contamination, thankfully there does not seem to be much).

        This is what I got:

        TopHat:
        Left reads:
        Input: 5442643
        Mapped: 18683 ( 0.3% of input)
        of these: 41 ( 0.2%) have multiple alignments (0 have >20)
        Right reads:
        Input: 5438826
        Mapped: 14866 ( 0.3% of input)
        of these: 685 ( 4.6%) have multiple alignments (0 have >20)
        0.3% overall read alignment rate.

        Aligned pairs: 4026
        of these: 0 ( 0.0%) have multiple alignments
        and: 0 ( 0.0%) are discordant alignments
        0.1% concordant pair alignment rate.



        SAMtools:
        8052 + 0 mapped (100.00%:-nan%)
        8052 + 0 paired in sequencing
        4026 + 0 read1
        4026 + 0 read2
        7602 + 0 properly paired (94.41%:-nan%)
        8052 + 0 with itself and mate mapped
        0 + 0 singletons (0.00%:-nan%)
        0 + 0 with mate mapped to a different chr
        0 + 0 with mate mapped to a different chr (mapQ>=5)

        So perhaps try running TopHat with --no-mixed option.

        You could also try --no-discordant, which tells TopHat to only report concordant pairs, but in my experience this causes the program to crash on versions <2.0.9.
        Last edited by bpb9; 11-16-2014, 06:56 PM. Reason: additional info

        Comment


        • #5
          I've figured out the issue.
          TopHat includes the intron in the insert size.
          Hence, when samtools flagstat reads the insert size, more often than not, the insert size is far greater than what would be expected from properly paired reads.

          So, the numbers are a bit misleading but not incorrect.

          TopHat includes the intron in the insert size reported.
          Properly paired reads can therefore have very large insert sizes.
          Samtools flagstat does not expect the intron to be included in the insert size.
          It reports as being improperly paired reads with large insert sizes, which actually just span an intron.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Essential Discoveries and Tools in Epitranscriptomics
            by seqadmin




            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
            04-22-2024, 07:01 AM
          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 08:47 AM
          0 responses
          16 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          60 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          60 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          54 views
          0 likes
          Last Post seqadmin  
          Working...
          X