Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • htseq-count __alignment_not_unique vs tophat

    Hi guys,

    I'm really confused by the numbers between tophat & htseq-count.

    There was a huge difference between __alignment_not_unique from htseq-count and unmapped reads from tophat. Here's the data:

    I mapped single-end, 50-bp short reads to mouse genome using tophat2.

    alignment_summary.txt from tophat showed this statistics below:

    Reads:
    Input : 59054037
    Mapped : 57231649 (96.9% of input)
    of these: 10102891 (17.7%) have multiple alignments (5593 have >20)
    96.9% overall read mapping rate.

    However, when I used htseq-count (v0.6.1p1, -s no) to do a raw count from accepted_hits.bam. I got a large number of non-unique alignment. See below:

    __no_feature 9924659
    __ambiguous 1401163
    __too_low_aQual 0
    __not_aligned 0
    __alignment_not_unique 25773467

    unmappable reads from tophat = 59045037-57231649 ~= 1.8M, but htseq-count was 25M

    Does it make sense to compare these two numbers? What might be wrong? BTW, both tophat and htseq-count used the same genes.gtf.

    Also, when I added up the numbers from the log file produced by htseq-count, the counts added up ~73M (mapped+no feature+ambiguous+not_unique), which was much higher than the 59M reads printed from tophat (and fastqc). Why?

    Many thanks.

  • #2
    Originally posted by chumho View Post
    Hi guys,

    I'm really confused by the numbers between tophat & htseq-count.

    There was a huge difference between __alignment_not_unique from htseq-count and unmapped reads from tophat. Here's the data:

    I mapped single-end, 50-bp short reads to mouse genome using tophat2.

    alignment_summary.txt from tophat showed this statistics below:

    Reads:
    Input : 59054037
    Mapped : 57231649 (96.9% of input)
    of these: 10102891 (17.7%) have multiple alignments (5593 have >20)
    96.9% overall read mapping rate.

    However, when I used htseq-count (v0.6.1p1, -s no) to do a raw count from accepted_hits.bam. I got a large number of non-unique alignment. See below:

    __no_feature 9924659
    __ambiguous 1401163
    __too_low_aQual 0
    __not_aligned 0
    __alignment_not_unique 25773467

    unmappable reads from tophat = 59045037-57231649 ~= 1.8M, but htseq-count was 25M

    Does it make sense to compare these two numbers? What might be wrong? BTW, both tophat and htseq-count used the same genes.gtf.

    Also, when I added up the numbers from the log file produced by htseq-count, the counts added up ~73M (mapped+no feature+ambiguous+not_unique), which was much higher than the 59M reads printed from tophat (and fastqc). Why?

    Many thanks.
    The tophat output says that 17.7% of your reads ~10M have multiple alignments. Those will all get counted as _alignment_no_unique since they can't be assigned to a unique site in the genome. Additionally, they'll show up multiple times in the sam/bam file, once for each possible alignment, which is why you can have more overall counts from htseq-count than you have reads.

    Comment


    • #3
      Thanks cmbetts for the reply.

      I agree that "The tophat output says that 17.7% of your reads ~10M have multiple alignments. Those will all get counted as _alignment_no_unique since they can't be assigned to a unique site in the genome." So that explains the 1.8M I calculated.

      "Additionally, they'll show up multiple times in the sam/bam file, once for each possible alignment, which is why you can have more overall counts from htseq-count than you have reads." Are you saying the unit of "__alignment_not_unique" is times instead of reads? E.g. One non-unique read mapped 6 locations will be added 6 times in "__alignment_not_unique" by htseq-count but it is counted as 1 by tophat?

      Comment


      • #4
        Think that you're misinterpreting the tophat output. You have 1.8M unmapped reads. Tophat doesn't even put those in the bam file, and if it did, they would be counted as _not_aligned.
        While tophat was able to map 96.9% of your reads (57.2M), 17.7% of those (10.1M) are not uniquely aligned, meaning that they have equally good alignments to two or more places in the genome. For those reads, they will have as many occurrences in the bam file as they have valid alignments, which means the bam has a minimum of 20.2M occurrences of those reads and some of those reads will have more than just two alignments, bringing the total up to the 25.7M seen by running htseq-count.

        from the htseq-count FAQ:
        Why is the sum of all counts different from the number of reads in my FASTQ file?
        A read with more than one reported alignment appears only once in the FASTQ file, but several times in the SAM file (once for each alignment), and each time htseq-count encounters one, it increased the __alignment_not_unique counter by one. Therefore, mutiply aligned reads are counted multiple times.

        Comment


        • #5
          Thanks. It makes sense now.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          8 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          8 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          49 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          67 views
          0 likes
          Last Post seqadmin  
          Working...
          X