  • htseq-count gets more reads?

    Hi,

    My SAM file was sorted by read name using 'samtools sort -n'. I then used htseq-count to count the reads in this file, and double-checked by counting unique read names with awk/uniq/wc. The sum of the htseq-count output is larger than the number of unique read names. How can this happen? Does htseq-count count some reads multiple times? Thanks for any hints!
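For anyone wanting to reproduce the read-name check without awk/uniq/wc, here is a minimal Python sketch. It is not the poster's actual command: the function name `count_unique_read_names` and the tiny example records are made up for illustration, and it assumes plain-text SAM input in which no read name begins with '@'.

```python
import io


def count_unique_read_names(sam_lines):
    """Count distinct QNAME values among SAM alignment records."""
    names = set()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        # QNAME is the first tab-separated field of an alignment record.
        names.add(line.split("\t", 1)[0])
    return len(names)


# Made-up example: read "r2" appears on two alignment lines.
example = io.StringIO(
    "@HD\tVN:1.0\tSO:queryname\n"
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n"
    "r2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2\n"
    "r2\t256\tchr2\t300\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2\n"
)
print(count_unique_read_names(example))
```

For paired-end data both mates normally share one read name, so this counts fragments rather than individual mates.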

  • #2
    My guess: you have reads that were mapped to multiple locations, and htseq doesn't remove duplicates.

    You should get the same count if you drop the uniq from your command line (which then amounts to just samtools view | wc -l, I guess).


    • #3
      How did you sum up the htseq-count output? Because at the end of the htseq-count output you get these categories:

      no_feature
      ambiguous
      too_low_aQual
      not_aligned
      alignment_not_unique

      The last category is the number of reads with multiple hits. If you simply summed up all your output without excluding this last category, the total is going to be larger.
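A small sketch of summing the output with the special categories kept apart, assuming the usual two-column, tab-separated htseq-count output. The function name `summarize_counts` and the example rows are hypothetical. Newer htseq-count releases prefix these category names with two underscores, so the sketch strips leading underscores before comparing.

```python
SPECIAL_CATEGORIES = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def summarize_counts(lines):
    """Return (sum of per-gene counts, dict of special-category counts)."""
    gene_total = 0
    special = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each row is "<feature or category>\t<count>".
        name, value = line.rsplit("\t", 1)
        key = name.lstrip("_")  # tolerate the "__" prefix of newer releases
        if key in SPECIAL_CATEGORIES:
            special[key] = int(value)
        else:
            gene_total += int(value)
    return gene_total, special


# Made-up rows for illustration only.
rows = [
    "geneA\t10",
    "geneB\t5",
    "no_feature\t3",
    "ambiguous\t1",
    "alignment_not_unique\t7",
]
gene_total, special = summarize_counts(rows)
print(gene_total, special)
```

Keeping the categories in a separate dictionary makes it easy to decide afterwards which of them belong in the total.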


      • #4
        Thanks for the responses, guys!

        I think I found the reason. The number reported for 'alignment_not_unique' is not what I expected.

        First of all, my understanding is that reads with multiple hits are not counted in any gene.

        If that is true, and if 'alignment_not_unique' is the number of reads with multiple hits, then the sum of reads mapped in genes (uniquely), reads not in genes (uniquely), ambiguous reads, and multiple-hit reads should be exactly the total number of reads in the SAM file.

        What I did: using the 'NH:i' tag, I separated the alignments into two SAM files, one with unique mappings (NH:i:1) and the other with multiple hits (NH:i:n, n>1). I then ran htseq-count on both and counted the unique read IDs in both SAM files.
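The split described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the exact commands used: `split_by_nh` is a hypothetical helper, the input is plain-text SAM, and records without a usable NH tag are set aside in a third group instead of being guessed at.

```python
import re

# Matches an NH optional field such as "NH:i:3" as a whole tab-delimited field.
NH_FIELD = re.compile(r"\tNH:i:(\d+)(?=\t|$)")


def split_by_nh(sam_lines):
    """Partition SAM lines into (unique, multi, untagged) lists.

    Header lines are copied into both the unique and multi lists so that
    each can be written out as a valid SAM file.
    """
    unique, multi, untagged = [], [], []
    for line in sam_lines:
        record = line.rstrip("\r\n")
        if not record:
            continue
        if record.startswith("@"):
            unique.append(record)
            multi.append(record)
            continue
        match = NH_FIELD.search(record)
        hits = int(match.group(1)) if match else 0
        if hits == 1:
            unique.append(record)
        elif hits > 1:
            multi.append(record)
        else:
            # No NH tag (or NH:i:0): leave the decision to the user.
            untagged.append(record)
    return unique, multi, untagged
```

Whether an aligner writes the NH tag at all depends on the aligner, so it is worth checking that the untagged list is empty before trusting the split.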

        In total, I have 12.8M reads: 11.9M in the unique-mapping SAM and 0.9M in the other SAM.

        In the htseq-count output for the unique SAM, the 'alignment_not_unique' category is 0, and the sum of all genes and other categories is 11.8M (very close to 11.9M, so I am satisfied with it).

        In the htseq-count output for the other SAM, all genes and other categories are 0, and the 'alignment_not_unique' category is 2.8M. Remember that this SAM has 0.9M unique read IDs and 4.1M lines.

        So my conclusions:

        1. Reads are not counted multiple times; if they cannot be uniquely mapped, they are counted in one of the special categories.

        2. The numbers for genes and the other categories are numbers of reads, but the number for 'alignment_not_unique' is neither the number of reads nor the number of alignments. (My data is paired-end sequencing.)

        The reason I need these numbers is that I want to understand the RNA compositions (proportions of different regions), which is useful when comparing different biological samples.

        Right now, I take all the numbers in the htseq-count output except 'alignment_not_unique', then add the number of unique read IDs carrying an NH:i:n tag (n>1); the sum is (almost) what I expected.
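The workaround in the previous paragraph amounts to the following arithmetic, shown here as a hedged sketch. The function name `reconciled_read_total` is hypothetical, and `counts` is assumed to be a mapping from each htseq-count row name to its count.

```python
def reconciled_read_total(counts, multi_hit_read_ids):
    """Sum every htseq-count row except alignment_not_unique, then add
    the separately counted number of unique multi-hit read IDs."""
    counted = sum(
        value
        for name, value in counts.items()
        # Strip leading underscores so "__alignment_not_unique" also matches.
        if name.lstrip("_") != "alignment_not_unique"
    )
    return counted + multi_hit_read_ids


# Made-up numbers for illustration only.
counts = {"geneA": 10, "geneB": 5, "no_feature": 3, "alignment_not_unique": 7}
print(reconciled_read_total(counts, multi_hit_read_ids=2))
```

With the rounded figures from this post, 11.8M plus 0.9M gives 12.7M, against 12.8M reads in total.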

        Simon, if you read this thread, do you think it would be good to make this number a count of reads as well, so that the whole output is consistent?
