
  • htseq-count gets more reads?

    Hi,

    My SAM file was sorted by read name using 'samtools sort -n'. I then used htseq-count to count the reads in this file, and double-checked by counting the unique read names with awk/uniq/wc. The sum of the htseq-count output is larger than the number of unique read names. How can this happen? Does htseq-count count some reads multiple times? Thanks for any hints!

  • #2
    My guess: you have reads that were mapped to multiple locations, and htseq-count doesn't remove duplicates.

    You should get the same count if you drop the uniq from your command line (which amounts to just samtools view | wc -l, I guess).
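To check this guess, it helps to compare the number of alignment lines with the number of distinct read names. The following is a minimal Python sketch of that comparison; the function name `count_alignments_and_reads` and the three example records are made up for illustration. Note that in a paired-end SAM file both mates share one read name, so the line count is larger than the read-name count even without multiple hits.

```python
def count_alignments_and_reads(sam_lines):
    """Return (alignment_lines, unique_read_names) for an iterable of SAM lines."""
    n_alignments = 0
    read_names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        n_alignments += 1
        read_names.add(line.split("\t", 1)[0])  # QNAME is the first column
    return n_alignments, len(read_names)


# Hypothetical records: read2 is reported at two locations.
example = [
    "@HD\tVN:1.0\tSO:queryname",
    "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "read2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "read2\t256\tchr2\t300\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
]

n_alignments, n_reads = count_alignments_and_reads(example)
print(n_alignments, n_reads)
```

For a real file, the same function can be fed an open file handle, for example `count_alignments_and_reads(open("aln.sam"))`, where `aln.sam` is a placeholder path.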



    • #3
      How did you sum up the htseq-count output? Because at the end of the htseq-count output you get these categories:

      no_feature
      ambiguous
      too_low_aQual
      not_aligned
      alignment_not_unique

      The last category is the number of reads with multiple hits. If you simply summed up all of your output without excluding this last category, then the total is going to be larger.
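One way to avoid mixing the two kinds of rows is to sum the gene counts and the special categories separately. This is only a sketch: the function name `summarize_htseq_counts` and the example numbers are invented, and it assumes the usual two-column, tab-separated htseq-count output. Some htseq-count versions prefix the special categories with a double underscore, so the sketch strips leading underscores before comparing names.

```python
SPECIAL_CATEGORIES = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def summarize_htseq_counts(lines):
    """Return (sum_of_gene_counts, dict_of_special_categories)."""
    gene_total = 0
    special = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, value = line.split("\t")
        key = name.lstrip("_")  # tolerate names such as "__no_feature"
        if key in SPECIAL_CATEGORIES:
            special[key] = int(value)
        else:
            gene_total += int(value)
    return gene_total, special


# Hypothetical htseq-count output, not taken from this thread.
example_output = [
    "geneA\t10",
    "geneB\t5",
    "no_feature\t3",
    "ambiguous\t1",
    "too_low_aQual\t0",
    "not_aligned\t0",
    "alignment_not_unique\t7",
]

gene_total, special = summarize_htseq_counts(example_output)
# Total without the multiple-hit category.
total_without_multi = gene_total + sum(
    v for k, v in special.items() if k != "alignment_not_unique"
)
print(gene_total, special["alignment_not_unique"], total_without_multi)
```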



      • #4
        Thanks for the responses, guys!

        I think I found the reason. The number reported for 'alignment_not_unique' is not what I expected.

        First of all, my understanding is that reads with multiple hits are not counted toward any gene.

        If that is true, and if 'alignment_not_unique' is the number of reads with multiple hits, then the sum of the reads mapped uniquely in genes, the reads mapped uniquely outside genes, the ambiguous reads, and the reads with multiple hits should be exactly the total number of reads in the SAM file.

        What I did: using the 'NH:i' tag, I separated the alignments into two SAM files, one with the unique mappings (NH:i:1) and the other with the multiple hits (NH:i:n, n>1). Then I ran htseq-count on both files and counted the unique read IDs in each.
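The split described above could look roughly like the following Python sketch. It is not the exact command used in this post; the function names `nh_value` and `split_by_nh` are hypothetical, and it assumes the aligner writes an NH:i tag among the optional fields (not every aligner does). Records without the tag are collected separately instead of being guessed at.

```python
def nh_value(line):
    """Return the integer NH:i value of a SAM record, or None if absent."""
    # Optional tags start after the 11 mandatory SAM columns.
    for field in line.rstrip("\n").split("\t")[11:]:
        if field.startswith("NH:i:"):
            return int(field[5:])
    return None


def split_by_nh(sam_lines):
    """Split SAM lines into header, unique (NH=1), multi (NH>1) and untagged."""
    header, unique, multi, untagged = [], [], [], []
    for line in sam_lines:
        if line.startswith("@"):
            header.append(line)
            continue
        nh = nh_value(line)
        if nh is None:
            untagged.append(line)
        elif nh == 1:
            unique.append(line)
        else:
            multi.append(line)
    return header, unique, multi, untagged


# Hypothetical records for illustration only.
records = [
    "@HD\tVN:1.0\tSO:queryname",
    "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "read2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "read2\t256\tchr2\t300\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

header, unique, multi, untagged = split_by_nh(records)
multi_read_ids = {line.split("\t", 1)[0] for line in multi}
print(len(unique), len(multi), len(multi_read_ids), len(untagged))
```

Writing `header + unique` and `header + multi` to two files gives two inputs for htseq-count, and `multi_read_ids` gives the number of distinct reads with multiple hits.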

        In total, I have 12.8M reads, 11.9M in the unique mapping sam, 0.9M in the other sam.

        In the htseq-count output for the unique-mapping SAM file, the 'alignment_not_unique' category is 0, and the sum of all genes and the other categories is 11.8M (very close to 11.9M, so I am satisfied with it).

        In the htseq-count output for the other SAM file, all genes and the other categories are 0, and the 'alignment_not_unique' category is 2.8M. Remember that this SAM file has 0.9M unique read IDs and 4.1M lines.

        So my conclusions:

        1. Reads are not counted multiple times; if a read cannot be assigned to a single gene, it is counted in one of the special categories instead.

        2. The numbers for genes and the other categories are numbers of reads, but the number for 'alignment_not_unique' is neither the number of reads nor the number of alignments. (My data is paired-end.)

        The reason I need these numbers is that I want to understand the RNA composition (the proportions of reads from different regions), which is useful when comparing different biological samples.

        Right now, I take all the numbers in the htseq-count output except 'alignment_not_unique', then add the number of unique read IDs carrying an NH:i:n tag with n>1. That sum is (almost) what I expected.
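The workaround above can be written as a small Python sketch. The function name `read_composition`, the key `multi_mapped_reads`, and all of the numbers are invented for illustration; the only assumption carried over from this post is that 'alignment_not_unique' is replaced by a separately obtained count of distinct multi-mapped read IDs.

```python
def read_composition(counts, n_multi_reads):
    """Replace 'alignment_not_unique' with a read-level count and
    return (total_reads, proportion_per_category)."""
    adjusted = {
        name: value
        for name, value in counts.items()
        if name.lstrip("_") != "alignment_not_unique"
    }
    # Hypothetical key holding the number of distinct multi-mapped read IDs.
    adjusted["multi_mapped_reads"] = n_multi_reads
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("no reads to summarize")
    proportions = {name: value / total for name, value in adjusted.items()}
    return total, proportions


# Made-up counts, not the numbers from this thread.
counts = {
    "geneA": 60,
    "geneB": 20,
    "no_feature": 10,
    "ambiguous": 2,
    "alignment_not_unique": 30,
}

total, proportions = read_composition(counts, n_multi_reads=8)
print(total, proportions["geneA"], proportions["multi_mapped_reads"])
```

With every category expressed in reads, the proportions sum to 1 and can be compared between samples.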

        Simon, if you read this thread, do you think it would be good to make this number a number of reads as well, so that the whole output is consistent?
