Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • htseq-count: many reads are "no_feature".

    Hi,
    I just started to use htseq-count for my paired-end RNA-Seq data. I've name-sorted uniquely mapped reads outputted by tophat, converted the sorted bam file to sam, and used the sam file with an ensembl gtf file (version 74 for human) as the input to htseq-count. Here's the command and output.
    Command:
    python -m HTSeq.scripts.count -o out.sam -i gene_name sorted.sam Ensembl_GRCh37.74.gtf > counts.txt

    Output:
    ...
    14098588 sam line pairs processed.
    no_feature 11732479
    ambiguous 52152
    too_low_aQual 0
    not_aligned 0
    alignment_not_unique 0

    There are 21150698 reads (some paired, some not) in my sorted sam file. I understand that paired reads are counted as one, and that's why there are only 14098588 processed by htseq-count. Here are my questions:
    1) Why are there so many "no_feature" reads? Is this normal for a typical RNA-Seq experiment? Is 11732479 the actual number of reads (paired or not) or the number of single reads + pairs counted as one?
    2) I also checked the output.sam but there are only 11585093 reads there. It looks like htseq-count didn't record all the reads that it processed. So how does htseq-count choose which reads to record?

    Thank you,
    Sylvia

  • #2
    Originally posted by zzhao2 View Post
    1) Why are there so many "no_feature" reads? Is this normal for a typical RNA-Seq experiment? Is 11732479 the actual number of reads (paired or not) or the number of single reads + pairs counted as one?
    Hi- Do the chromosome names in the GTF file match those in the sam file? Ensembl files have chromosome names like 1, 2, 3 etc. while UCSC uses chr1, chr2, chr3 etc. Also, maybe check you are using the same version for both the reference fasta and the gtf (e.g. *both* are hg19).

    I guess 11732479 refers to number of fragments so a single end read counts as 1 and the two ends of a paired-end read count 1 as well.

    Good luck
    Dario

    Comment


    • #3
      Originally posted by zzhao2 View Post
      Command:
      Code:
      python -m HTSeq.scripts.count -o out.sam -i gene_name sorted.sam Ensembl_GRCh37.74.gtf > counts.txt
      You did not specify a --stranded option for htseq-count which means it is using the default setting, --stranded=yes. There are three possible settings for --stranded (yes, no, reverse). This has to be set properly according to the protocol used to generate your RNA-Seq library. In fact --stranded=yes is probably the least likely. What kit/protocol was used to construct the library in this case?

      Comment


      • #4
        Hi Dario,
        thanks for your reply. Yes I've noticed the chromosome name issue and changed them to chr1, chr2, etc. Without changing them I could only get a result with all genes' read counts being zero, and all reads as "no_feature".

        I didn't use a reference fasta file. Should I use it, and where?

        If I have 11732479 "no_feature" fragments, then it's even worse, because the total number of my fragments is 14098588, so only 17% of them are mapped to exons.

        Originally posted by dariober View Post
        Hi- Do the chromosome names in the GTF file match those in the sam file? Ensembl files have chromosome names like 1, 2, 3 etc. while UCSC uses chr1, chr2, chr3 etc. Also, maybe check you are using the same version for both the reference fasta and the gtf (e.g. *both* are hg19).

        I guess 11732479 refers to number of fragments so a single end read counts as 1 and the two ends of a paired-end read count 1 as well.

        Good luck
        Dario

        Comment


        • #5
          Hi kmcarr,
          Thanks for pointing this out. I didn't specify the stranded option because my reads are stranded. I've actually tried all three settings: yes, no, and reverse, but none of them gave a number of "no_feature" reads that's significantly less than others.

          I don't actually know the kit used, because what I have is only the data from our sequencing facility, but their report says that my reads are stranded.

          Originally posted by kmcarr View Post
          You did not specify a --stranded option for htseq-count which means it is using the default setting, --stranded=yes. There are three possible settings for --stranded (yes, no, reverse). This has to be set properly according to the protocol used to generate your RNA-Seq library. In fact --stranded=yes is probably the least likely. What kit/protocol was used to construct the library in this case?

          Comment


          • #6
            I wonder though in everyone's experience, what's an "acceptable" percent of no features? Surely no organism's annotation is perfect? Plus how often is it that the RNA that you get is 100% mRNA

            In this nature paper they found about 86% mapped to known exons http://www.nature.com/nature/journal...ture08872.html

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM
            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 06:37 PM
            0 responses
            10 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Yesterday, 06:07 PM
            0 responses
            9 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-22-2024, 10:03 AM
            0 responses
            49 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-21-2024, 07:32 AM
            0 responses
            67 views
            0 likes
            Last Post seqadmin  
            Working...
            X