Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTseq-count of zero

    Hi I ran htseq on a number of RNA expression data, but I get a read count of zero for a certain transcript which I am sure should not be zero. The data are generated from multiple samples in multiple batches in multiple labs, so it's very unlikely that the data is wrong.

    I would highly appreciate your expert opinion. What could possibly the reason for the count to be zero whereas I know this is the gene of interest and it should be expressed at various degrees in the different datasets of mine.

    I generated a wiggle file from the bam and checked the number of peaks for the region of interest, there are peaks in the region (even though very weak at >50% positions).

    Thanks!

  • #2
    these info might be of importance:
    my gene of interest is a lncRNA gene.
    my command was:
    htseq-count -s no -a 10 filename.sam hg19_RefseqCoding_lncRNA.gtf >filename.count

    Comment


    • #3
      Maybe we should start with two more pieces of information - how do you know this transcript is expressed? And how many reads/read pairs do you have per sample?

      Comment


      • #4
        Have you actually looked at this transcript and the mappings in IGV?

        Comment


        • #5
          htseq-count discards reads that cannot be unambiguously be assigned to a gene.

          1) Open the GTF file in IGV, and check if there is more than one annotation at the same location.
          If there is, htseqcount classifies the reads aligning at this location as "ambiguous".
          These reads are not counted for the gene.

          You can modify the GTF file to correct this problem, by removing multiple annotations at the same location.
          Or, even better, do stranded RNA-Seq.
          Often, you will have different annotations at the same location, but on different strands.
          If your sequencing is strand-specific, htseq-count can distinguish between the 2 strands, and will not classify the reads as ambiguous.

          2) Check the tag NH (Number of hits) for the reads aligning to the gene in question.
          If it is more than one, htseq-count classifies these reads as multihits.
          The reads are not counted for the gene in questions, but are counted as alignment_not_unique

          If you know some Python, you can actually modify htseq-count to count multiple hits, which I have done on one occasion.
          I still have the script if anyone needs it.
          Only the NH parameter in the htseqcount script needs to be changed.
          Last edited by blancha; 04-15-2014, 04:59 PM.

          Comment


          • #6
            blancha, brings up good points. I would just like to clarify on part 1 there. For an entire gene or transcript to effected by that, the entire thing would have to be overlapping or have no reads completely contained in any mutually exclusive part of the transcript. That last part does depend on your settings however. You can count reads that map to features which partially overlap but are not completely within other features so long as you don’t use the ‘union’ option and your reads don’t span into portions of the non-overlapping features for *both* features.

            Comment


            • #7
              thanks everyone for your input.
              This is how my gtf file and bam file look like in IGV:


              The RNA-seq was strand specific so I guess multiple annotation for the same locus was not the issue here.

              All my samples have read count of >50M.

              PCAT1 is a well-studied gene and it is expressed in the tissues I am working with. You can see from the bam file that there are reads in the exon region.

              Comment


              • #8
                @halffedelf: You twice have the same annotation at the same location, and on the same strand: NR_045262, and ENST00000561978. htseq-count will count these reads as ambiguous.

                Edit your GTF file, find a GTF file without redundant annotations, or use Cufflinks which will assign the reads to both NR_045262, and ENST00000561978.

                If your RNA-Seq is stranded, you should also specify s=y or reverse, depending on how the library was prepared. You're losing a lot of information by setting s to no. With s=n, htseq-count is ignoring the strand.

                Comment


                • #9
                  Thanks
                  I am already looking for a clean gtf with refgene and lncRNA only.
                  I will also run htseq again with s=y.

                  Originally posted by blancha View Post
                  @halffedelf: You twice have the same annotation at the same location, and on the same strand: NR_045262, and ENST00000561978. htseq-count will count these reads as ambiguous.

                  Edit your GTF file, find a GTF file without redundant annotations, or use Cufflinks which will assign the reads to both NR_045262, and ENST00000561978.

                  If your RNA-Seq is stranded, you should also specify s=y or reverse, depending on how the library was prepared. You're losing a lot of information by setting s to no. With s=n, htseq-count is ignoring the strand.

                  Comment


                  • #10
                    The peaks are small, dispersed, and full of mutations. I'm not sure one could concluded the gene is expressed based on this alignment, even if the reads were counted correctly.
                    Although Cufflinks would not have the same problem with the redundant annotation as htseq-count, it would not calculate a significant FPKM for this transcript, given the distribution profile of the aligned reads.
                    Last edited by blancha; 04-17-2014, 09:51 AM.

                    Comment


                    • #11
                      EDIT: Please disregard my following question. You could tell that from the peak bar on top.

                      How could you tell about the mutation? Where does IGV show the mutation info from there?
                      The expression could be low level, I am trying to obtain a differental expression of the gene between tumor cell and control.

                      Originally posted by blancha View Post
                      The peaks are small, dispersed, and full of mutations. I'm not sure one could concluded the gene is expressed based on this alignment, even if the reads were counted correctly.
                      Last edited by halffedelf; 04-17-2014, 09:57 AM.

                      Comment


                      • #12
                        The colored vertical lines indicate mutations. The mutations can be seen better with a zoom in. The mutations could be real, and biologically significant, or they could be alignment mistakes. When you have this many mutations, and dispersed peaks, it is more likely to be alignment mistakes.

                        I couldn't concluded based on this alignment that the transcript is expressed.
                        To be fair, I couldn't conclude that the transcript is not expressed either.

                        If the transcript is lowly expressed, you'll need a higher sequencing depth to come to a definitive conclusion.

                        Comment


                        • #13
                          I understand what you mean. I am just not sure what alignment mistakes could this be, or how I could take care of it. May be the sequencing quality was not good hence the higher number of wrong base calls?

                          Comment


                          • #14
                            I would need more information. There are many possibilities.

                            What is the RIN (RNA Integrity Number)?
                            If the RNA is of poor quality at the start, it will result in bad alignments.

                            What is the output of FASTQC?
                            If there are many poor quality bases that were not trimmed, it will result in bad alignments.

                            Was trimming performed?
                            The quality of the alignment can improved by removing poor quality bases, and adapter sequences. Trimming is only required if the output of FASTQC shows poor quality bases or adapter sequences. Trimming too aggressively can actually give results of lesser quality.

                            What is the output of RNASeQC?
                            RNASeQC will calculate a number of useful metrics on the quality of the alignments.

                            What is the sequencing depth?
                            It will be difficult to come to firm conclusions on lowly expressed transcripts if the sequencing depth is too low.

                            Comment


                            • #15
                              I forgot to mention the obvious.

                              Are there proper peaks for other transcripts? You could just check a housekeeping gene for example to see what the peaks resemble for a transcript that is known to be highly expressed.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X