SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count paolo.kunder Bioinformatics 10 10-22-2014 04:45 AM
Which ID should be used for HTSeq-count? syintel87 Bioinformatics 11 02-07-2013 12:16 AM
htseq-count performance dglemay Bioinformatics 8 10-23-2012 07:08 PM
Output of HTseq-count? kasutubh Bioinformatics 8 08-21-2012 10:39 PM
multiBamCov or htseq-count to count read per feature ? NicoBxl Bioinformatics 1 07-03-2012 02:05 AM

Reply
 
Thread Tools
Old 04-15-2014, 01:45 PM   #1
halffedelf
Newborn
 
Location: Toronto, ON, Canada

Join Date: Jul 2010
Posts: 11
Exclamation HTseq-count of zero

Hi I ran htseq on a number of RNA expression data, but I get a read count of zero for a certain transcript which I am sure should not be zero. The data are generated from multiple samples in multiple batches in multiple labs, so it's very unlikely that the data is wrong.

I would highly appreciate your expert opinion. What could possibly the reason for the count to be zero whereas I know this is the gene of interest and it should be expressed at various degrees in the different datasets of mine.

I generated a wiggle file from the bam and checked the number of peaks for the region of interest, there are peaks in the region (even though very weak at >50% positions).

Thanks!
halffedelf is offline   Reply With Quote
Old 04-15-2014, 01:58 PM   #2
halffedelf
Newborn
 
Location: Toronto, ON, Canada

Join Date: Jul 2010
Posts: 11
Default

these info might be of importance:
my gene of interest is a lncRNA gene.
my command was:
htseq-count -s no -a 10 filename.sam hg19_RefseqCoding_lncRNA.gtf >filename.count
halffedelf is offline   Reply With Quote
Old 04-15-2014, 02:21 PM   #3
Bukowski
Senior Member
 
Location: Aberdeen, Scotland

Join Date: Jan 2010
Posts: 388
Default

Maybe we should start with two more pieces of information - how do you know this transcript is expressed? And how many reads/read pairs do you have per sample?
Bukowski is offline   Reply With Quote
Old 04-15-2014, 02:56 PM   #4
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Have you actually looked at this transcript and the mappings in IGV?
Wallysb01 is offline   Reply With Quote
Old 04-15-2014, 04:57 PM   #5
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

htseq-count discards reads that cannot be unambiguously be assigned to a gene.

1) Open the GTF file in IGV, and check if there is more than one annotation at the same location.
If there is, htseqcount classifies the reads aligning at this location as "ambiguous".
These reads are not counted for the gene.

You can modify the GTF file to correct this problem, by removing multiple annotations at the same location.
Or, even better, do stranded RNA-Seq.
Often, you will have different annotations at the same location, but on different strands.
If your sequencing is strand-specific, htseq-count can distinguish between the 2 strands, and will not classify the reads as ambiguous.

2) Check the tag NH (Number of hits) for the reads aligning to the gene in question.
If it is more than one, htseq-count classifies these reads as multihits.
The reads are not counted for the gene in questions, but are counted as alignment_not_unique

If you know some Python, you can actually modify htseq-count to count multiple hits, which I have done on one occasion.
I still have the script if anyone needs it.
Only the NH parameter in the htseqcount script needs to be changed.

Last edited by blancha; 04-15-2014 at 04:59 PM.
blancha is offline   Reply With Quote
Old 04-15-2014, 05:17 PM   #6
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

blancha, brings up good points. I would just like to clarify on part 1 there. For an entire gene or transcript to effected by that, the entire thing would have to be overlapping or have no reads completely contained in any mutually exclusive part of the transcript. That last part does depend on your settings however. You can count reads that map to features which partially overlap but are not completely within other features so long as you don’t use the ‘union’ option and your reads don’t span into portions of the non-overlapping features for *both* features.
Wallysb01 is offline   Reply With Quote
Old 04-17-2014, 07:47 AM   #7
halffedelf
Newborn
 
Location: Toronto, ON, Canada

Join Date: Jul 2010
Posts: 11
Default

thanks everyone for your input.
This is how my gtf file and bam file look like in IGV:


The RNA-seq was strand specific so I guess multiple annotation for the same locus was not the issue here.

All my samples have read count of >50M.

PCAT1 is a well-studied gene and it is expressed in the tissues I am working with. You can see from the bam file that there are reads in the exon region.
halffedelf is offline   Reply With Quote
Old 04-17-2014, 09:37 AM   #8
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

@halffedelf: You twice have the same annotation at the same location, and on the same strand: NR_045262, and ENST00000561978. htseq-count will count these reads as ambiguous.

Edit your GTF file, find a GTF file without redundant annotations, or use Cufflinks which will assign the reads to both NR_045262, and ENST00000561978.

If your RNA-Seq is stranded, you should also specify s=y or reverse, depending on how the library was prepared. You're losing a lot of information by setting s to no. With s=n, htseq-count is ignoring the strand.
blancha is offline   Reply With Quote
Old 04-17-2014, 09:43 AM   #9
halffedelf
Newborn
 
Location: Toronto, ON, Canada

Join Date: Jul 2010
Posts: 11
Default

Thanks
I am already looking for a clean gtf with refgene and lncRNA only.
I will also run htseq again with s=y.

Quote:
Originally Posted by blancha View Post
@halffedelf: You twice have the same annotation at the same location, and on the same strand: NR_045262, and ENST00000561978. htseq-count will count these reads as ambiguous.

Edit your GTF file, find a GTF file without redundant annotations, or use Cufflinks which will assign the reads to both NR_045262, and ENST00000561978.

If your RNA-Seq is stranded, you should also specify s=y or reverse, depending on how the library was prepared. You're losing a lot of information by setting s to no. With s=n, htseq-count is ignoring the strand.
halffedelf is offline   Reply With Quote
Old 04-17-2014, 09:46 AM   #10
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

The peaks are small, dispersed, and full of mutations. I'm not sure one could concluded the gene is expressed based on this alignment, even if the reads were counted correctly.
Although Cufflinks would not have the same problem with the redundant annotation as htseq-count, it would not calculate a significant FPKM for this transcript, given the distribution profile of the aligned reads.

Last edited by blancha; 04-17-2014 at 09:51 AM.
blancha is offline   Reply With Quote
Old 04-17-2014, 09:49 AM   #11
halffedelf
Newborn
 
Location: Toronto, ON, Canada

Join Date: Jul 2010
Posts: 11
Default

EDIT: Please disregard my following question. You could tell that from the peak bar on top.

How could you tell about the mutation? Where does IGV show the mutation info from there?
The expression could be low level, I am trying to obtain a differental expression of the gene between tumor cell and control.

Quote:
Originally Posted by blancha View Post
The peaks are small, dispersed, and full of mutations. I'm not sure one could concluded the gene is expressed based on this alignment, even if the reads were counted correctly.

Last edited by halffedelf; 04-17-2014 at 09:57 AM.
halffedelf is offline   Reply With Quote
Old 04-17-2014, 10:03 AM   #12
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

The colored vertical lines indicate mutations. The mutations can be seen better with a zoom in. The mutations could be real, and biologically significant, or they could be alignment mistakes. When you have this many mutations, and dispersed peaks, it is more likely to be alignment mistakes.

I couldn't concluded based on this alignment that the transcript is expressed.
To be fair, I couldn't conclude that the transcript is not expressed either.

If the transcript is lowly expressed, you'll need a higher sequencing depth to come to a definitive conclusion.
blancha is offline   Reply With Quote
Old 04-17-2014, 10:07 AM   #13
halffedelf
Newborn
 
Location: Toronto, ON, Canada

Join Date: Jul 2010
Posts: 11
Default

I understand what you mean. I am just not sure what alignment mistakes could this be, or how I could take care of it. May be the sequencing quality was not good hence the higher number of wrong base calls?
halffedelf is offline   Reply With Quote
Old 04-17-2014, 10:26 AM   #14
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

I would need more information. There are many possibilities.

What is the RIN (RNA Integrity Number)?
If the RNA is of poor quality at the start, it will result in bad alignments.

What is the output of FASTQC?
If there are many poor quality bases that were not trimmed, it will result in bad alignments.

Was trimming performed?
The quality of the alignment can improved by removing poor quality bases, and adapter sequences. Trimming is only required if the output of FASTQC shows poor quality bases or adapter sequences. Trimming too aggressively can actually give results of lesser quality.

What is the output of RNASeQC?
RNASeQC will calculate a number of useful metrics on the quality of the alignments.

What is the sequencing depth?
It will be difficult to come to firm conclusions on lowly expressed transcripts if the sequencing depth is too low.
blancha is offline   Reply With Quote
Old 04-17-2014, 10:31 AM   #15
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

I forgot to mention the obvious.

Are there proper peaks for other transcripts? You could just check a housekeeping gene for example to see what the peaks resemble for a transcript that is known to be highly expressed.
blancha is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:18 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO