**Simon Anders** · 01-28-2011, 05:55 AM

HTSeq-count can be quite conservative in the sense that it discards reads that cannot be unambiguously assigned to a gene. This can cause values to be lower than those reported by other tools. (This is done on purpose: for differential expression analysis, you need to discount ambiguous reads so that a differential signal on one gene does not show up in another gene that overlaps it.)
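As a rough illustration of the rule described above, here is a minimal Python sketch of conservative read assignment. It is not htseq-count's actual code; the gene names, coordinates, and the `assign_read` helper are hypothetical, and strand and splicing are ignored.

```python
from collections import Counter

# Hypothetical gene intervals on one chromosome: name -> (start, end), half-open.
GENES = {"geneA": (100, 500), "geneB": (450, 900)}

def assign_read(start, end, genes=GENES):
    """Return the gene a read is assigned to, or a special category."""
    # Collect every gene whose interval overlaps the read interval.
    hits = [name for name, (g_start, g_end) in genes.items()
            if start < g_end and end > g_start]
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        # The read cannot be attributed to a single gene, so it is not counted.
        return "ambiguous"
    return hits[0]

# Four toy reads given as (start, end).
reads = [(120, 170), (460, 510), (600, 650), (950, 1000)]
counts = Counter(assign_read(s, e) for s, e in reads)
print(dict(counts))
```

The read at 460-510 falls in the overlap of the two toy genes and is therefore set aside rather than counted for either one.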

The gene FBgn0000003 is an extremely short ncRNA (less than 300 bp), so it is quite possible that very few reads, maybe just a single one, give rise to a rather high FPKM value. If you now wonder why these few reads were discounted by HTSeq and counted by cufflinks, display your SAM file in a genome browser (e.g., IGV) and have a look at the FBgn0000003 locus. How many reads do you see there? Do they maybe overlap with more than one feature? If you write down these reads' IDs, you can then use HTSeq-count's new '-o' option to check what these reads were assigned to.
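To make the point about short genes concrete, here is a small worked example using the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments). The read counts and library size are made-up numbers, not values from the data in this thread.

```python
def fpkm(fragments, length_bp, total_mapped):
    """FPKM = fragments / (length in kb) / (mapped fragments in millions)."""
    return fragments * 1e9 / (length_bp * total_mapped)

# Hypothetical library with 2 million mapped fragments.
TOTAL_MAPPED = 2_000_000

# The same 3 fragments on a 300 bp gene versus a 3000 bp gene.
short_gene = fpkm(3, 300, TOTAL_MAPPED)
long_gene = fpkm(3, 3000, TOTAL_MAPPED)

print(f"short gene: {short_gene:.2f} FPKM, long gene: {long_gene:.2f} FPKM")
```

With identical read counts, the short gene gets a ten times higher FPKM, which is why a handful of reads can matter so much at a locus under 300 bp.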

Finally: Your gene is on the '+' strand. By default, htseq-count only counts reads that were aligned to the same strand as the feature. This makes sense if your RNA-Seq protocol preserves strand information. Remember that Illumina's default protocol does not do this, and for such data, you must specify the option '--stranded=no' to get htseq-count to count reads on either strand.
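The effect of the strand setting can be sketched in a few lines of Python. This is only a toy model of the behaviour described above (the `count_for_feature` helper and the read strands are hypothetical), not the tool's implementation.

```python
def count_for_feature(read_strands, feature_strand, stranded=True):
    """Count reads overlapping a feature, optionally requiring the same strand."""
    if not stranded:
        # Unstranded mode: reads on either strand count.
        return len(read_strands)
    # Stranded mode: only reads on the feature's own strand count.
    return sum(1 for strand in read_strands if strand == feature_strand)

# Toy reads from a protocol that does not preserve strand: roughly half on each strand.
reads_at_locus = ["+", "-", "+", "-", "+", "-"]

stranded_count = count_for_feature(reads_at_locus, "+", stranded=True)
unstranded_count = count_for_feature(reads_at_locus, "+", stranded=False)
print(stranded_count, unstranded_count)
```

In this toy case, requiring the same strand loses half of the reads, and they would be reported as having no feature.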

If this does not resolve the issue, post a screen shot from your genome browser, showing a disputed region, and maybe I can spot something.

Simon

**gen2prot** · 01-28-2011, 02:17 PM

Hi Simon,

Thanks for the input. Yes, I ran the program with --stranded=no, and immediately the no_feature count dropped from ~375000 to ~185000. Ambiguous and alignment_not_unique did not change much.

About the gene "FBgn0000003", I find that a lot of reads are associated with it. I am sending a screenshot. I investigated a few (10 reads), and find that they are "alignment_not_unique". Probably that is why htseq-count gives zero. But why does Cufflinks provide expression value for this? I would have thought that they account for ambiguity.
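For checking many reads at once, a short Python sketch like the following can look up the NH tag (number of reported alignments) for a list of read IDs in a SAM file. It assumes the aligner writes an `NH:i:` tag and that a value above 1 is what marks a read as non-unique; the SAM lines and read names below are invented for illustration.

```python
def nh_value(sam_line):
    """Return the integer NH tag of a SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM columns.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[5:])
    return None

def lookup_reads(sam_lines, read_ids):
    """Map each requested read ID to its NH value (last alignment seen wins)."""
    found = {}
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        qname = line.split("\t", 1)[0]
        if qname in read_ids:
            found[qname] = nh_value(line)
    return found

# Hypothetical SAM records (tab-separated), one unique and one multi-mapped read.
toy_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t0\tchr2L\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "read2\t0\tchr2L\t200\t3\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tNH:i:4",
]
nh_by_read = lookup_reads(toy_sam, {"read1", "read2"})
multi_mapped = sorted(r for r, nh in nh_by_read.items() if nh is not None and nh > 1)
print(nh_by_read, multi_mapped)
```

In practice the lines would come from iterating over the opened SAM file instead of the toy list.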

Thanks
Abhijit

Attached file: igv_snapshot.pdf (229.7 KB)

**Simon Anders** · 01-30-2011, 05:07 AM

Originally posted by gen2prot:

But why does Cufflinks provide expression value for this? I would have thought that they account for ambiguity.

Interesting question. Maybe send a mail to the Cufflinks developers and ask. If you do, please tell us what they say.

But note that this is not necessarily a bug. Depending on what you want to do with your data, I see two rationales on how to deal with reads that map to multiple places.

1. For htseq-count, I imagined the user to then use DESeq (or edgeR) to test for differential expression. Imagine we have two paralogous genes that have identical sequence over one half of their length and divergent sequence over the other half, and one of these genes is differentially expressed and the other is not. All reads that stem from the identical-sequence parts of the transcripts will map to both genes, and if we include them in our counts, both genes will appear to be differentially expressed, even though only one really is. If we count only the uniquely mapping reads (i.e., those stemming from the divergent parts of the transcripts), we are safe.

2. It seems to me that the cufflinks developers rather had in mind that the user has one sample and wishes to study which genes are strongly expressed and which are not. If we disregard non-uniquely mapped reads, all those genes that have highly similar paralogs will get fewer counts than genes that have diverged a lot since their last duplication event, i.e., the counts are biased according to the gene's within-genome evolutionary history.

This is especially an issue because cufflinks reports FPKM values, i.e., it divides by transcript length. Note that mappability is a feature of the reference, not of the read. If a part of a gene's sequence appears somewhere else in the genome, no read will map there uniquely, i.e., that part cannot get any reads. Hence it should be subtracted from the transcript length when calculating FPKM values. Alternatively, one could distribute non-uniquely mapped reads to the different places, i.e., count them fractionally for each of their possible mappings. Maybe this is what cufflinks does, but it makes the job of differential expression calling rather difficult, and I wonder how cuffdiff could take it into account.
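The paralog scenario from point 1 and the fractional alternative mentioned above can be put into numbers with a toy model. The sketch below assumes two hypothetical genes, P1 and P2, that share exactly half of their sequence, with reads spread evenly along each transcript; the expression values are invented.

```python
def count_schemes(expr_p1, expr_p2, shared_fraction=0.5):
    """Return (P1, P2) counts under three ways of handling shared-region reads."""
    # Reads from the identical half of either gene map to both genes.
    shared = shared_fraction * (expr_p1 + expr_p2)
    uniq_p1 = (1 - shared_fraction) * expr_p1
    uniq_p2 = (1 - shared_fraction) * expr_p2
    return {
        "unique_only": (uniq_p1, uniq_p2),                            # discard shared reads
        "count_all": (uniq_p1 + shared, uniq_p2 + shared),            # count them for both genes
        "fractional": (uniq_p1 + shared / 2, uniq_p2 + shared / 2),   # split them evenly
    }

# Condition A: both genes at 100 reads. Condition B: only P1 doubles.
cond_a = count_schemes(100, 100)
cond_b = count_schemes(200, 100)

# Apparent fold change (B over A) of the unchanged gene P2 under each scheme.
p2_fold = {name: cond_b[name][1] / cond_a[name][1] for name in cond_a}
for name, fold in p2_fold.items():
    print(f"{name}: apparent P2 fold change = {fold:.2f}")
```

Only the unique-only scheme reports no change for P2 in this toy case; the other two schemes let part of P1's signal leak into P2.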

Simon

**lpachter** · 01-30-2011, 08:38 AM

Cufflinks currently divides multi-reads (i.e., reads that map to multiple locations) uniformly in calculating expression values. This is not correct; however, it is worth pointing out that (probabilistic) assignment of multi-reads is mathematically equivalent to the assignment of reads that map uniquely but to multiple isoforms. In fact, the RSEM program correctly handles multi-reads. We have worked out an effective strategy for probabilistically assigning multi-reads within the Cufflinks framework, and the implementation is scheduled for the release-after-next (hopefully ~1 month away).

We have procrastinated dealing with multi-reads because, as read length has increased, they have been less of a problem. However, they still affect expression estimates in gene families, and we are sensitive to the fact that many users still use short reads. In terms of differential expression estimates, there is nothing difficult about incorporating probabilistic assignment of multi-reads: it works exactly the same way as the current method implemented in cuffdiff, even with replicates.

Lior
P.S. Actually, mappability is not a feature of the reference; it is a feature of the transcriptome. A read may map to multiple locations in a reference, but it may be possible to assign it to only one location if the other transcripts have very low expression (which can be estimated using other reads that map to them). In other words, mappability and expression estimation are intimately related, and fractional assignment of mappings is not an option; it is essential if one doesn't want to lose power by throwing away data.
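The difference between uniform and probabilistic assignment of multi-reads can be illustrated with a tiny expectation-maximization loop. This is a generic textbook-style sketch, not the Cufflinks or RSEM implementation; it assumes two hypothetical transcripts of equal length, and the read counts are invented.

```python
def uniform_assign(reads, transcripts):
    """Split each read group evenly across all compatible transcripts."""
    counts = {t: 0.0 for t in transcripts}
    for compatible, n in reads:
        for t in compatible:
            counts[t] += n / len(compatible)
    return counts

def em_assign(reads, transcripts, n_iter=100):
    """Assign multi-reads in proportion to iteratively estimated abundances."""
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # start uniform
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: share each read group according to current abundances.
        for compatible, n in reads:
            total = sum(theta[t] for t in compatible)
            for t in compatible:
                counts[t] += n * theta[t] / total
        # M-step: re-estimate relative abundances from the expected counts.
        n_reads = sum(counts.values())
        theta = {t: counts[t] / n_reads for t in transcripts}
    return counts

# (compatible transcripts, number of reads): 90 unique to T1, 10 unique to T2,
# and 100 multi-reads that fit both.
toy_reads = [(("T1",), 90), (("T2",), 10), (("T1", "T2"), 100)]
uniform_counts = uniform_assign(toy_reads, ["T1", "T2"])
em_counts = em_assign(toy_reads, ["T1", "T2"])
print(uniform_counts, em_counts)
```

Here the uniquely mapping reads indicate that T2 is weakly expressed, so the iterative scheme gives most of the multi-reads to T1, whereas the uniform split inflates T2.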

**gen2prot** · 01-31-2011, 10:16 AM

Hi Simon and Lior,

Thank you for your thoughts. What I am trying to do is find the ratio of autosomal expression between wild-type male Drosophila and a mutant. We have reason to believe that global autosomal gene expression is elevated in the mutant, so even if we wrongly call differential expression for some genes, it should not matter much on a global level. However, mapping reads correctly to call DGE is still important to me looking ahead. I had therefore tried using the Drosophila transcripts as the reference, rather than the genome, since I was not concerned with finding novel transcripts or genes. I ran bowtie-build on the transcripts and then ran tophat against the resulting indices. I then wrote a Perl script that counts as "one" any read that maps to two or more transcripts of the same gene, and disregards any read that hits transcripts belonging to different genes. (I did not have a properly formatted GTF file at the time; otherwise I could have just used tophat's GTF option on the reference genome.) I find that the results from running against the transcripts are not very different (when it comes to ratios) from those obtained against the genome.
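The counting rule described above can be sketched in Python as follows. This is a reconstruction of the stated logic only, not the original Perl script; the transcript and gene names are hypothetical.

```python
from collections import Counter

def count_by_gene(read_hits, transcript_to_gene):
    """Count each read once per gene; drop reads whose hits span several genes."""
    counts = Counter()
    discarded = 0
    for read_id, transcripts in read_hits.items():
        # Collapse the transcripts a read hits to the set of genes they belong to.
        genes = {transcript_to_gene[t] for t in transcripts}
        if len(genes) == 1:
            counts[genes.pop()] += 1   # one gene, possibly several isoforms
        else:
            discarded += 1             # hits in different genes (or no hits)
    return counts, discarded

# Hypothetical annotation: two isoforms of gene1, one of gene2.
tx2gene = {"gene1-RA": "gene1", "gene1-RB": "gene1", "gene2-RA": "gene2"}

# Hypothetical alignments: read ID -> transcripts it mapped to.
hits = {
    "read1": ["gene1-RA"],
    "read2": ["gene1-RA", "gene1-RB"],   # two isoforms of the same gene
    "read3": ["gene1-RB", "gene2-RA"],   # two different genes
}
gene_counts, n_discarded = count_by_gene(hits, tx2gene)
print(dict(gene_counts), n_discarded)
```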

But I agree that the problem of "counts biased according to the gene's within-genome evolutionary history" will still remain, as will the bias that arises when two completely unrelated genes share the same domain. Mate-pairs and longer read lengths seem to be the only ways to lessen this problem.

Thanks
Abhijit

HTSeq output not correlated with Cufflinks output... Help