Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
How to generate counts/statistics such as % mapped to known exons f/ cufflinks output mmcgo002 Bioinformatics 13 02-03-2016 10:43 PM
Using trimmed reads with HTseq-count before DESeq? aafc Bioinformatics 0 12-04-2012 11:24 AM
BWA rescue of multi-mapping or unmapped reads kjlee Bioinformatics 6 07-13-2012 10:29 PM
analysis of single and PE reads with htseq-count and DESeq dglemay Bioinformatics 1 04-26-2012 11:23 PM
DESeq: Read counts vs. BP counts burkard Bioinformatics 0 08-05-2010 11:52 PM

Thread Tools
Old 01-18-2013, 11:05 AM   #1
Hilary April Smith
Location: USA

Join Date: Jul 2011
Posts: 22
Default How to rescue multi-reads when using htseq to generate edgeR/DESeq counts?

Hi. After using Bowtie2/Tophat2 for mapping an RNA-Seq study, I ran my data through htseq and then have aimed for methods like edgeR/DESeq that allow a neg. binomial distribution -- I gathered for differential gene expression, this may be more appropriate than Cufflinks/CuffDiff...?

Yet because we just have Single End reads (100 bp HiSeq), there are a lot of multi-reads (=multi-hits / multi-mapped reads). HTSeq scores these as ambiguous and discards them; I'd like to rescue the multi-reads. Yet the Mortazavi method E-RANGE uses RPKM, and I'm unsure if I could reliably convert it back to raw counts for edgeR/DESeq. I found a newer package (BM-MAP) but am concerned (1) of the level of dependence it has on the estimate of polymorphism rates (if this is critical -- that could be a problem because I'm not sure if I can ascertain too reliably transcriptome-wide), and (2) I'm not sufficiently adept at coding to know how to translate the sam+ file it generates (appending a mapping probability to the end of each line in the sam file) into something I can put through htseq or some other program to generate counts, or to write my own script to do this. (3) Even if I CAN make this work, is it viable (or is the uncertainty due to using a Probability for multi-mapped, enough to make the DESeq/edgeR output unreliable)?

I'm tempted to just run Cuffdiff instead -- to avoid losing the potential information gleaned by including multi-reads (for downstream analysis it may be enough to know if a certain gene family is up/down-regulated, even if we can't get at the exact gene). Yet again I gather that may be more for isoform-specific comparisons vs differential Gene expression?

If anyone has time, suggestions are very welcome!
Hilary April Smith is offline   Reply With Quote
Old 05-02-2013, 10:41 PM   #2
Location: Leeds

Join Date: Jan 2010
Posts: 13

Hi Hilary, I am facing the same problem. At first I wrote a wrapper for a programme called SEQEM (, which worked fine until the amount of data we were producing became too much for the programme *Snag 1* I was then able to get BM-MAP to work and use the sam+ files to assign the most probable location, but it was also unable to cope with the level of data that we are producing from a HiSeq2000 run. *Snag 2*

One possibility for you would be to use RSEM. This uses an expectation maximisation approach to assign multiply mapped reads (for which the 'expected_count' output can be used in edgeR/DESeq or the programme EBSeq can, apparently, deal with RSEM output to give you DEGs). However, with RSEM you have to align to a transcriptome unless your genome is unspliced, as RSEM cannot cope with spliced reads. *Snag 3*

If you are not using matched samples then I believe that Cuffdiff would be a very good choice for you as it can deal with gene-level as well as isoform-level. My issue is that I have paired samples (disease and non-disease tissues from the same patient) and Cuffdiff does not use this info, meaning I don't get the power I want from the study. *Snag 4* I am, therefore, wanting to use edgeR or DESeq... which means I am back to square one with how to assign multireads so I can get my count data....

I am going to attempt to use Cufflinks to assign multiply mapped reads in the probabilistic manner it employs using the -u parameter, and then use the output FPKM values, per sample, to get count data (i.e. multiply by gene length and reads mapped) and then make use of the paired tests in edgeR (*phew*) but if anyone else has suggestions/comments regarding the issue of both a) using multireads in my analysis and b) performing tests on matched samples, I would love to hear them.

As an aside, I am also using RSEM so I can see how the 2 different approaches affects the lists of significantly DEGs I get at the end. If you are interested in the results I will post them here


EDIT: I have realised that there is a raw_frags column in the read_group_tracking output from cuffdiff so I am going to use that rather than fudging the FPKM

Last edited by SEQquestions; 05-02-2013 at 11:05 PM. Reason: spelling mistakes and a little more info needed
SEQquestions is offline   Reply With Quote
Old 05-06-2013, 08:09 AM   #3
Hilary April Smith
Location: USA

Join Date: Jul 2011
Posts: 22

Hi. Statistically our design is more complex than pairwise comparisons, so we also have trouble applying Cufflinks/Cuffdiff. I'm not (yet ... pending if I can find the time to teach myself) adept at perl/python parsing so I admit I was also a bit swayed against BM-MAP due to being unsure of an efficient way to convert the sam+ into a usable, standard sam format.

I know there are some issues in converting from the Cufflinks output to edgeR and other count-based approaches ( Though I think there's some option for raw counts now vs FPKM in one of the outputs.
If memory serves me right, RSEM also uses the FPKM approach and wouldn't be easily compatible with edgeR/DESeq.

So I have yet to find a good solution. I was thinking of a hybrid approach: basing most analysis on the more conservative (and hopefully easily defensible) route of discarding multi-mapped reads (using htseq with the "union" option, then edgeR) and just for comparison, perhaps also reporting what a Cufflinks/Cuffdiff run would yield. It's not ideal, especially because our data is single-end so of course we can't assign all reads -- and for our purposes it'd be good enough to know if a read belonged to gene family X (or to gene Y vs knowing the exact isoform -- we're looking for the big picture first).

Thank you for your insight. The Tuxedo suite/Cufflinks has some really nice features, as does the htseq/edgeR/DESeq route; it's too bad there isn't a way to combine the best of both worlds. I think the fact that the uncertainty from mapping multi-hits isn't accounted for with edgeR etc., also adds a level of error -- and that the extent of the possible error / P value inflation isn't known (unless someone has done a simulation study I'm not aware of ...).
Hilary April Smith is offline   Reply With Quote
Old 05-06-2013, 11:07 AM   #4
Simon Anders
Senior Member
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

See here for an explanation of the rational behind htseq-count's discarding of non-unique alignments:
Simon Anders is offline   Reply With Quote

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 08:27 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO