View Single Post
Old 07-23-2015, 09:11 AM   #2
Senior Member
Location: Bay Area

Join Date: Jun 2012
Posts: 105

Originally Posted by chumho View Post
Hi guys,

I'm really confused by the numbers between tophat & htseq-count.

There was a huge difference between __alignment_not_unique from htseq-count and unmapped reads from tophat. Here's the data:

I mapped single-end, 50-bp short reads to mouse genome using tophat2.

alignment_summary.txt from tophat showed this statistics below:

Input : 59054037
Mapped : 57231649 (96.9% of input)
of these: 10102891 (17.7%) have multiple alignments (5593 have >20)
96.9% overall read mapping rate.

However, when I used htseq-count (v0.6.1p1, -s no) to do a raw count from accepted_hits.bam. I got a large number of non-unique alignment. See below:

__no_feature 9924659
__ambiguous 1401163
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 25773467

unmappable reads from tophat = 59045037-57231649 ~= 1.8M, but htseq-count was 25M

Does it make sense to compare these two numbers? What might be wrong? BTW, both tophat and htseq-count used the same genes.gtf.

Also, when I added up the numbers from the log file produced by htseq-count, the counts added up ~73M (mapped+no feature+ambiguous+not_unique), which was much higher than the 59M reads printed from tophat (and fastqc). Why?

Many thanks.
The tophat output says that 17.7% of your reads ~10M have multiple alignments. Those will all get counted as _alignment_no_unique since they can't be assigned to a unique site in the genome. Additionally, they'll show up multiple times in the sam/bam file, once for each possible alignment, which is why you can have more overall counts from htseq-count than you have reads.
cmbetts is offline   Reply With Quote