Hi guys,
I'm really confused by the numbers between tophat & htseq-count.
There was a huge difference between __alignment_not_unique from htseq-count and unmapped reads from tophat. Here's the data:
I mapped single-end, 50-bp short reads to mouse genome using tophat2.
alignment_summary.txt from tophat showed this statistics below:
Reads:
Input : 59054037
Mapped : 57231649 (96.9% of input)
of these: 10102891 (17.7%) have multiple alignments (5593 have >20)
96.9% overall read mapping rate.
However, when I used htseq-count (v0.6.1p1, -s no) to do a raw count from accepted_hits.bam. I got a large number of non-unique alignment. See below:
__no_feature 9924659
__ambiguous 1401163
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 25773467
unmappable reads from tophat = 59045037-57231649 ~= 1.8M, but htseq-count was 25M
Does it make sense to compare these two numbers? What might be wrong? BTW, both tophat and htseq-count used the same genes.gtf.
Also, when I added up the numbers from the log file produced by htseq-count, the counts added up ~73M (mapped+no feature+ambiguous+not_unique), which was much higher than the 59M reads printed from tophat (and fastqc). Why?
Many thanks.
I'm really confused by the numbers between tophat & htseq-count.
There was a huge difference between __alignment_not_unique from htseq-count and unmapped reads from tophat. Here's the data:
I mapped single-end, 50-bp short reads to mouse genome using tophat2.
alignment_summary.txt from tophat showed this statistics below:
Reads:
Input : 59054037
Mapped : 57231649 (96.9% of input)
of these: 10102891 (17.7%) have multiple alignments (5593 have >20)
96.9% overall read mapping rate.
However, when I used htseq-count (v0.6.1p1, -s no) to do a raw count from accepted_hits.bam. I got a large number of non-unique alignment. See below:
__no_feature 9924659
__ambiguous 1401163
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 25773467
unmappable reads from tophat = 59045037-57231649 ~= 1.8M, but htseq-count was 25M
Does it make sense to compare these two numbers? What might be wrong? BTW, both tophat and htseq-count used the same genes.gtf.
Also, when I added up the numbers from the log file produced by htseq-count, the counts added up ~73M (mapped+no feature+ambiguous+not_unique), which was much higher than the 59M reads printed from tophat (and fastqc). Why?
Many thanks.
Comment