**cmbetts** · 07-23-2015, 09:11 AM

Originally posted by chumho

Hi guys,

I'm really confused by the numbers between tophat & htseq-count.

There was a huge difference between __alignment_not_unique from htseq-count and unmapped reads from tophat. Here's the data:

I mapped single-end, 50-bp short reads to the mouse genome using tophat2.

alignment_summary.txt from tophat showed the following statistics:

Reads:
Input : 59054037
Mapped : 57231649 (96.9% of input)
of these: 10102891 (17.7%) have multiple alignments (5593 have >20)
96.9% overall read mapping rate.

However, when I used htseq-count (v0.6.1p1, -s no) to do a raw count from accepted_hits.bam, I got a large number of non-unique alignments. See below:

__no_feature 9924659
__ambiguous 1401163
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 25773467

unmappable reads from tophat = 59054037-57231649 ~= 1.8M, but htseq-count was 25M

Does it make sense to compare these two numbers? What might be wrong? BTW, both tophat and htseq-count used the same genes.gtf.

Also, when I added up the numbers from the log file produced by htseq-count, the counts added up to ~73M (mapped+no feature+ambiguous+not_unique), which was much higher than the 59M reads reported by tophat (and fastqc). Why?

Many thanks.

The tophat output says that 17.7% of your reads (~10M) have multiple alignments. Those will all get counted as __alignment_not_unique since they can't be assigned to a unique site in the genome. Additionally, they'll show up multiple times in the sam/bam file, once for each possible alignment, which is why you can have more overall counts from htseq-count than you have reads.
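As a rough illustration of that counting behaviour, here is a minimal Python sketch. It is a toy model, not htseq-count's actual code: the function name `count_records` and the read names are hypothetical, and it assumes the aligner writes one record per reported alignment and that the counter is incremented once per record.

```python
from collections import Counter

def count_records(read_hits):
    """Toy model: read_hits maps a read name to its number of reported alignments.

    Hypothetical helper for illustration only; it does not parse a real BAM file.
    """
    counters = Counter()
    for name, n_hits in read_hits.items():
        # Assumption: one SAM/BAM record is written per reported alignment.
        for _ in range(n_hits):
            if n_hits > 1:
                # Every record of a multi-mapped read bumps the counter.
                counters["__alignment_not_unique"] += 1
            else:
                # Uniquely aligned reads end up in a gene, __no_feature or __ambiguous.
                counters["uniquely_aligned"] += 1
    return counters

# Three reads: one unique, one with 6 alignments, one with 2 alignments.
toy = {"read_a": 1, "read_b": 6, "read_c": 2}
result = count_records(toy)
print(len(toy), "reads ->", sum(result.values()), "records:", dict(result))
```

In this toy input, three reads produce nine records, eight of which land in `__alignment_not_unique`, so the counter is in units of alignment records rather than reads.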

**chumho** · 07-23-2015, 10:13 AM

Thanks cmbetts for the reply.

I agree that "The tophat output says that 17.7% of your reads (~10M) have multiple alignments. Those will all get counted as __alignment_not_unique since they can't be assigned to a unique site in the genome." So that explains the 1.8M I calculated.

"Additionally, they'll show up multiple times in the sam/bam file, once for each possible alignment, which is why you can have more overall counts from htseq-count than you have reads." Are you saying the unit of "__alignment_not_unique" is alignments instead of reads? E.g. one non-unique read mapped to 6 locations will be added 6 times to "__alignment_not_unique" by htseq-count, but it is counted as 1 by tophat?

**cmbetts** · 07-23-2015, 02:16 PM

I think you're misinterpreting the tophat output. You have 1.8M unmapped reads. Tophat doesn't even put those in the bam file, and if it did, they would be counted as __not_aligned.
While tophat was able to map 96.9% of your reads (57.2M), 17.7% of those (10.1M) are not uniquely aligned, meaning that they have equally good alignments to two or more places in the genome. Each of those reads has as many occurrences in the bam file as it has valid alignments, so the bam contains a minimum of 20.2M occurrences of those reads. Some of them have more than two alignments, which brings the total up to the ~25.8M reported by htseq-count.

From the htseq-count FAQ:
Why is the sum of all counts different from the number of reads in my FASTQ file?
A read with more than one reported alignment appears only once in the FASTQ file, but several times in the SAM file (once for each alignment), and each time htseq-count encounters one, it increases the __alignment_not_unique counter by one. Therefore, multiply aligned reads are counted multiple times.
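The arithmetic behind those figures can be checked with a short Python sketch. All numbers come from the tophat summary and htseq-count output posted above; the variable names are just for illustration, and the last line assumes each uniquely aligned single-end read contributes exactly one record to htseq-count's totals.

```python
# Figures copied from alignment_summary.txt and the htseq-count output in this thread.
input_reads = 59_054_037
mapped_reads = 57_231_649
multi_mapped_reads = 10_102_891
not_unique_records = 25_773_467  # __alignment_not_unique

# Reads tophat could not map at all (absent from accepted_hits.bam).
unmapped_reads = input_reads - mapped_reads

# Reads with exactly one reported alignment.
unique_reads = mapped_reads - multi_mapped_reads

# Each multi-mapped read has at least two records in the BAM file.
min_not_unique_records = 2 * multi_mapped_reads

# Average number of records per multi-mapped read.
mean_hits = not_unique_records / multi_mapped_reads

# Assumption: one record per uniquely aligned read, plus all non-unique records.
total_records = unique_reads + not_unique_records

print(f"unmapped reads:            {unmapped_reads:,}")
print(f"uniquely aligned reads:    {unique_reads:,}")
print(f"min. non-unique records:   {min_not_unique_records:,}")
print(f"mean hits per multi-read:  {mean_hits:.2f}")
print(f"total records counted:     {total_records:,}")
```

Under that assumption the total comes to roughly 72.9M records, which is consistent with the ~73M sum reported in the original question.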

**chumho** · 07-24-2015, 06:20 AM

Thanks. It makes sense now.

htseq-count __alignment_not_unique vs tophat