SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count paolo.kunder Bioinformatics 10 10-22-2014 05:45 AM
htseq-count low count problem gandalf886 Bioinformatics 3 08-23-2014 08:05 AM
HTseq-count of zero halffedelf RNA Sequencing 14 04-17-2014 11:31 AM
htseq-count error with tophat input moser Introductions 4 10-21-2013 01:35 AM
multiBamCov or htseq-count to count read per feature ? NicoBxl Bioinformatics 1 07-03-2012 03:05 AM

Reply
 
Thread Tools
Old 07-23-2015, 09:41 AM   #1
chumho
Junior Member
 
Location: USA

Join Date: Jan 2013
Posts: 4
Default htseq-count __alignment_not_unique vs tophat

Hi guys,

I'm really confused by the numbers between tophat & htseq-count.

There was a huge difference between __alignment_not_unique from htseq-count and unmapped reads from tophat. Here's the data:

I mapped single-end, 50-bp short reads to mouse genome using tophat2.

alignment_summary.txt from tophat showed this statistics below:

Reads:
Input : 59054037
Mapped : 57231649 (96.9% of input)
of these: 10102891 (17.7%) have multiple alignments (5593 have >20)
96.9% overall read mapping rate.

However, when I used htseq-count (v0.6.1p1, -s no) to do a raw count from accepted_hits.bam. I got a large number of non-unique alignment. See below:

__no_feature 9924659
__ambiguous 1401163
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 25773467

unmappable reads from tophat = 59045037-57231649 ~= 1.8M, but htseq-count was 25M

Does it make sense to compare these two numbers? What might be wrong? BTW, both tophat and htseq-count used the same genes.gtf.

Also, when I added up the numbers from the log file produced by htseq-count, the counts added up ~73M (mapped+no feature+ambiguous+not_unique), which was much higher than the 59M reads printed from tophat (and fastqc). Why?

Many thanks.
chumho is offline   Reply With Quote
Old 07-23-2015, 10:11 AM   #2
cmbetts
Member
 
Location: Bay Area

Join Date: Jun 2012
Posts: 93
Default

Quote:
Originally Posted by chumho View Post
Hi guys,

I'm really confused by the numbers between tophat & htseq-count.

There was a huge difference between __alignment_not_unique from htseq-count and unmapped reads from tophat. Here's the data:

I mapped single-end, 50-bp short reads to mouse genome using tophat2.

alignment_summary.txt from tophat showed this statistics below:

Reads:
Input : 59054037
Mapped : 57231649 (96.9% of input)
of these: 10102891 (17.7%) have multiple alignments (5593 have >20)
96.9% overall read mapping rate.

However, when I used htseq-count (v0.6.1p1, -s no) to do a raw count from accepted_hits.bam. I got a large number of non-unique alignment. See below:

__no_feature 9924659
__ambiguous 1401163
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 25773467

unmappable reads from tophat = 59045037-57231649 ~= 1.8M, but htseq-count was 25M

Does it make sense to compare these two numbers? What might be wrong? BTW, both tophat and htseq-count used the same genes.gtf.

Also, when I added up the numbers from the log file produced by htseq-count, the counts added up ~73M (mapped+no feature+ambiguous+not_unique), which was much higher than the 59M reads printed from tophat (and fastqc). Why?

Many thanks.
The tophat output says that 17.7% of your reads ~10M have multiple alignments. Those will all get counted as _alignment_no_unique since they can't be assigned to a unique site in the genome. Additionally, they'll show up multiple times in the sam/bam file, once for each possible alignment, which is why you can have more overall counts from htseq-count than you have reads.
cmbetts is offline   Reply With Quote
Old 07-23-2015, 11:13 AM   #3
chumho
Junior Member
 
Location: USA

Join Date: Jan 2013
Posts: 4
Default

Thanks cmbetts for the reply.

I agree that "The tophat output says that 17.7% of your reads ~10M have multiple alignments. Those will all get counted as _alignment_no_unique since they can't be assigned to a unique site in the genome." So that explains the 1.8M I calculated.

"Additionally, they'll show up multiple times in the sam/bam file, once for each possible alignment, which is why you can have more overall counts from htseq-count than you have reads." Are you saying the unit of "__alignment_not_unique" is times instead of reads? E.g. One non-unique read mapped 6 locations will be added 6 times in "__alignment_not_unique" by htseq-count but it is counted as 1 by tophat?
chumho is offline   Reply With Quote
Old 07-23-2015, 03:16 PM   #4
cmbetts
Member
 
Location: Bay Area

Join Date: Jun 2012
Posts: 93
Default

Think that you're misinterpreting the tophat output. You have 1.8M unmapped reads. Tophat doesn't even put those in the bam file, and if it did, they would be counted as _not_aligned.
While tophat was able to map 96.9% of your reads (57.2M), 17.7% of those (10.1M) are not uniquely aligned, meaning that they have equally good alignments to two or more places in the genome. For those reads, they will have as many occurrences in the bam file as they have valid alignments, which means the bam has a minimum of 20.2M occurrences of those reads and some of those reads will have more than just two alignments, bringing the total up to the 25.7M seen by running htseq-count.

from the htseq-count FAQ:
Why is the sum of all counts different from the number of reads in my FASTQ file?
A read with more than one reported alignment appears only once in the FASTQ file, but several times in the SAM file (once for each alignment), and each time htseq-count encounters one, it increased the __alignment_not_unique counter by one. Therefore, mutiply aligned reads are counted multiple times.
cmbetts is offline   Reply With Quote
Old 07-24-2015, 07:20 AM   #5
chumho
Junior Member
 
Location: USA

Join Date: Jan 2013
Posts: 4
Default

Thanks. It makes sense now.
chumho is offline   Reply With Quote
Reply

Tags
htseq-count, tophat, __alignment_not_unique

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:04 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO