Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count paolo.kunder Bioinformatics 10 10-22-2014 04:45 AM
Strange error when using htseq-count shhuang Bioinformatics 13 11-19-2012 12:40 AM
htseq-count error sissi Bioinformatics 0 03-20-2012 11:40 PM
htseq-count INPUT to big. Palgrave Bioinformatics 3 03-08-2012 04:45 AM
htseq-count output Palgrave Bioinformatics 7 03-05-2012 07:04 AM

Thread Tools
Old 03-27-2012, 10:08 PM   #1
Location: California

Join Date: Jan 2010
Posts: 12
Default htseq-count gets more reads?


My sam file had been sorted by read names using 'samtools sort -n'. Then I used htseq-count to count reads in this sam file; I also double checked by counting unique read names using awk/uniq/wc. The sum of htseq-count output is larger than the number of unique read names. How can this happen? htseq-count count some reads multiple times? Thanks for any hints!
deepsea is offline   Reply With Quote
Old 03-29-2012, 05:57 AM   #2
Senior Member
Location: Marburg, Germany

Join Date: Oct 2009
Posts: 110

My guess: you have reads that were mapped to multiple locations, and htseq doesn't remove duplicates.

You should get the same count if you drop the uniq form your command line (which amounts just to samtools view | wc -l, I guess)
ffinkernagel is offline   Reply With Quote
Old 03-29-2012, 06:50 AM   #3
Senior Member
Location: US

Join Date: Jan 2009
Posts: 392

How did you sum up the htseq-count output? Because at the end of the htseq-count output you get these categories:


The last category is the number of reads with multiple hits, if you simply summed up all your output, without disregarding this last category then its going to be larger.
chadn737 is offline   Reply With Quote
Old 03-29-2012, 11:27 AM   #4
Location: California

Join Date: Jan 2010
Posts: 12

Thanks for the responses, guys!

I think I find the reason. The number of 'alignment_not_unique' is not what I expected.

First of all, my understanding is the reads with multiple hits are not counted in any genes.

So if it is true, and if 'alignment_not_unique' is the number of reads with multiple hits, the sum of reads mapped in genes (uniquely), reads not in genes (uniquely), ambiguous, multiple hits will be exact the total number in the SAM file.

What I did is: using 'NH:i' tag, I separated alignments into two sam files, one is unique mappings (NH:i:1), the other is multiple hits (NH:i:n, n>1). Then ran htseq-count on both, and counted the unique read IDs in both sam files.

In total, I have 12.8M reads, 11.9M in the unique mapping sam, 0.9M in the other sam.

In the htseq-count of the unique sam, the 'alignment_not_unique' category is 0, and the sum of all genes and other categories is 11.8M (very close to 11.9, I am satisfied with it).

In the htseq-cout of the other sam, all other genes and categories are 0, and the 'alignment_not_unique' category is 2.8M. Remember that sam has 0.9M unique read IDs, and 4.1M lines.

So my conclusions:

1. reads won't be count multiple times; if they cannot be uniquely mapped, they are counted in some categories.

2. the numbers of genes and other categories are the number of reads; but the number of 'alignment_not_unique' is neither the number of reads nor the number of alignments. (my data is pair-end sequencing.)

The reason I need these numbers is that I want to understand the RNA compositions (proportions of different regions), which is useful when comparing different biological samples.

Right now, I take all numbers in htseq-count output except 'alignment_not_unique', then add the number of unique IDs with NH:i:n tag, the sum is (almost) what I expected.

Simon, if you read this thread, do you think it is good to make this number is also a number of reads, so the whole output is consistent?
deepsea is offline   Reply With Quote

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 10:00 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO