**ffinkernagel** · 03-29-2012, 05:57 AM

My guess: you have reads that were mapped to multiple locations, and htseq doesn't remove duplicates.

You should get the same count if you drop the uniq from your command line (which then amounts to just samtools view | wc -l, I guess).
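To make the difference concrete, here is a minimal Python sketch that counts alignment lines versus distinct read names in SAM-formatted text. The function name and the example records are made up for illustration, and it assumes plain tab-separated SAM lines with the read name (QNAME) in the first column. Note that for paired-end data both mates share a read name, so the second number counts fragments.

```python
def count_alignments_and_reads(sam_lines):
    """Return (number of alignment lines, number of distinct read names)."""
    n_lines = 0
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        n_lines += 1
        names.add(line.split("\t", 1)[0])  # QNAME is the first column
    return n_lines, len(names)


# Hypothetical records: read2 is reported at two locations.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "read2\t0\tchr1\t200\t1\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "read2\t256\tchr2\t300\t1\t50M\t*\t0\t0\t*\t*\tNH:i:2",
]
print(count_alignments_and_reads(example))  # (3, 2)
```

If the two numbers differ, some reads appear on more than one line, which is what you would expect with multiple hits.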

**chadn737** · 03-29-2012, 06:50 AM

How did you sum up the htseq-count output? Because at the end of the htseq-count output you get these categories:

no_feature
ambiguous
too_low_aQual
not_aligned
alignment_not_unique

The last category is the number of reads with multiple hits. If you simply summed up all of your output without excluding this last category, then the total is going to be larger.
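A rough Python sketch of summing the output while keeping these categories apart might look like this. It assumes the output is two tab-separated columns (name, count); the function name and the example numbers are invented for illustration. Newer htseq-count releases prefix the special counters with two underscores (for example `__no_feature`), so the sketch accepts both spellings.

```python
SPECIAL = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def split_counts(lines):
    """Separate per-feature counts from the special counters at the end."""
    features, special = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, value = line.split("\t")
        key = name.lstrip("_")  # tolerate the "__" prefix of newer versions
        if key in SPECIAL:
            special[key] = int(value)
        else:
            features[name] = int(value)
    return features, special


# Hypothetical htseq-count output, not real data.
example_output = [
    "geneA\t10",
    "geneB\t5",
    "no_feature\t3",
    "ambiguous\t1",
    "too_low_aQual\t0",
    "not_aligned\t0",
    "alignment_not_unique\t7",
]
features, special = split_counts(example_output)
in_genes = sum(features.values())
print(in_genes)                          # 15
print(special["alignment_not_unique"])   # 7, reported separately
```

Summing only `features` gives the reads assigned to genes; the special counters can then be added or left out one by one, depending on what total you are trying to reproduce.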

**deepsea** · 03-29-2012, 11:27 AM

Thanks for the responses, guys!

I think I found the reason. The number for 'alignment_not_unique' is not what I expected.

First of all, my understanding is that reads with multiple hits are not counted for any gene.

If that is true, and if 'alignment_not_unique' is the number of reads with multiple hits, then the sum of reads uniquely mapped in genes, reads uniquely mapped outside genes, ambiguous reads, and reads with multiple hits should be exactly the total number of reads in the SAM file.

What I did: using the 'NH:i' tag, I separated the alignments into two SAM files, one with unique mappings (NH:i:1) and the other with multiple hits (NH:i:n, n>1). I then ran htseq-count on both and counted the unique read IDs in each SAM file.
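The split can be sketched in Python roughly as follows. This is only an illustration of the idea, not the exact commands I ran: the function names and example records are made up, it assumes the aligner writes an `NH:i:` tag among the optional fields (column 12 onward), and alignments without that tag are simply dropped.

```python
def nh_value(fields):
    """Return the integer value of the NH:i tag, or None if it is absent."""
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[5:])
    return None


def split_by_nh(sam_lines):
    """Split SAM lines into (headers, unique alignments, multi-hit alignments)."""
    headers, unique, multi = [], [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
            continue
        nh = nh_value(line.split("\t"))
        if nh == 1:
            unique.append(line)
        elif nh is not None and nh > 1:
            multi.append(line)
        # alignments without an NH tag are ignored in this sketch
    return headers, unique, multi


def unique_read_ids(alignments):
    """Return the set of distinct read names (first SAM column)."""
    return {line.split("\t", 1)[0] for line in alignments}


# Hypothetical records: read2 has three reported locations.
records = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "read2\t0\tchr1\t200\t1\t50M\t*\t0\t0\t*\t*\tNH:i:3",
    "read2\t256\tchr2\t300\t1\t50M\t*\t0\t0\t*\t*\tNH:i:3",
    "read2\t256\tchr3\t400\t1\t50M\t*\t0\t0\t*\t*\tNH:i:3",
]
headers, unique, multi = split_by_nh(records)
print(len(unique), len(multi), len(unique_read_ids(multi)))  # 1 3 1
```

Writing `headers + unique` and `headers + multi` to two files gives the two inputs for htseq-count, and `unique_read_ids` gives the read counts to compare against.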

In total, I have 12.8M reads, 11.9M in the unique mapping sam, 0.9M in the other sam.

In the htseq-count of the unique sam, the 'alignment_not_unique' category is 0, and the sum of all genes and other categories is 11.8M (very close to 11.9, I am satisfied with it).

In the htseq-count output of the other SAM file, all genes and other categories are 0, and the 'alignment_not_unique' category is 2.8M. Remember that this SAM file has 0.9M unique read IDs and 4.1M lines.

So my conclusions:

1. Reads won't be counted multiple times; if they cannot be uniquely assigned, they are counted in one of the special categories.

2. The numbers for genes and the other categories are numbers of reads, but the number for 'alignment_not_unique' is neither the number of reads nor the number of alignments. (My data is paired-end sequencing.)

The reason I need these numbers is that I want to understand the RNA compositions (proportions of different regions), which is useful when comparing different biological samples.

Right now, I take all numbers in the htseq-count output except 'alignment_not_unique', then add the number of unique read IDs with an NH:i:n tag (n>1); the sum is (almost) what I expected.
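As a sketch of that bookkeeping in Python (the function name and all numbers below are invented for illustration, and it assumes the gene counts and special counters have already been read into two dictionaries):

```python
def composition(features, special, multi_hit_reads):
    """Total read count and per-category proportions.

    'alignment_not_unique' from htseq-count is ignored and replaced by
    multi_hit_reads, the number of distinct read IDs with NH:i:n, n>1.
    """
    counts = {"in_genes": sum(features.values())}
    for name, value in special.items():
        if name != "alignment_not_unique":
            counts[name] = value
    counts["multiple_hits"] = multi_hit_reads
    total = sum(counts.values())
    if total == 0:
        return 0, {}
    return total, {name: value / total for name, value in counts.items()}


# Hypothetical numbers, not my real data.
features = {"geneA": 4, "geneB": 2}
special = {
    "no_feature": 1,
    "ambiguous": 1,
    "too_low_aQual": 0,
    "not_aligned": 0,
    "alignment_not_unique": 7,
}
total, proportions = composition(features, special, multi_hit_reads=2)
print(total)  # 10
for name, fraction in proportions.items():
    print(name, round(fraction, 3))
```

The proportions then sum to 1 over reads, which is the form I need for comparing samples.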

Simon, if you read this thread, do you think it would be good to make this number a number of reads as well, so that the whole output is consistent?

htseq-count gets more reads?
