Unconfigured Ad

**chadn737** · 05-05-2012, 05:59 AM

This category here:

alignment_not_unique 9401567

You have over 9 million reads without unique alignments. These reads map at a minimum of 2 places, maybe more. So at the minimum, the category would add twice as many alignments, if not more.

What you got to remember is that your bam/sam file contains alignments, not reads. You can have multiple alignments per read and these will each be a different line. Htseq-count counts reads, not alignments. So the total will always be smaller than the number of alignments in your bam/sam.

**oliviera** · 05-05-2012, 07:03 AM

It make a lot f sense, indeed. Thanx for you fast reply

Olivier

**fpenagarican** · 03-22-2013, 10:37 AM

I completely agree with your answer. Now, I have a closely related question. I will start with a description of my problem in order to state precisely my question:

Firstly, I count the total number of alignments in my file.sam

total.alignments = wc -l file.sam

Secondly, I count the number of uniquely mapped reads in my file.sam

uniquely.mapped.reads = cut -f 1 file.sam | sort | uniq -u | wc -l

Interestingly, alignment_not_unique (in the output of htseq-count) is the difference of these two numbers,

alignment_not_unique = total.alignments - uniquely.mapped.reads

Thirdly, I count the total number of reads reported by htseq-count excluding alignment_not_unique

count.htseq = inside genes (sum of counts) + no_feature + ambiguous + too_low_aQual + not_aligned

Now, I thought that this number should be equal to the number of uniquely mapped reads BUT it is not - actually count.htseq is smaller than uniquely.mapped.genes.

count.htseq < uniquely.mapped.reads

Why is this? Could you explain me?

Thanks in advance.

**Simon Anders** · 03-22-2013, 10:53 AM

Three possibilities

- You have paired-end data, and some reads are missing their mates. (Shouldn't happen but does with some aligners.)

- You have a chromosome with no genes on it (often the mitochondrion, which tends to be missing in GTF files). The reads from this chromosome should be added to no_feature but weren't until a week ago, when I finally fixed this bug.

- Another bug that I haven't figured out yet.

**chadn737** · 03-22-2013, 11:03 AM

Smaller by how much? A lot or only a few lines?

**fpenagarican** · 03-22-2013, 11:29 AM

Thanks for the answers. (1) I am working with single-end reads and (2) I do not have chrM in my GTF file. Anyway, it seems that the difference is important - more than 3M reads.

These are the numbers:

total alignments in file.sam = 19,609,136

uniquely mapped reads = 13,804,648

Here is the output of htseq-count:

inside genes (sum of counts) = 7,579,921
no_feature = 3,037,647
ambiguous = 113513
too_low_aQual = 0
not_aligned = 0
alignment_not_unique = 5,804,488

As I said before, alignment_not_unique = total alignments - uniquely mapped reads = 19,609,136 - 13,804,648 = 5,804,488

Now, inside genes + no_feature + ambiguos = 10,731,081 and uniquely mapped reads are 13,804,648

By the way, I am using htseq-count version 0.5.3p9.

What do you think? Thanks again.

**Simon Anders** · 03-22-2013, 11:34 AM

Strange. If you want to get to the bottom of this, use the '--samout' option. This writes out a SAM file which should contain each line fo the initial SAM file, with an extra fielded added indicating how the line was counted. And then you could compare the original and the new SAM file. Same number of lines? Does the number of lines containing the word "no_feature" fit? Etc. Thanks in advance if you have time to check this.

**fpenagarican** · 03-22-2013, 11:59 AM

Thanks Simon. I will check this issue. By the way, you told me that you had fixed the bug related to chromosomes with no genes. When are you planing to release the new version? I would like to use it and see what happens. Thanks again.

**Simon Anders** · 03-22-2013, 12:07 PM

The fix is in v0.5.4p1, which I uploaded to PyPI two weeks ago.

**kmcarr** · 03-22-2013, 01:15 PM

Originally posted by Simon Anders View Post

The fix is in v0.5.4p1, which I uploaded to PyPI two weeks ago.

I'm sorry to thread-jack this but I found that v0.5.4.1p1 does not work with python 2.5 (tested version is 2.5.2). Trying to run htseq-count produced the following error:

Code:

htseq-count -h
Traceback (most recent call last):
  File "/usr/local/bin/htseq-count", line 5, in <module>
    pkg_resources.run_script('HTSeq==0.5.4p1', 'htseq-count')
  File "build/bdist.linux-i686/egg/pkg_resources.py", line 489, in run_script
  File "build/bdist.linux-i686/egg/pkg_resources.py", line 1207, in run_script
  File "/usr/local/lib/python2.5/site-packages/HTSeq-0.5.4p1-py2.5-linux-x86_64.egg/EGG-INFO/scripts/htseq-count", line 3, in <module>
    import HTSeq.scripts.count
  File "/usr/local/lib/python2.5/site-packages/HTSeq-0.5.4p1-py2.5-linux-x86_64.egg/HTSeq/__init__.py", line 836
    except ValueError as e:
                       ^
SyntaxError: invalid syntax

Google taught me that python 2.5 doesn't have the 'as' keyword. Changing line 836 to:

Code:

        except ValueError, e:

appears to have fixed the problem but I have not fully tested it yet.

**fpenagarican** · 03-24-2013, 06:57 PM

Simon:

Just to let you know that I have tried with the new version of htseq-count (v0.5.4.1p1) and now everything works great, i.e.

inside genes (sum of counts) + no_feature + ambiguos + too_low_aQual + not_aligned = uniquely mapped reads.

Thanks again.

**Simon Anders** · 03-24-2013, 11:10 PM

Thanks, that's reassuring.

**abebe** · 07-25-2013, 06:50 AM

I am interested to know the identity of reads that map/do not map in a sam file. Say I got the following numbers when I run HTSeq:

total number of reads = 3,771,311
no_feature = 123,330
ambiguous = 100,500
too_low_aQual = 0
not_aligned = 0
alignment_not_unique = 100,000

How do I extract IDs and the corresponding sequences (in fastq) for reads that map/do not map from the sam file generated by HTSeq?

As always your help is appreciated.

Thanks.

Tilahun

**Simon Anders** · 07-25-2013, 06:56 AM

use the "--samout" option

Topics	Statistics	Last Post
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 22 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 28 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 39 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 61 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM

Unconfigured Ad

total number of counts with HTseq

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News