SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXSeq Using Counts File From htseq-count FuzzyCoder Bioinformatics 20 01-03-2016 11:18 PM
Compare RNA counts: HTSeq vs Partek schaffer RNA Sequencing 2 12-02-2011 09:01 AM
htseq-count with warning for every read to represent all of zero counts in output hibachings2013 RNA Sequencing 10 07-15-2011 10:19 AM
understanding HTSeq counts nimmi Bioinformatics 3 11-27-2010 07:24 PM
DESeq: Read counts vs. BP counts burkard Bioinformatics 0 08-05-2010 11:52 PM

Reply
 
Thread Tools
Old 05-05-2012, 03:19 AM   #1
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default total number of counts with HTseq

Dear all,

I use HTseq with paired end data and would like to calculate the pourcentage of reads that do fall into an annotated genes. To do this I thought to add all count inside genes + no_feature + ambigous + too_low_qual + not aligned + not_unique. THe problem is that I thought this number should be equal to the number of reads inside my original BAM/SAM file, but this is not the case. Can someone explain me how to get these numbers?

Here are the details from HTseq:
no_feature 6084302
ambiguous 1921332
too_low_aQual 0
not_aligned 0
alignment_not_unique 9401567
inside genes (sum of counts): 74046502
Total= 91453703

Number of alignment in my SAM file (paired end)= 160754220
oliviera is offline   Reply With Quote
Old 05-05-2012, 05:59 AM   #2
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

This category here:

alignment_not_unique 9401567

You have over 9 million reads without unique alignments. These reads map at a minimum of 2 places, maybe more. So at the minimum, the category would add twice as many alignments, if not more.

What you got to remember is that your bam/sam file contains alignments, not reads. You can have multiple alignments per read and these will each be a different line. Htseq-count counts reads, not alignments. So the total will always be smaller than the number of alignments in your bam/sam.
chadn737 is offline   Reply With Quote
Old 05-05-2012, 07:03 AM   #3
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

It make a lot f sense, indeed. Thanx for you fast reply


Olivier
oliviera is offline   Reply With Quote
Old 03-22-2013, 10:37 AM   #4
fpenagarican
Junior Member
 
Location: Madison - Wisconsin

Join Date: Nov 2012
Posts: 5
Default

I completely agree with your answer. Now, I have a closely related question. I will start with a description of my problem in order to state precisely my question:

Firstly, I count the total number of alignments in my file.sam

total.alignments = wc -l file.sam

Secondly, I count the number of uniquely mapped reads in my file.sam

uniquely.mapped.reads = cut -f 1 file.sam | sort | uniq -u | wc -l

Interestingly, alignment_not_unique (in the output of htseq-count) is the difference of these two numbers,

alignment_not_unique = total.alignments - uniquely.mapped.reads

Thirdly, I count the total number of reads reported by htseq-count excluding alignment_not_unique

count.htseq = inside genes (sum of counts) + no_feature + ambiguous + too_low_aQual + not_aligned

Now, I thought that this number should be equal to the number of uniquely mapped reads BUT it is not - actually count.htseq is smaller than uniquely.mapped.genes.

count.htseq < uniquely.mapped.reads

Why is this? Could you explain me?

Thanks in advance.
fpenagarican is offline   Reply With Quote
Old 03-22-2013, 10:53 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Three possibilities

- You have paired-end data, and some reads are missing their mates. (Shouldn't happen but does with some aligners.)

- You have a chromosome with no genes on it (often the mitochondrion, which tends to be missing in GTF files). The reads from this chromosome should be added to no_feature but weren't until a week ago, when I finally fixed this bug.

- Another bug that I haven't figured out yet.
Simon Anders is offline   Reply With Quote
Old 03-22-2013, 11:03 AM   #6
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Smaller by how much? A lot or only a few lines?
chadn737 is offline   Reply With Quote
Old 03-22-2013, 11:29 AM   #7
fpenagarican
Junior Member
 
Location: Madison - Wisconsin

Join Date: Nov 2012
Posts: 5
Default

Thanks for the answers. (1) I am working with single-end reads and (2) I do not have chrM in my GTF file. Anyway, it seems that the difference is important - more than 3M reads.

These are the numbers:

total alignments in file.sam = 19,609,136

uniquely mapped reads = 13,804,648

Here is the output of htseq-count:

inside genes (sum of counts) = 7,579,921
no_feature = 3,037,647
ambiguous = 113513
too_low_aQual = 0
not_aligned = 0
alignment_not_unique = 5,804,488

As I said before, alignment_not_unique = total alignments - uniquely mapped reads = 19,609,136 - 13,804,648 = 5,804,488

Now, inside genes + no_feature + ambiguos = 10,731,081 and uniquely mapped reads are 13,804,648

By the way, I am using htseq-count version 0.5.3p9.

What do you think? Thanks again.
fpenagarican is offline   Reply With Quote
Old 03-22-2013, 11:34 AM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Strange. If you want to get to the bottom of this, use the '--samout' option. This writes out a SAM file which should contain each line fo the initial SAM file, with an extra fielded added indicating how the line was counted. And then you could compare the original and the new SAM file. Same number of lines? Does the number of lines containing the word "no_feature" fit? Etc. Thanks in advance if you have time to check this.
Simon Anders is offline   Reply With Quote
Old 03-22-2013, 11:59 AM   #9
fpenagarican
Junior Member
 
Location: Madison - Wisconsin

Join Date: Nov 2012
Posts: 5
Default

Thanks Simon. I will check this issue. By the way, you told me that you had fixed the bug related to chromosomes with no genes. When are you planing to release the new version? I would like to use it and see what happens. Thanks again.
fpenagarican is offline   Reply With Quote
Old 03-22-2013, 12:07 PM   #10
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

The fix is in v0.5.4p1, which I uploaded to PyPI two weeks ago.
Simon Anders is offline   Reply With Quote
Old 03-22-2013, 01:15 PM   #11
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by Simon Anders View Post
The fix is in v0.5.4p1, which I uploaded to PyPI two weeks ago.
I'm sorry to thread-jack this but I found that v0.5.4.1p1 does not work with python 2.5 (tested version is 2.5.2). Trying to run htseq-count produced the following error:

Code:
htseq-count -h
Traceback (most recent call last):
  File "/usr/local/bin/htseq-count", line 5, in <module>
    pkg_resources.run_script('HTSeq==0.5.4p1', 'htseq-count')
  File "build/bdist.linux-i686/egg/pkg_resources.py", line 489, in run_script
  File "build/bdist.linux-i686/egg/pkg_resources.py", line 1207, in run_script
  File "/usr/local/lib/python2.5/site-packages/HTSeq-0.5.4p1-py2.5-linux-x86_64.egg/EGG-INFO/scripts/htseq-count", line 3, in <module>
    import HTSeq.scripts.count
  File "/usr/local/lib/python2.5/site-packages/HTSeq-0.5.4p1-py2.5-linux-x86_64.egg/HTSeq/__init__.py", line 836
    except ValueError as e:
                       ^
SyntaxError: invalid syntax
Google taught me that python 2.5 doesn't have the 'as' keyword. Changing line 836 to:

Code:
        except ValueError, e:
appears to have fixed the problem but I have not fully tested it yet.
kmcarr is offline   Reply With Quote
Old 03-24-2013, 06:57 PM   #12
fpenagarican
Junior Member
 
Location: Madison - Wisconsin

Join Date: Nov 2012
Posts: 5
Default

Simon:

Just to let you know that I have tried with the new version of htseq-count (v0.5.4.1p1) and now everything works great, i.e.

inside genes (sum of counts) + no_feature + ambiguos + too_low_aQual + not_aligned = uniquely mapped reads.

Thanks again.
fpenagarican is offline   Reply With Quote
Old 03-24-2013, 11:10 PM   #13
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Thanks, that's reassuring.
Simon Anders is offline   Reply With Quote
Old 07-25-2013, 06:50 AM   #14
abebe
Junior Member
 
Location: Cedar Falls, IA

Join Date: Dec 2011
Posts: 5
Default

I am interested to know the identity of reads that map/do not map in a sam file. Say I got the following numbers when I run HTSeq:

total number of reads = 3,771,311
no_feature = 123,330
ambiguous = 100,500
too_low_aQual = 0
not_aligned = 0
alignment_not_unique = 100,000

How do I extract IDs and the corresponding sequences (in fastq) for reads that map/do not map from the sam file generated by HTSeq?

As always your help is appreciated.

Thanks.

Tilahun
abebe is offline   Reply With Quote
Old 07-25-2013, 06:56 AM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

use the "--samout" option
Simon Anders is offline   Reply With Quote
Old 07-26-2013, 04:32 AM   #16
abebe
Junior Member
 
Location: Cedar Falls, IA

Join Date: Dec 2011
Posts: 5
Default

Simon,

Thanks. The "--samout" option works great. I can see each read with its assignment to a feature with the tag ‘XF’. My next question is how do I extract reads with no_feature, ambiguous, and alignment_not_unique from the rest of the mapped reads?

Thanks.

Tilahun
abebe is offline   Reply With Quote
Old 07-26-2013, 04:37 AM   #17
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by abebe View Post
Simon,

Thanks. The "--samout" option works great. I can see each read with its assignment to a feature with the tag ‘XF’. My next question is how do I extract reads with no_feature, ambiguous, and alignment_not_unique from the rest of the mapped reads?

Thanks.

Tilahun
Just use grep.
dpryan is offline   Reply With Quote
Old 07-26-2013, 07:33 AM   #18
abebe
Junior Member
 
Location: Cedar Falls, IA

Join Date: Dec 2011
Posts: 5
Default

dpryan and Simon,

Thanks for your help. A combination of samout and grep solved my questions.

Keep up the good work.

Cheers.

Tilahun
abebe is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:57 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO