Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • oliviera
    Member
    • Apr 2010
    • 31

    total number of counts with HTseq

    Dear all,

    I use HTseq with paired end data and would like to calculate the pourcentage of reads that do fall into an annotated genes. To do this I thought to add all count inside genes + no_feature + ambigous + too_low_qual + not aligned + not_unique. THe problem is that I thought this number should be equal to the number of reads inside my original BAM/SAM file, but this is not the case. Can someone explain me how to get these numbers?

    Here are the details from HTseq:
    no_feature 6084302
    ambiguous 1921332
    too_low_aQual 0
    not_aligned 0
    alignment_not_unique 9401567
    inside genes (sum of counts): 74046502
    Total= 91453703

    Number of alignment in my SAM file (paired end)= 160754220
  • chadn737
    Senior Member
    • Jan 2009
    • 392

    #2
    This category here:

    alignment_not_unique 9401567

    You have over 9 million reads without unique alignments. These reads map at a minimum of 2 places, maybe more. So at the minimum, the category would add twice as many alignments, if not more.

    What you got to remember is that your bam/sam file contains alignments, not reads. You can have multiple alignments per read and these will each be a different line. Htseq-count counts reads, not alignments. So the total will always be smaller than the number of alignments in your bam/sam.

    Comment

    • oliviera
      Member
      • Apr 2010
      • 31

      #3
      It make a lot f sense, indeed. Thanx for you fast reply


      Olivier

      Comment

      • fpenagarican
        Junior Member
        • Nov 2012
        • 5

        #4
        I completely agree with your answer. Now, I have a closely related question. I will start with a description of my problem in order to state precisely my question:

        Firstly, I count the total number of alignments in my file.sam

        total.alignments = wc -l file.sam

        Secondly, I count the number of uniquely mapped reads in my file.sam

        uniquely.mapped.reads = cut -f 1 file.sam | sort | uniq -u | wc -l

        Interestingly, alignment_not_unique (in the output of htseq-count) is the difference of these two numbers,

        alignment_not_unique = total.alignments - uniquely.mapped.reads

        Thirdly, I count the total number of reads reported by htseq-count excluding alignment_not_unique

        count.htseq = inside genes (sum of counts) + no_feature + ambiguous + too_low_aQual + not_aligned

        Now, I thought that this number should be equal to the number of uniquely mapped reads BUT it is not - actually count.htseq is smaller than uniquely.mapped.genes.

        count.htseq < uniquely.mapped.reads

        Why is this? Could you explain me?

        Thanks in advance.

        Comment

        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          Three possibilities

          - You have paired-end data, and some reads are missing their mates. (Shouldn't happen but does with some aligners.)

          - You have a chromosome with no genes on it (often the mitochondrion, which tends to be missing in GTF files). The reads from this chromosome should be added to no_feature but weren't until a week ago, when I finally fixed this bug.

          - Another bug that I haven't figured out yet.

          Comment

          • chadn737
            Senior Member
            • Jan 2009
            • 392

            #6
            Smaller by how much? A lot or only a few lines?

            Comment

            • fpenagarican
              Junior Member
              • Nov 2012
              • 5

              #7
              Thanks for the answers. (1) I am working with single-end reads and (2) I do not have chrM in my GTF file. Anyway, it seems that the difference is important - more than 3M reads.

              These are the numbers:

              total alignments in file.sam = 19,609,136

              uniquely mapped reads = 13,804,648

              Here is the output of htseq-count:

              inside genes (sum of counts) = 7,579,921
              no_feature = 3,037,647
              ambiguous = 113513
              too_low_aQual = 0
              not_aligned = 0
              alignment_not_unique = 5,804,488

              As I said before, alignment_not_unique = total alignments - uniquely mapped reads = 19,609,136 - 13,804,648 = 5,804,488

              Now, inside genes + no_feature + ambiguos = 10,731,081 and uniquely mapped reads are 13,804,648

              By the way, I am using htseq-count version 0.5.3p9.

              What do you think? Thanks again.

              Comment

              • Simon Anders
                Senior Member
                • Feb 2010
                • 995

                #8
                Strange. If you want to get to the bottom of this, use the '--samout' option. This writes out a SAM file which should contain each line fo the initial SAM file, with an extra fielded added indicating how the line was counted. And then you could compare the original and the new SAM file. Same number of lines? Does the number of lines containing the word "no_feature" fit? Etc. Thanks in advance if you have time to check this.

                Comment

                • fpenagarican
                  Junior Member
                  • Nov 2012
                  • 5

                  #9
                  Thanks Simon. I will check this issue. By the way, you told me that you had fixed the bug related to chromosomes with no genes. When are you planing to release the new version? I would like to use it and see what happens. Thanks again.

                  Comment

                  • Simon Anders
                    Senior Member
                    • Feb 2010
                    • 995

                    #10
                    The fix is in v0.5.4p1, which I uploaded to PyPI two weeks ago.

                    Comment

                    • kmcarr
                      Senior Member
                      • May 2008
                      • 1181

                      #11
                      Originally posted by Simon Anders View Post
                      The fix is in v0.5.4p1, which I uploaded to PyPI two weeks ago.
                      I'm sorry to thread-jack this but I found that v0.5.4.1p1 does not work with python 2.5 (tested version is 2.5.2). Trying to run htseq-count produced the following error:

                      Code:
                      htseq-count -h
                      Traceback (most recent call last):
                        File "/usr/local/bin/htseq-count", line 5, in <module>
                          pkg_resources.run_script('HTSeq==0.5.4p1', 'htseq-count')
                        File "build/bdist.linux-i686/egg/pkg_resources.py", line 489, in run_script
                        File "build/bdist.linux-i686/egg/pkg_resources.py", line 1207, in run_script
                        File "/usr/local/lib/python2.5/site-packages/HTSeq-0.5.4p1-py2.5-linux-x86_64.egg/EGG-INFO/scripts/htseq-count", line 3, in <module>
                          import HTSeq.scripts.count
                        File "/usr/local/lib/python2.5/site-packages/HTSeq-0.5.4p1-py2.5-linux-x86_64.egg/HTSeq/__init__.py", line 836
                          except ValueError as e:
                                             ^
                      SyntaxError: invalid syntax
                      Google taught me that python 2.5 doesn't have the 'as' keyword. Changing line 836 to:

                      Code:
                              except ValueError, e:
                      appears to have fixed the problem but I have not fully tested it yet.

                      Comment

                      • fpenagarican
                        Junior Member
                        • Nov 2012
                        • 5

                        #12
                        Simon:

                        Just to let you know that I have tried with the new version of htseq-count (v0.5.4.1p1) and now everything works great, i.e.

                        inside genes (sum of counts) + no_feature + ambiguos + too_low_aQual + not_aligned = uniquely mapped reads.

                        Thanks again.

                        Comment

                        • Simon Anders
                          Senior Member
                          • Feb 2010
                          • 995

                          #13
                          Thanks, that's reassuring.

                          Comment

                          • abebe
                            Junior Member
                            • Dec 2011
                            • 5

                            #14
                            I am interested to know the identity of reads that map/do not map in a sam file. Say I got the following numbers when I run HTSeq:

                            total number of reads = 3,771,311
                            no_feature = 123,330
                            ambiguous = 100,500
                            too_low_aQual = 0
                            not_aligned = 0
                            alignment_not_unique = 100,000

                            How do I extract IDs and the corresponding sequences (in fastq) for reads that map/do not map from the sam file generated by HTSeq?

                            As always your help is appreciated.

                            Thanks.

                            Tilahun

                            Comment

                            • Simon Anders
                              Senior Member
                              • Feb 2010
                              • 995

                              #15
                              use the "--samout" option

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              22 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              28 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              39 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              61 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...