Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • total number of counts with HTseq

    Dear all,

    I use HTseq with paired end data and would like to calculate the pourcentage of reads that do fall into an annotated genes. To do this I thought to add all count inside genes + no_feature + ambigous + too_low_qual + not aligned + not_unique. THe problem is that I thought this number should be equal to the number of reads inside my original BAM/SAM file, but this is not the case. Can someone explain me how to get these numbers?

    Here are the details from HTseq:
    no_feature 6084302
    ambiguous 1921332
    too_low_aQual 0
    not_aligned 0
    alignment_not_unique 9401567
    inside genes (sum of counts): 74046502
    Total= 91453703

    Number of alignment in my SAM file (paired end)= 160754220

  • #2
    This category here:

    alignment_not_unique 9401567

    You have over 9 million reads without unique alignments. These reads map at a minimum of 2 places, maybe more. So at the minimum, the category would add twice as many alignments, if not more.

    What you got to remember is that your bam/sam file contains alignments, not reads. You can have multiple alignments per read and these will each be a different line. Htseq-count counts reads, not alignments. So the total will always be smaller than the number of alignments in your bam/sam.

    Comment


    • #3
      It make a lot f sense, indeed. Thanx for you fast reply


      Olivier

      Comment


      • #4
        I completely agree with your answer. Now, I have a closely related question. I will start with a description of my problem in order to state precisely my question:

        Firstly, I count the total number of alignments in my file.sam

        total.alignments = wc -l file.sam

        Secondly, I count the number of uniquely mapped reads in my file.sam

        uniquely.mapped.reads = cut -f 1 file.sam | sort | uniq -u | wc -l

        Interestingly, alignment_not_unique (in the output of htseq-count) is the difference of these two numbers,

        alignment_not_unique = total.alignments - uniquely.mapped.reads

        Thirdly, I count the total number of reads reported by htseq-count excluding alignment_not_unique

        count.htseq = inside genes (sum of counts) + no_feature + ambiguous + too_low_aQual + not_aligned

        Now, I thought that this number should be equal to the number of uniquely mapped reads BUT it is not - actually count.htseq is smaller than uniquely.mapped.genes.

        count.htseq < uniquely.mapped.reads

        Why is this? Could you explain me?

        Thanks in advance.

        Comment


        • #5
          Three possibilities

          - You have paired-end data, and some reads are missing their mates. (Shouldn't happen but does with some aligners.)

          - You have a chromosome with no genes on it (often the mitochondrion, which tends to be missing in GTF files). The reads from this chromosome should be added to no_feature but weren't until a week ago, when I finally fixed this bug.

          - Another bug that I haven't figured out yet.

          Comment


          • #6
            Smaller by how much? A lot or only a few lines?

            Comment


            • #7
              Thanks for the answers. (1) I am working with single-end reads and (2) I do not have chrM in my GTF file. Anyway, it seems that the difference is important - more than 3M reads.

              These are the numbers:

              total alignments in file.sam = 19,609,136

              uniquely mapped reads = 13,804,648

              Here is the output of htseq-count:

              inside genes (sum of counts) = 7,579,921
              no_feature = 3,037,647
              ambiguous = 113513
              too_low_aQual = 0
              not_aligned = 0
              alignment_not_unique = 5,804,488

              As I said before, alignment_not_unique = total alignments - uniquely mapped reads = 19,609,136 - 13,804,648 = 5,804,488

              Now, inside genes + no_feature + ambiguos = 10,731,081 and uniquely mapped reads are 13,804,648

              By the way, I am using htseq-count version 0.5.3p9.

              What do you think? Thanks again.

              Comment


              • #8
                Strange. If you want to get to the bottom of this, use the '--samout' option. This writes out a SAM file which should contain each line fo the initial SAM file, with an extra fielded added indicating how the line was counted. And then you could compare the original and the new SAM file. Same number of lines? Does the number of lines containing the word "no_feature" fit? Etc. Thanks in advance if you have time to check this.

                Comment


                • #9
                  Thanks Simon. I will check this issue. By the way, you told me that you had fixed the bug related to chromosomes with no genes. When are you planing to release the new version? I would like to use it and see what happens. Thanks again.

                  Comment


                  • #10
                    The fix is in v0.5.4p1, which I uploaded to PyPI two weeks ago.

                    Comment


                    • #11
                      Originally posted by Simon Anders View Post
                      The fix is in v0.5.4p1, which I uploaded to PyPI two weeks ago.
                      I'm sorry to thread-jack this but I found that v0.5.4.1p1 does not work with python 2.5 (tested version is 2.5.2). Trying to run htseq-count produced the following error:

                      Code:
                      htseq-count -h
                      Traceback (most recent call last):
                        File "/usr/local/bin/htseq-count", line 5, in <module>
                          pkg_resources.run_script('HTSeq==0.5.4p1', 'htseq-count')
                        File "build/bdist.linux-i686/egg/pkg_resources.py", line 489, in run_script
                        File "build/bdist.linux-i686/egg/pkg_resources.py", line 1207, in run_script
                        File "/usr/local/lib/python2.5/site-packages/HTSeq-0.5.4p1-py2.5-linux-x86_64.egg/EGG-INFO/scripts/htseq-count", line 3, in <module>
                          import HTSeq.scripts.count
                        File "/usr/local/lib/python2.5/site-packages/HTSeq-0.5.4p1-py2.5-linux-x86_64.egg/HTSeq/__init__.py", line 836
                          except ValueError as e:
                                             ^
                      SyntaxError: invalid syntax
                      Google taught me that python 2.5 doesn't have the 'as' keyword. Changing line 836 to:

                      Code:
                              except ValueError, e:
                      appears to have fixed the problem but I have not fully tested it yet.

                      Comment


                      • #12
                        Simon:

                        Just to let you know that I have tried with the new version of htseq-count (v0.5.4.1p1) and now everything works great, i.e.

                        inside genes (sum of counts) + no_feature + ambiguos + too_low_aQual + not_aligned = uniquely mapped reads.

                        Thanks again.

                        Comment


                        • #13
                          Thanks, that's reassuring.

                          Comment


                          • #14
                            I am interested to know the identity of reads that map/do not map in a sam file. Say I got the following numbers when I run HTSeq:

                            total number of reads = 3,771,311
                            no_feature = 123,330
                            ambiguous = 100,500
                            too_low_aQual = 0
                            not_aligned = 0
                            alignment_not_unique = 100,000

                            How do I extract IDs and the corresponding sequences (in fastq) for reads that map/do not map from the sam file generated by HTSeq?

                            As always your help is appreciated.

                            Thanks.

                            Tilahun

                            Comment


                            • #15
                              use the "--samout" option

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X