Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • biofreak
    Member
    • Jun 2011
    • 44

    Htseq-count Vs CountOverlap function in IRanges

    Hi All,
    I computed the raw read counts for one of my SAM files using htdeq-count program as well as using 'countOverlaps' function in IRanges package in R.

    CountOverlaps function gives me the read counts for 23123 gene ids.
    Output of htseq gives read counts for 35886 refseq ids. (NM as well as NR ids).
    After matching the common genes (from CountOverlaps output), I got 32222 genes total from htseq-counts function. I observed that the gene ids were not unique. (21423 unique gene ids.)

    The computations were comparable if not same for some of the genes. However, for significant portion of the genes, the raw read counts did not match between the two outputs.
    What result is more accurate? Should I sum up the count for same gene id (for htseq-count)?
    Can someone please help? I have been stuck with this issue for days now...
    thanks a lot.
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    htseq-count gives you one count for each gene ID. I cannot imagine how you could have managed to get more than one count value per ID. (I can, however, see, how one could use countOverlaps in a way that gives multiple count values per ID.) Please give more details on what you did.

    Also, both methods should give you a perfectly accurate result; the difference is because they use different rules for special cases. You will have to read the rules and figure out what is more appropriate for your use case.

    Comment

    • biofreak
      Member
      • Jun 2011
      • 44

      #3
      thanks for replying.
      I gave the following command.
      python -m HTSeq.scripts.count ./tophat_out_SRR037447/accepted_hits.sam ./genomes/hg19RefGene.gtf -a 1 -i gene_id -o seqres37447filter > seq37447filter

      Oh BTW --minaqual option is not making any difference in the results. Maybe I am specifying it wrong?

      this was the outout:
      NM_000014 2234
      NM_000015 0
      NM_000016 0
      NM_000017 125
      NM_000018 15
      NM_000019 246
      NM_000020 0
      NM_000021 0
      NM_000022 489
      NM_000023 2

      I then mapped the NM numbers to gene IDs and observed multiple NM numbers for the same gene id. e.g.
      NM_005465 34
      NM_181690 2

      I do understand that it is normal. My question is should I add up the counts for that gene ID?
      Could you please help?
      Last edited by biofreak; 06-29-2011, 09:05 AM.

      Comment

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        You did not use a proper GTF file in Ensembl GTF format. In a proper GTF file, each line describing an exon has an attribute called 'gene_id', which gives the gene ID. All exons from the same gene (no matter which transcripts) must have the same gene ID.

        The idea of GTF files is that the information is on three levels. A gene (given by its gene ID) has several transcripts (with the same gene ID but different transcript IDs) , each of which has several exon lines. The UCSC table browser, for example, produces GTF file in which the gene ID is always the same as the transcript ID, i.e., it does not show which transcripts belong to the same gene. Obviously, this is not useful, and hence, htseq-count won't work with this.

        What you need to do is to replace, in your GTF file, the NM_number after 'gene ID' by the correct gene ID (which should have been there from the beginning, of course). Or you use a GTF file from Ensembl, which has proper gene IDs.

        You can use HTSeq to do this (if you know some Python).

        Comment

        • biofreak
          Member
          • Jun 2011
          • 44

          #5
          Thanks a lot Simon.
          I downloaded the gtf file from ensemble and ran the program again. It however gives me the following error for all the reads.
          Skipping read 'SRR037447.3320388', because chromosome 'chr20', to which it has been aligned, did not appear in the GFF file.

          Do I need to make any changes to the SAM file?

          I am trying to replcace NM numbers with gene ids in my previous gtf file.
          thanks a lot.

          Comment

          • Simon Anders
            Senior Member
            • Feb 2010
            • 995

            #6
            could be that the chromsome is called 'chr20' in your sam file and just '20' in the GTF file.

            Comment

            Latest Articles

            Collapse

            • SEQadmin2
              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by SEQadmin2


              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

              Here are nine questions we think about, in roughly the order they matter, before...
              06-18-2026, 07:11 AM
            • SEQadmin2
              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
              by SEQadmin2


              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
              ...
              06-02-2026, 10:05 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by SEQadmin2, Today, 11:10 AM
            0 responses
            6 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-17-2026, 06:09 AM
            0 responses
            41 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-09-2026, 11:58 AM
            0 responses
            102 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-05-2026, 10:09 AM
            0 responses
            123 views
            0 reactions
            Last Post SEQadmin2  
            Working...