SEQanswers

Go Back   SEQanswers > General



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count for sam and gff3 sofia17 RNA Sequencing 45 11-04-2016 03:32 PM
Strange error when using htseq-count shhuang Bioinformatics 13 11-19-2012 12:40 AM
htseq count missing last mapped fragment mapardo RNA Sequencing 2 11-08-2011 05:27 AM
Count difference by htseq-cout and samtools DZhang Bioinformatics 2 07-03-2011 11:05 AM
Raw readcounts for RNAseq data using CountOverlaps function in IRanges biofreak General 1 06-28-2011 01:32 PM

Reply
 
Thread Tools
Old 06-29-2011, 07:17 AM   #1
biofreak
Member
 
Location: USA

Join Date: Jun 2011
Posts: 44
Default Htseq-count Vs CountOverlap function in IRanges

Hi All,
I computed the raw read counts for one of my SAM files using htdeq-count program as well as using 'countOverlaps' function in IRanges package in R.

CountOverlaps function gives me the read counts for 23123 gene ids.
Output of htseq gives read counts for 35886 refseq ids. (NM as well as NR ids).
After matching the common genes (from CountOverlaps output), I got 32222 genes total from htseq-counts function. I observed that the gene ids were not unique. (21423 unique gene ids.)

The computations were comparable if not same for some of the genes. However, for significant portion of the genes, the raw read counts did not match between the two outputs.
What result is more accurate? Should I sum up the count for same gene id (for htseq-count)?
Can someone please help? I have been stuck with this issue for days now...
thanks a lot.
biofreak is offline   Reply With Quote
Old 06-29-2011, 08:17 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

htseq-count gives you one count for each gene ID. I cannot imagine how you could have managed to get more than one count value per ID. (I can, however, see, how one could use countOverlaps in a way that gives multiple count values per ID.) Please give more details on what you did.

Also, both methods should give you a perfectly accurate result; the difference is because they use different rules for special cases. You will have to read the rules and figure out what is more appropriate for your use case.
Simon Anders is offline   Reply With Quote
Old 06-29-2011, 08:42 AM   #3
biofreak
Member
 
Location: USA

Join Date: Jun 2011
Posts: 44
Default

thanks for replying.
I gave the following command.
python -m HTSeq.scripts.count ./tophat_out_SRR037447/accepted_hits.sam ./genomes/hg19RefGene.gtf -a 1 -i gene_id -o seqres37447filter > seq37447filter

Oh BTW --minaqual option is not making any difference in the results. Maybe I am specifying it wrong?

this was the outout:
NM_000014 2234
NM_000015 0
NM_000016 0
NM_000017 125
NM_000018 15
NM_000019 246
NM_000020 0
NM_000021 0
NM_000022 489
NM_000023 2

I then mapped the NM numbers to gene IDs and observed multiple NM numbers for the same gene id. e.g.
NM_005465 34
NM_181690 2

I do understand that it is normal. My question is should I add up the counts for that gene ID?
Could you please help?

Last edited by biofreak; 06-29-2011 at 09:05 AM.
biofreak is offline   Reply With Quote
Old 06-29-2011, 09:46 AM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

You did not use a proper GTF file in Ensembl GTF format. In a proper GTF file, each line describing an exon has an attribute called 'gene_id', which gives the gene ID. All exons from the same gene (no matter which transcripts) must have the same gene ID.

The idea of GTF files is that the information is on three levels. A gene (given by its gene ID) has several transcripts (with the same gene ID but different transcript IDs) , each of which has several exon lines. The UCSC table browser, for example, produces GTF file in which the gene ID is always the same as the transcript ID, i.e., it does not show which transcripts belong to the same gene. Obviously, this is not useful, and hence, htseq-count won't work with this.

What you need to do is to replace, in your GTF file, the NM_number after 'gene ID' by the correct gene ID (which should have been there from the beginning, of course). Or you use a GTF file from Ensembl, which has proper gene IDs.

You can use HTSeq to do this (if you know some Python).
Simon Anders is offline   Reply With Quote
Old 06-29-2011, 10:26 AM   #5
biofreak
Member
 
Location: USA

Join Date: Jun 2011
Posts: 44
Default

Thanks a lot Simon.
I downloaded the gtf file from ensemble and ran the program again. It however gives me the following error for all the reads.
Skipping read 'SRR037447.3320388', because chromosome 'chr20', to which it has been aligned, did not appear in the GFF file.

Do I need to make any changes to the SAM file?

I am trying to replcace NM numbers with gene ids in my previous gtf file.
thanks a lot.
biofreak is offline   Reply With Quote
Old 06-29-2011, 10:36 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

could be that the chromsome is called 'chr20' in your sam file and just '20' in the GTF file.
Simon Anders is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:26 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO