SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count paolo.kunder Bioinformatics 10 10-22-2014 05:45 AM
Issue with htseq-count cpleis Bioinformatics 8 10-15-2012 10:31 AM
HTseq count to DESeq (?) ThePresident Bioinformatics 18 09-06-2012 09:12 AM
multiBamCov or htseq-count to count read per feature ? NicoBxl Bioinformatics 1 07-03-2012 03:05 AM
Count difference by htseq-cout and samtools DZhang Bioinformatics 2 07-03-2011 12:05 PM

Reply
 
Thread Tools
Old 11-11-2012, 11:17 AM   #1
camelbbs
Member
 
Location: United States

Join Date: Jun 2011
Posts: 49
Default difference between htseq-count and bedtools multicov

Hi,

Anyone have used htseq-count and bedtools multicov, I use both them to get the read counts for ref genes. But I got very different read counts between them. Which one is more common..

bedtools htseq-count
uc001aaa.3 8 0
uc001aac.4 2309 0
uc001aae.4 753 0
uc001aah.4 2309 0
uc001aai.1 586 0
uc001aak.3 0 0
uc001aal.1 0 0
uc001aam.4 1315 6
camelbbs is offline   Reply With Quote
Old 11-11-2012, 11:41 PM   #2
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 404
Default

i'd guess it's due to the fact that htseq-count only reports one hit per aligned read assuming that read is aligned uniquely and does not overlap multiple features listed in your GTF. if an aligned read hits more than one feature in your GTF then it doesn't report that hit. bedtools gives you raw hits which includes every 1 hit for every intersection of every alignment with any features in the GTF no matter how many times it aligned or how many features it hit. you might think, "wow, htseq-count is dropping a lot of information". yes, it is! i've moved to using other tools to count hits to genes (RSEM/eXpress) since they disambiguate those ambiguous alignments and as a result you get counts for all of your aligned reads. in a genome with alternative splicing you lose too much data using htseq-count, in my opinion.
sdriscoll is offline   Reply With Quote
Old 11-18-2012, 11:44 PM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 979
Default

sdriscoll, te source of your problems is simply that you use annotation files that are unsuitable for htseq-count.

If a read alignes two several transcripts of the same gene, it is counted for this gene; only if it overlaps with different genes, it is dropped.

To this end, your GTF file needs to indicate which transcripts belong to the same gene. The GTF standard specifies that exon lines from two transcripts of the same gene should have the same 'gene_id' but different 'transcript_id' values. Unfortunately, the Table Browser function of the UCSC Genome Browser returns faulty GTF files that just copy the transcript ID into the gene ID field, and with these, htseq-count does not work.

So, please don't use the UCSC Table Browser. (They explained on previous occasion that they are unable to fix this bug, which, in my opinion, is rather serious.)
Simon Anders is offline   Reply With Quote
Old 11-19-2012, 12:09 PM   #4
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 404
Default

Hi Simon - you're right about that. If you use the UCSC GTFs straight away then your program won't work well at all. I was being a little dramatic in my post however even if you have the gene names inserted in your GTF, which I always do, there's still a large number of multi-gene loci where information can be lost because of the unique gene_id requirement in htseq-count. With a quick loci clustering and a search for those with multiple gene names (using UCSC's knowngene annotation for mouse mm9) there's over 2200 loci with more than one gene_id which is close to 10% of the annotated loci in the mouse genome. I don't work with human much but I suspect there's more overlap there. Maybe not all of those multi-gene loci share all of their exons with each other but some of them do. For example there are loci like CC1/Rb1cc1 or Sgk3/Cisk. One would lose quite a few reads in those loci. So a bit more work would have to be done to one's annotation to ensure that they aren't losing too much information with htseq-count. For example htseq-count works beautifully if you collapse these loci into exon-unions. I've written my own read counters over the years and what seems to work best for me is to allow reads to be assigned to multiple features but to weight those reads by 1/(number of features). I could futher restrict that to only allowing those reads to be ones that are shared across features within the same locus but if I'm working with alignments that don't inlcude multi-aligned reads then that doesn't matter. This way I get as many reads assigned to features as possible and when I collapse the loci down into "genes" I can just add the features' counts. In the end I get the same counts as I would have by using a modifed GTF with overlapped features collapsed into unions without having to generate that extra GTF file.
sdriscoll is offline   Reply With Quote
Old 11-20-2012, 01:15 AM   #5
syfo
Just a member
 
Location: Southern EU

Join Date: Nov 2012
Posts: 103
Default

Sdriscoll, I think that the main issue here is not due to "genuinely" overlapping genes but to the bug Simon mentioned: in a raw GTF from the UCSC table browser all the transcripts have a different "gene_id" whether they come from the same locus or not.
I count 55,419 different identifiers in the "gene_id" field of the UCSC knownGene GTF file for mm9, not 2,200.
syfo is offline   Reply With Quote
Old 11-20-2012, 02:01 AM   #6
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 404
Default

That's correct. I however am not working with the raw Gtf to get that number. I swap in the real gene symbols into the gene_id field. I also ran an analysis of the entire gtf collapsing different transcript ids down into loci where different transcripts share some sequence between them. I actually just use gffread (from Cufflinks) for that. Then parsing that result I find >2200 loci with multiple gene names, and obviously multiple transcript ids. If you look at the gene loci I mentioned in the UCSC genome browser for mm9 you can see what i mean.
sdriscoll is offline   Reply With Quote
Old 11-20-2012, 02:19 AM   #7
syfo
Just a member
 
Location: Southern EU

Join Date: Nov 2012
Posts: 103
Default

Yes, replacing UCSC gene_id by NCBI gene symbols is the good move. Interesting to see that still more than 10% of the historically annotated loci overlap with at least another one in the mouse genome, even when considering "real gene symbols".
syfo is offline   Reply With Quote
Old 11-20-2012, 05:07 AM   #8
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

The UCSC gene name substitution is a problem I've run into before too.
Here's a blog post I found that offers a solution for anyone who's interested:
http://compbionovice.blogspot.com/20...ranscript.html
pbluescript is offline   Reply With Quote
Old 11-20-2012, 10:50 AM   #9
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 404
Default

maybe it's just me but i've always been annoyed with official gene symbols. from the start i've never thought they were reliable to base any part of my codes around so when i've written analysis tools for my lab i use more checks to make sure my code knows what it's looking at and so i don't confuse anybody. from a programmer's point of view it's just a design constraint and one less assumption that can be made along the way. i end up dealing with a lot of different annotations, sometimes custom made, and different genomes and it's easier for me if my tools don't put too much requirement on the naming conventions of the annotation.

i'm only pointing out the fact that even if you're working with official gene symbols as the 'gene_id' field then a counter like htseq-count will end up throwing out some reads from many loci where all that's actually happening is some illogical historical naming issues. that's not the researcher's fault. some of those locations in the genome are overlap of adjacent genes but sometimes it's just what looks like a typical alternatively spliced gene locus but for some reason not all of the isoforms have the same gene name. those are the ones that are the issue. it's a minor point - yes if you have the right annotation it's irrelevant.

the human genome is even more complex. for example this SMN2 locus shows what I'm talking about.

http://genome.ucsc.edu/cgi-bin/hgTra...knownGene=pack

looks like your typical alternatively spliced locus but it houses SMN1 and SMN2. UCSC's track and the Refseq track both show this naming issue.
sdriscoll is offline   Reply With Quote
Old 11-21-2012, 01:13 AM   #10
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 979
Default

I never really ran into these problems myself and I wonder if this might be because I only use Ensembl annotation. Ensembl changes their gene IDs quite often so that you have to remap these ENBG000... IDs to HGNC symbols anew for each Ensembl release but the advantage is that this gives the Ensembl people the flexibility to change easily what they consider one gene. This might explain why I feel that the described issue with overlapping genes is somehow avoided in Ensembl annotation
Simon Anders is offline   Reply With Quote
Old 11-21-2012, 09:13 AM   #11
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 404
Default

That explains it! Thanks for pointing that out. It's unfortunate that the annotations are so lame. Cufflinks can actually help with this but people still will need to parse the gtf to build a translation table. You can use cuffcompare to compare the gtf to itself and that produces cuffcmp.combined.gtf. In that file gene ids are replaced with XLOC ids which are unique to loci. The gene names are set in the gene_name field, if they were in your gtf to start. You can also use the gffread program that comes packaged with Cufflinks. It has a --cluster-only option that adds a locus attribute to each gtf row grouping the features into unique loci. One could specify that attribute when running htseq-count and then translate the locus ids bak to gene names. I think ideally the counting script should do the clustering or at least provide the option. Then the annotation doesn't have to be perfect.
sdriscoll is offline   Reply With Quote
Old 01-14-2013, 07:17 AM   #12
wanfahmi
Member
 
Location: North Sea

Join Date: Apr 2008
Posts: 26
Default

How can I convert the ENSG00000261838 to real name of genes? Or is there any parameter do I need to use prior using htseq-count? Thank you

Quote:
ENSG00000261838 0
ENSG00000261839 2
ENSG00000261840 3
ENSG00000261841 0
ENSG00000261842 0
no_feature 985727
ambiguous 435068
too_low_aQual 0
not_aligned 0
alignment_not_unique 778451
wanfahmi is offline   Reply With Quote
Old 01-30-2013, 10:50 PM   #13
jingerlu
Junior Member
 
Location: los angeles

Join Date: Mar 2011
Posts: 4
Default

usually , I go to ensemble(www.ensembl.org) => bioMart . CHOOSE DATABSE ( which is Ensemble gene 70 currently ) , CHOOSE DATASET (which is Homo sapiens for Human), once these two steps done , the left side of the webpage
will show Dataset / filter / Attributes , click Attributes (check Features) , expand "gene" , check "Ensembl gene id" "Associated gene name" "description" or other features you want . Click "Result " which is at the top left side of the webpage. you can "Export all result to " there .

Quote:
Originally Posted by wanfahmi View Post
How can I convert the ENSG00000261838 to real name of genes? Or is there any parameter do I need to use prior using htseq-count? Thank you
jingerlu is offline   Reply With Quote
Old 03-06-2013, 07:16 PM   #14
zaki
Member
 
Location: Malaysia

Join Date: Dec 2012
Posts: 15
Default

Hi members,

Reviving an old topic... I am bewildered about htseq-count output. The endpoint is I will be using DEseq for differential expression, but I am confused with the htseq-count results.

I compared the htseq-count results with what I view in IGV... and I see very inconsistent things which I can not explain.. I was hoping if anyone can give an insight into this matter..

Quick overview on what I did
Mapped the paired end .fastq file to a reference transcription using tophat2
[CODE]"tophat2 -G <transcript.gtf> -o <outputdir> --no-novel-juncs <bowtie2 indexed genome> <.fastq1> <.fastq2> [CODE]

Following this I converted the resulting .bam file to .sam in order to proceed with read counting using htseq-count
#Converting .bam to .sam
Code:
samtools view -h <bam> > <sam>
#Sorting the .sam (Would be great if anyone can tell me if this is the correct method or not, as sometimes it returns some error - I also asked here http://seqanswers.com/forums/showthr...t=12376&page=2)
Code:
sort -k1 <sam> > <sorted.sam>
#Running htseq-count
Code:
htseq-count -m intersection-strict -s no <sorted.sam> <transripts.gtf> > <result_count>
Above is the general code I used, I also tried running htseq-count with -m intersection-nonempty.

The problem I have now is when I look at the gene count and IGV it gives confusing statement

Location 1
Result from htseq-count gives - 14 (regardless which htseq-count mode)
Image from IGV


Location 2
Result from htseq-count gives - 7 (regardless which htseq-count mode)
Image from IGV


As you can see, at Location 2 has more reads, but how come the counts number only is 7??

Could it be to do with the .gtf file I am using as mentioned by
Quote:
Originally Posted by sdriscoll View Post
i'd guess it's due to the fact that htseq-count only reports one hit per aligned read assuming that read is aligned uniquely and does not overlap multiple features listed in your GTF. if an aligned read hits more than one feature in your GTF then it doesn't report that hit. bedtools gives you raw hits which includes every 1 hit for every intersection of every alignment with any features in the GTF no matter how many times it aligned or how many features it hit. you might think, "wow, htseq-count is dropping a lot of information". yes, it is! i've moved to using other tools to count hits to genes (RSEM/eXpress) since they disambiguate those ambiguous alignments and as a result you get counts for all of your aligned reads. in a genome with alternative splicing you lose too much data using htseq-count, in my opinion.
Or its because location 2 has tons of non-unique mapping and the read counts are very low??

Eventually if I would be doing the differential expression on DESeq, would this give me a false result at location 2 since there is only 7 count but there are much more reads??

Appreciate any feedback

Can anyone help explain why is there this discrepancy between IGV and count data??
zaki is offline   Reply With Quote
Old 03-06-2013, 11:05 PM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 979
Default

Please take a few of these many reads in location 2, hover your mouse over them in IGV to get the read name, then grep for the read name to find the corresponding lines in the SAM file. Maybe they are all non-unique mapping reads.

The fact that htseq-count does not count these reads is deliberate, of course. See earlier threads for an explanation.
Simon Anders is offline   Reply With Quote
Old 03-07-2013, 08:26 PM   #16
zaki
Member
 
Location: Malaysia

Join Date: Dec 2012
Posts: 15
Default

Quote:
Originally Posted by Simon Anders View Post
Please take a few of these many reads in location 2, hover your mouse over them in IGV to get the read name, then grep for the read name to find the corresponding lines in the SAM file. Maybe they are all non-unique mapping reads.

The fact that htseq-count does not count these reads is deliberate, of course. See earlier threads for an explanation.
Many thanks Simon! I think I need to get my basic understanding of mapping solid.

Forgive my naivety, If multi-mapping becomes an issue when we are translating mapping reads to count data (expression level)...what than would be an acceptable standard to extract count data?? (is it using htseq-count ,bedtools, cuffdiff)

Also....during the mapping stage..what bias would be introduced, if we discard multimap reads?


Edit - you are correct in saying that the reads at location 2 is multimapped, many thanks for the insight!
zaki is offline   Reply With Quote
Old 03-07-2013, 11:47 PM   #17
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 979
Default

See my post #4 in this thread: http://seqanswers.com/forums/showthread.php?t=9129
Simon Anders is offline   Reply With Quote
Old 03-12-2013, 08:39 AM   #18
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 38
Default

Dear Zaki,

You ask an excellent question. This paper answers it in detail:
http://www.nature.com/nbt/journal/v3.../nbt.2450.html
Lior
lpachter is offline   Reply With Quote
Old 04-01-2013, 12:45 AM   #19
zaki
Member
 
Location: Malaysia

Join Date: Dec 2012
Posts: 15
Default

Thanks Simon and Lior for the suggested readings!

Taking a step back and reading the fundamentals behind each approach is really essential. I think its something I overlooked at first.

Since the protocol and vignette of each approach is very well described and easy to follow, I guess I jumped the gun and started running commands without understanding much behind it (I think I still have a lot yet to understand... statistics...mapping..biology...etc...)

Thanks again for your comments and direction! It helped a lot.



Zaki

Last edited by zaki; 04-01-2013 at 01:47 AM.
zaki is offline   Reply With Quote
Old 07-17-2013, 08:13 AM   #20
Jose Garcia
Junior Member
 
Location: Milan

Join Date: Jul 2013
Posts: 4
Default

Quote:
Originally Posted by Simon Anders View Post
Please take a few of these many reads in location 2, hover your mouse over them in IGV to get the read name, then grep for the read name to find the corresponding lines in the SAM file. Maybe they are all non-unique mapping reads.

The fact that htseq-count does not count these reads is deliberate, of course. See earlier threads for an explanation.
Indeed. I was having the same problem. I was getting crazy because I had already checked what Simon pointed about the possibility of having non-unique mapping reads. I think I have found out why.
If you use htseq-count with -i gene_id, it will construct the set where to add up reads based on the ENSEMBL GENE ID, which will sum reads that could come from different ENSEMBL transcript ID. You may lose the information about transcripts at this level (maybe you can use next DEXSeq approach to look inside...) but at least, you do not lose reads mapping to different identifiers(transcript id), as was my case because I used the UCSC gtf file where gene_id=trascritpt_id.

However, I still encountered IGV regions in my BAM files with many reads that htseq count anyhow flagged as ambigous, but were not multi aligned. I think I have found out why by looking at the locus with the ensembl genome browser and looking for the "gene_name", that is, the SYMBOL. I realized that the Ensembl annotation (present in the GTF) contained for the same locus, with the same gene_name (SYMBOL), two different ENSEMBLGENE Ids for what it called: "protein coding" and "non-sense-mediated decay", as gene biotype. Both lying in the same locus with high overlap. I guessed then that htseq-count had found reads aligning to both ENSEMBLGENE identifiers and discarded them as ambiguous.

While discarding reads mapping to different locus or different genes which overlap (hence also different gene Symbols) may have sense, I do not see the sense in discarding reads merely because we have a maybe too exhaustive overlapping annotation. We might also have other features like this, antisense ncRNAs, pseudogenes, for instance, which may have a big overlapping with its parental, and hence, all counts in regions with the same sequence will be discarded.
For the pseudogenes, unfortunately, if they have a huge overlap, all those reads will be discarded by htseq-count -i gene_name either, but if we are not interested in looking at pseudogenes we could choose to use a more limited annotation like RefSeq.

So, what do you think of using as a global strategy with htseq-count when interested in having also all non coding, pseudogenes et. al a command line like this:

htseq-count -m union -t exon -i gene_name - $GTF > $out

It is always possible to try and deconvolve counts thereafter,
What do you think about it?
Thanks
Best
Jose

Last edited by Jose Garcia; 07-17-2013 at 08:28 AM.
Jose Garcia is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:27 AM.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.