#1 |
Member
Location: United States Join Date: Jun 2011
Posts: 49
Hi,
Can I ask how to get a GTF file for TopHat or Cufflinks? I just used the UCSC Table Browser to get a GTF file; its content looks like this:
Code:
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
Thanks, Peter
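For anyone who wants to inspect a record like the one above, here is a minimal Python sketch that splits a GTF line into its nine columns and checks whether `gene_id` and `transcript_id` carry the same value. The function name `parse_gtf_line` is made up for illustration and is not part of TopHat or Cufflinks. Real GTF files are tab-separated; the sketch splits on any whitespace so it also works on text pasted from a forum.

```python
import re

def parse_gtf_line(line):
    """Split one GTF record into its eight fixed fields plus an attribute dict.

    Hypothetical helper for illustration only.
    """
    # maxsplit=8 keeps the attribute column (column 9) in one piece.
    parts = line.strip().split(None, 8)
    if len(parts) != 9:
        raise ValueError("expected 9 GTF columns, got %d" % len(parts))
    seqname, source, feature, start, end, score, strand, frame, attr_text = parts
    # Attributes look like: key "value"; key "value";
    attributes = dict(re.findall(r'(\S+)\s+"([^"]*)"', attr_text))
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based, inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }

line = ('chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . '
        'gene_id "NM_032291"; transcript_id "NM_032291";')
record = parse_gtf_line(line)

# If both IDs are identical, the file has no separate gene-level identifier.
same_id = record["attributes"]["gene_id"] == record["attributes"]["transcript_id"]
print(record["feature"], record["start"], record["end"], "gene_id == transcript_id:", same_id)
```

In the example line both attributes hold the RefSeq accession, so anything grouped by `gene_id` would effectively be grouped per transcript.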
#2 |
Member
Location: University Park, PA Join Date: Apr 2008
Posts: 27
I recommend downloading the GTF annotation from this page:
http://cufflinks.cbcb.umd.edu/igenomes.html
These files were designed to go with TopHat/Cufflinks and have all the expected fields.
#3 |
Member
Location: United States Join Date: Jun 2011
Posts: 49
Thanks so much.
#4 |
Member
Location: Connecticut Join Date: May 2010
Posts: 42
Hi camelbbs,
Just saw your post and thought I'd give you a word of caution. Make sure you take a close look at your annotation. We just started to play around with the iGenomes stuff, but I'll tell you right now that our use of different annotations from UCSC (RefSeq, Ensembl, and Gencode, which should be pretty much the same as Ensembl) and from iGenomes has led to very different results in different cases. It sort of depends on your question, but make sure that the annotation you are looking at is good for the stuff you're most concerned about. One would hope that choice of annotation would be a robust parameter in these types of analyses, but we haven't found that to be the case. In the end, those of us who don't have an inordinate amount of time to spend vetting these things have to take a close look at them and then pick one to stick with. Good luck.
#6 | |
Member
Location: United States Join Date: Jun 2011
Posts: 49
So would you have a recommendation for which GTF to choose? Which one is supported by most published work? And I am curious how they link the gene annotation to alignment sequences; is it by the coordinates?
#7 |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
As far as I know GTF/GFF annotations are indeed related to the reference sequences (which is what I presume you mean by 'alignment sequences') solely by coordinate positions. This, of course, means that you need to be very careful to pick and use the GTF version that was created for your reference version.
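Since the link is purely positional, one cheap sanity check is to compare the sequence names in the GTF with those in the reference FASTA. The Python sketch below is only an illustration of that idea: the function names and the toy data are made up, and matching names are necessary but not sufficient, because they do not prove that both files come from the same assembly version.

```python
def gtf_seqnames(gtf_lines):
    """Collect the distinct values of GTF column 1 (the sequence name)."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split(None, 1)[0])
    return names

def fasta_seqnames(fasta_lines):
    """Collect sequence names from FASTA headers (text after '>' up to first space)."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }

def missing_from_reference(gtf_lines, fasta_lines):
    """Return GTF sequence names that do not occur in the reference, sorted."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_seqnames(fasta_lines))

# Toy data (hypothetical): UCSC-style names in the GTF, Ensembl-style in the FASTA.
toy_gtf = [
    'chr1\thg19_refGene\texon\t100\t200\t.\t+\t.\tgene_id "A"; transcript_id "A";',
    'chrM\thg19_refGene\texon\t10\t50\t.\t+\t.\tgene_id "B"; transcript_id "B";',
]
toy_fasta = [">1 dna:chromosome", "ACGT", ">MT dna:chromosome", "ACGT"]

print(missing_from_reference(toy_gtf, toy_fasta))
```

With real files you would pass open file handles instead of lists. A non-empty result means some annotation records cannot be placed on the reference at all.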
#8 |
Member
Location: Connecticut Join Date: May 2010
Posts: 42
Well, camelbbs, "work" is an interesting way to put it. I can let you in on what little I know, but I'm still trying to work through these things myself. Someone else out there might have more info or insight on this issue than I do.

We've looked at four different annotations so far: Gencode, Ensembl, RefSeq, and iGenomes (which should be a derivative of Ensembl, I think). We haven't really vetted Gencode or iGenomes: Gencode (hypothetically) should be very similar to Ensembl, while iGenomes we just found and have only briefly run through some stuff. So that brings us to RefSeq and Ensembl, and I think most people find those two databases generally acceptable for whatever you would be interested in.

Just from some rough calculations, Ensembl appears to have about twice as many nucleotides annotated as RefSeq. This is likely because of a higher level of isoform annotation in Ensembl, so some nucleotides may be doubly annotated. (NOTE: when you download the Ensembl ensGene GTF from UCSC and use it with Cufflinks, you end up with the transcript IDs for your genes, not the gene IDs. This can be very important depending on what you want.) Just open up a chromosome in hg19 in the UCSC genome browser and you can see how different they look.

As to why they are different, again I'm not incredibly knowledgeable here, but each of these projects uses slightly different evidence to add to its database. Mostly, I think they probably vary in two ways: 1) the computational methods they use for predicted gene tracks and 2) the curation of the databases. In the past, RefSeq was more submission-based, so the evidence requirements would have appeared to be higher, but that's all conjecture on my part. In the end, we've basically come to view RefSeq as more conservative and gene-oriented and Ensembl as more computationally developed and more transcript-oriented. Anyone else out there have some better insight than me? I would love to hear it.

To answer your other question: yes, chromosome, coordinates, and ID determine genomic location and annotation.
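To make the "doubly annotated" point concrete, here is a small Python sketch of one way such a rough count could be done; it is not necessarily how the numbers above were obtained. It compares the summed length of all exon records with the number of distinct bases they cover after merging overlaps. The function name `annotated_bases` and the toy intervals are invented for illustration.

```python
from collections import defaultdict

def annotated_bases(exons):
    """Return (summed_length, distinct_bases) for exon intervals.

    exons: iterable of (seqname, start, end) using 1-based inclusive
    coordinates, as in GTF. Hypothetical helper for illustration only.
    """
    by_seq = defaultdict(list)
    summed = 0
    for seqname, start, end in exons:
        by_seq[seqname].append((start, end))
        summed += end - start + 1  # counts shared bases once per isoform

    distinct = 0
    for intervals in by_seq.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps the current merged block: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                distinct += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        distinct += cur_end - cur_start + 1
    return summed, distinct

# Toy example: two isoforms share part of an exon, plus one separate exon.
toy_exons = [("chr1", 100, 199), ("chr1", 150, 249), ("chr1", 300, 349)]
summed, distinct = annotated_bases(toy_exons)
print("summed exon length:", summed, "distinct bases covered:", distinct)
```

An annotation with many isoforms per gene will show a large gap between the two numbers, so it matters which one is being compared between RefSeq and Ensembl.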