SEQanswers

Old 07-13-2011, 04:48 AM   #1
sdarko
Member
 
Location: Bethesda, MD

Join Date: Apr 2009
Posts: 51
Default Best source for GTF file for use with TopHat/Cufflinks

I've been grabbing the "refSeq genes" table (human, hg19) from UCSC in GTF file format for use with TopHat/Cufflinks.

I was just curious as to what everyone else is using and if I might find a more optimal GTF file to use.
Old 07-13-2011, 06:06 AM   #2
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

That's what I've been using, because I can understand how to use the UCSC table browser to output a GTF file (just change the output format). Just be mindful that it updates fairly frequently, and you might discover new annotations by looking again at something you mapped a few months ago.

Last edited by gringer; 07-13-2011 at 06:07 AM. Reason: got the URL wrong
Old 07-14-2011, 03:42 AM   #3
gavin.oliver
Senior Member
 
Location: uk

Join Date: Jan 2010
Posts: 110
Default

Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
Old 07-14-2011, 06:03 AM   #4
sdarko
Member
 
Location: Bethesda, MD

Join Date: Apr 2009
Posts: 51
Default

Quote:
Originally Posted by gavin.oliver View Post
Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
Thanks for the advice. I will try it out today.
Old 07-19-2011, 06:59 AM   #5
sdarko
Member
 
Location: Bethesda, MD

Join Date: Apr 2009
Posts: 51
Default

Quote:
Originally Posted by gavin.oliver View Post
Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
Did you get your genome and gtf from the ensembl website?

Other than renaming the gtf and fasta header entries so that they matched, was there anything else that you had to do to make the gtf file work with tophat? I'm getting an error that my gtf doesn't contain junctions.
Old 07-19-2011, 07:16 AM   #6
gavin.oliver
Senior Member
 
Location: uk

Join Date: Jan 2010
Posts: 110
Default

I didn't have to do anything else, no.

What command are you using to execute Tophat?
Old 07-19-2011, 07:32 AM   #7
sdarko
Member
 
Location: Bethesda, MD

Join Date: Apr 2009
Posts: 51
Default

Quote:
Originally Posted by gavin.oliver View Post
I didn't have to do anything else, no.

What command are you using to execute Tophat?
I want to make sure that I'm not messing up anything too basic first.

I grabbed the reference genome in fasta format from here --> ftp://ftp.ensembl.org/pub/release-63...o_sapiens/dna/

I grabbed the associated GTF file from here -->ftp://ftp.ensembl.org/pub/release-63/gtf/homo_sapiens/

I then processed them so that the sequence names match in both files.

Here are a few lines from my GTF from ensembl:
Code:
chr18           protein_coding  exon    49501   49557   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  CDS     49501   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18           protein_coding  start_codon     49555   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  exon    49129   49237   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  CDS     49129   49237   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18           protein_coding  exon    48940   49050   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  CDS     48940   49050   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18           protein_coding  exon    47390   48447   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  CDS     47393   48447   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18           protein_coding  stop_codon      47390   47392   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           miRNA   exon    48162   48272   .       +       .        gene_id "ENSG00000221441"; transcript_id "ENST00000408514"; exon_number "1"; gene_name "AP001005.1"; transcript_name "AP001005.1-201";
chr18           protein_coding  exon    158483  158714  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
chr18           protein_coding  CDS     158699  158714  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
chr18           protein_coding  start_codon     158699  158701  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
chr18           protein_coding  exon    163308  163453  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201";
chr18           protein_coding  CDS     163308  163453  .       +       2        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
chr18           protein_coding  exon    166787  166819  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201";
chr18           protein_coding  CDS     166787  166819  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
Here are the names of my chromosomes according to bowtie-inspect:
Code:
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
chrM
In TopHat, I get the following error:
Code:
[Mon Jul 18 15:15:21 2011] Reading known junctions from GTF file
        Warning: TopHat did not find any junctions in GTF file
In Cufflinks, I get the following error:
Code:
[08:34:37] Loading reference annotation.
Error: duplicate GFF ID 'ENST00000445581' encountered!
Old 07-19-2011, 07:38 AM   #8
gavin.oliver
Senior Member
 
Location: uk

Join Date: Jan 2010
Posts: 110
Default

The only difference I can see between my setup and yours is that I removed the 'chr' prefixes. There was a reason for this - but I can't remember what it was!
Old 07-19-2011, 07:43 AM   #9
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Well, if chromosome 22 is anything to go by, it looks like the chromosome labels in the fasta file don't include the 'chr' bit.
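
A quick way to confirm this kind of mismatch is to compare the sequence names in the FASTA headers against the first column of the GTF. The following is only a minimal sketch; the function names are made up for illustration, and it assumes the FASTA name is the first whitespace-delimited token after '>' and that the GTF is tab-separated.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines ('>name description')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(lines):
    """Collect the sequence names found in column 1 of GTF data lines."""
    names = set()
    for line in lines:
        # Skip blank lines and '#' comment lines.
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def name_mismatches(fasta_lines, gtf_lines):
    """Return (names only in the GTF, names only in the FASTA), both sorted."""
    fa = fasta_names(fasta_lines)
    gtf = gtf_names(gtf_lines)
    return sorted(gtf - fa), sorted(fa - gtf)
```

Both arguments can be open file handles, since the functions only iterate over lines. If either returned list is non-empty, the two files do not use the same names for at least some sequences.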
Old 07-19-2011, 08:09 AM   #10
sdarko
Member
 
Location: Bethesda, MD

Join Date: Apr 2009
Posts: 51
Default

Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

I fixed it and all is well.
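
For anyone hitting the same thing: a stray tab shows up as a line with ten tab-separated fields instead of the nine that GTF defines, with an empty second field. A small check along these lines would flag it. This is a sketch rather than a full GTF validator, and the function name is hypothetical.

```python
def malformed_gtf_lines(lines):
    """Return (line_number, field_count) for GTF data lines that do not have
    exactly 9 tab-separated fields, or that have an empty field among the
    first 8 columns."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Skip blank lines and '#' comment lines.
        if not text or text.startswith("#"):
            continue
        fields = text.split("\t")
        if len(fields) != 9 or any(f == "" for f in fields[:8]):
            bad.append((lineno, len(fields)))
    return bad
```

Passing an open file handle works, since the function only iterates over lines. An empty result means every data line has the expected nine columns; it says nothing about whether the values in those columns are sensible.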
Old 07-19-2011, 08:10 AM   #11
gavin.oliver
Senior Member
 
Location: uk

Join Date: Jan 2010
Posts: 110
Default

Quote:
Originally Posted by sdarko View Post
Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

I fixed it and all is well.
Glad to hear it
Old 10-18-2011, 03:07 AM   #12
hbt
Member
 
Location: UK

Join Date: Jan 2011
Posts: 20
Default

@sdarko Could you explain a little how you went about "renaming the gtf and fasta header entries so that they matched", please?
I'm keen to update the GTF I use with TopHat to the Ensembl version.

Many thanks for any advice you may be able to give.
Old 10-24-2011, 09:46 AM   #13
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

Look here: http://cufflinks.cbcb.umd.edu/igenomes.html
Old 10-24-2011, 11:18 AM   #14
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

If you download both the genome FASTA and the annotation from ENSEMBL, you shouldn't need to rename anything.
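
If the FASTA and the GTF do come from different sources, the renaming asked about above amounts to rewriting the first column of the GTF and the first token of each FASTA header so they agree. Below is a minimal sketch that converts Ensembl-style names ("18", "MT") to UCSC-style names ("chr18", "chrM"). The function names are made up for illustration, and unplaced scaffolds or patch sequences, whose names differ in other ways, are not handled.

```python
def to_ucsc_name(name):
    """Map an Ensembl-style chromosome name to a UCSC-style one."""
    if name == "MT":
        return "chrM"
    return name if name.startswith("chr") else "chr" + name


def rename_gtf_line(line, rename=to_ucsc_name):
    """Rename column 1 of a GTF data line; everything after the first tab
    is copied through untouched, so no extra tabs are introduced."""
    if not line.strip() or line.startswith("#") or "\t" not in line:
        return line
    seqname, rest = line.split("\t", 1)
    return rename(seqname) + "\t" + rest


def rename_fasta_line(line, rename=to_ucsc_name):
    """Rename the first token of a FASTA header; sequence lines pass through.
    The returned header always ends with a newline."""
    if not line.startswith(">"):
        return line
    header = line[1:].rstrip("\n")
    name, sep, description = header.partition(" ")
    return ">" + rename(name) + sep + description + "\n"
```

Applying each function line by line while copying the input file to a new file keeps the originals intact. The Bowtie index has to be rebuilt from the renamed FASTA so that the names TopHat sees match the GTF.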

Last edited by kopi-o; 10-24-2011 at 11:31 AM. Reason: clarity
Old 10-11-2012, 08:57 PM   #15
HSV-1
Member
 
Location: asia

Join Date: Jul 2012
Posts: 38
Default

Hi, gavin,
I have analysed my RNA-seq data with references (both genome and annotation) from UCSC and from Ensembl. What I found is that I get far more mapping results with the UCSC reference than with the Ensembl one. I don't understand how this is possible.
What confuses me even more is that the UCSC GTF is less than half the size of the Ensembl one. With a smaller reference I got more results!
Do you have any idea?
Thanks!

Quote:
Originally Posted by gavin.oliver View Post
Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
Old 12-14-2012, 12:36 PM   #16
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

So you can supply TopHat with a GTF file of annotated transcripts, which, using the --GTF option, will be the first place where reads are mapped, followed by the whole genome, with or without novel junction discovery in this second stage. As I understand it, this is after TopHat 1.4.
I'm curious to know how it was before 1.4. I think you could already give TopHat a GTF file, but it used it second. Am I right? If so, what is the difference between using the GTF file first and using it second, after the genome?

Carmen
Old 12-14-2012, 12:42 PM   #17
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
I'm curious to know how it was before 1.4
I'm not sure about version numbers, but tophat's previous approach didn't try mapping to a transcriptome first. The current approach assembles a transcriptome from the GTF file and maps to that first, then does the novel junction discovery after that on the substantially smaller remainder of reads.

The previous approach was slower (and possibly less accurate), because the junction discovery would be done on all reads, rather than just the ones that didn't map to the known transcriptome.
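
As a rough illustration of how the two modes look on the command line, here is a sketch that assembles a TopHat invocation as an argument list. The function name is hypothetical and the index name, reads file and GTF path are placeholders; -o, -G and --no-novel-juncs are the TopHat options assumed here, so check them against the manual for your version.

```python
def tophat_command(bowtie_index, reads, gtf,
                   output_dir="tophat_out", novel_junctions=True):
    """Build the argument list for a TopHat run that uses a GTF annotation."""
    cmd = ["tophat", "-o", output_dir, "-G", gtf]
    if not novel_junctions:
        # Restrict spliced alignment to junctions present in the annotation.
        cmd.append("--no-novel-juncs")
    # Positional arguments come last: index base name, then the reads file.
    cmd.extend([bowtie_index, reads])
    return cmd
```

The resulting list can be passed to subprocess.run. Leaving novel_junctions at its default keeps the behaviour described above: reads are mapped against the annotated transcriptome first, then junction discovery runs on what is left.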
Old 12-14-2012, 12:48 PM   #18
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

I see, but it could still use a GTF file at some point, right?
All times are GMT -8.