#1
Member
Location: Bethesda, MD Join Date: Apr 2009
Posts: 51
I've been grabbing the "refSeq genes" table (human, hg19) from UCSC in GTF format for use with TopHat/Cufflinks.
I was just curious what everyone else is using, and whether there is a better GTF file to use.

#2
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
That's what I've been using, because I can understand how to use the UCSC table browser to output a GTF file (just change the output format). Just be mindful that it updates fairly frequently, and you might discover new annotations by looking again at something you mapped a few months ago.
Last edited by gringer; 07-13-2011 at 06:07 AM. Reason: got the URL wrong

#3
Senior Member
Location: uk Join Date: Jan 2010
Posts: 110
Ensembl provides GTF files with each release as standard, and its annotation is much more comprehensive than RefSeq alone. I have switched to using the genome and GTF from Ensembl as a result.

#4
Member
Location: Bethesda, MD Join Date: Apr 2009
Posts: 51

#5
Member
Location: Bethesda, MD Join Date: Apr 2009
Posts: 51
Other than renaming the GTF and FASTA header entries so that they matched, was there anything else that you had to do to make the GTF file work with TopHat? I'm getting an error that my GTF doesn't contain junctions.

#6
Senior Member
Location: uk Join Date: Jan 2010
Posts: 110
I didn't have to do anything else, no.
What command are you using to run TopHat?

#7
Member
Location: Bethesda, MD Join Date: Apr 2009
Posts: 51
I grabbed the reference genome in FASTA format from here --> ftp://ftp.ensembl.org/pub/release-63...o_sapiens/dna/
I grabbed the associated GTF file from here --> ftp://ftp.ensembl.org/pub/release-63/gtf/homo_sapiens/
I then processed them so that the entry names match in both files. Here are a few lines from my GTF from Ensembl:
Code:
chr18 protein_coding exon 49501 49557 . - . gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18 protein_coding CDS 49501 49557 . - 0 gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18 protein_coding start_codon 49555 49557 . - 0 gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18 protein_coding exon 49129 49237 . - . gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18 protein_coding CDS 49129 49237 . - 0 gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18 protein_coding exon 48940 49050 . - . gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18 protein_coding CDS 48940 49050 . - 2 gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18 protein_coding exon 47390 48447 . - . gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18 protein_coding CDS 47393 48447 . - 2 gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18 protein_coding stop_codon 47390 47392 . - 0 gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18 miRNA exon 48162 48272 . + . gene_id "ENSG00000221441"; transcript_id "ENST00000408514"; exon_number "1"; gene_name "AP001005.1"; transcript_name "AP001005.1-201";
chr18 protein_coding exon 158483 158714 . + . gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
chr18 protein_coding CDS 158699 158714 . + 0 gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
chr18 protein_coding start_codon 158699 158701 . + 0 gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
chr18 protein_coding exon 163308 163453 . + . gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201";
chr18 protein_coding CDS 163308 163453 . + 2 gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
chr18 protein_coding exon 166787 166819 . + . gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201";
chr18 protein_coding CDS 166787 166819 . + 0 gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
Code:
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM
Code:
[Mon Jul 18 15:15:21 2011] Reading known junctions from GTF file
Warning: TopHat did not find any junctions in GTF file
Code:
[08:34:37] Loading reference annotation.
Error: duplicate GFF ID 'ENST00000445581' encountered!
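For readers unsure what "junctions in the GTF file" means: a junction is the gap between two consecutive exons of the same transcript. Below is a minimal Python sketch that derives those gaps from exon records. It is not TopHat's actual parser, the function name `junctions_from_gtf` is made up for illustration, and it assumes well-formed, tab-separated GTF lines with nine columns.

```python
import re
from collections import defaultdict


def junctions_from_gtf(lines):
    """Return sorted (chrom, strand, intron_start, intron_end) tuples.

    Coordinates are 1-based and inclusive, as in GTF.
    Lines that do not have exactly nine tab-separated columns are skipped.
    """
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        key = (fields[0], fields[6], match.group(1))
        exons[key].append((int(fields[3]), int(fields[4])))

    junctions = set()
    for (chrom, strand, _transcript), spans in exons.items():
        spans.sort()
        # Each pair of neighbouring exons implies one intron between them.
        for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
            if right_start > left_end + 1:
                junctions.add((chrom, strand, left_end + 1, right_start - 1))
    return sorted(junctions)


# Two exon records of ENST00000308911 from the post, with tabs restored.
example = [
    'chr18\tprotein_coding\texon\t49501\t49557\t.\t-\t.\t'
    'gene_id "ENSG00000173213"; transcript_id "ENST00000308911";',
    'chr18\tprotein_coding\texon\t49129\t49237\t.\t-\t.\t'
    'gene_id "ENSG00000173213"; transcript_id "ENST00000308911";',
]
print(junctions_from_gtf(example))  # [('chr18', '-', 49238, 49500)]
```

If a check like this returns an empty list for a file that clearly contains multi-exon transcripts, the column layout of the file is a likely suspect.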

#8
Senior Member
Location: uk Join Date: Jan 2010
Posts: 110
The only difference I can see between my setup and yours is that I removed the 'chr' prefixes. There was a reason for this - but I can't remember what it was!

#9
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
Well, if chromosome 22 is anything to go by, it looks like the chromosome labels in the fasta file don't include the 'chr' bit.
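A quick way to check whether the labels agree is to compare the sequence names in the FASTA headers with the first column of the GTF. The following is a rough Python sketch; the function names are made up for illustration, and it assumes the sequence name is the first whitespace-delimited token of each FASTA header.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines ('>name description')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def gtf_names(lines):
    """Collect the values of the first (sequence name) column of a GTF."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def compare_names(fasta_lines, gtf_lines):
    """Return (names only in the GTF, names only in the FASTA), both sorted."""
    fasta, gtf = fasta_names(fasta_lines), gtf_names(gtf_lines)
    return sorted(gtf - fasta), sorted(fasta - gtf)
```

Both arguments can be open file handles, since the functions only iterate over lines. Two empty lists mean every name appears in both files.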

#10
Member
Location: Bethesda, MD Join Date: Apr 2009
Posts: 51
Well, I figured out the problem. When I renamed everything, I accidentally put an extra tab between the chromosome name and the second column in the GTF file.
I fixed it and all is well.
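A stray tab like this is easy to catch automatically, because a GTF line has exactly nine tab-separated columns. Here is a small Python sketch of such a check; the function name `check_gtf_columns` is hypothetical, and the check covers only column counts and empty columns, not full format validation.

```python
def check_gtf_columns(lines, expected=9):
    """Return (line_number, message) for lines with a wrong column layout."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != expected:
            problems.append(
                (number, f"{len(fields)} columns instead of {expected}")
            )
        elif any(field == "" for field in fields):
            problems.append((number, "empty column"))
    return problems


good = 'chr18\tprotein_coding\texon\t49501\t49557\t.\t-\t.\tgene_id "x";\n'
bad = 'chr18\t\tprotein_coding\texon\t49501\t49557\t.\t-\t.\tgene_id "x";\n'
print(check_gtf_columns([good, bad]))  # [(2, '10 columns instead of 9')]
```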

#11
Senior Member
Location: uk Join Date: Jan 2010
Posts: 110

#12
Member
Location: UK Join Date: Jan 2011
Posts: 20
@sdarko Could you explain a little how you went about "renaming the gtf and fasta header entries so that they matched", please?
I'm keen to update the GTF I use with TopHat to the Ensembl version. Many thanks for any advice you may be able to give.
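One possible way to do such a renaming, sketched in Python. This is not necessarily what sdarko did. The helper names are made up, and the mapping (prepend 'chr', turn MT into chrM) is an assumption that only makes sense for the main chromosomes; unplaced scaffolds are named differently at UCSC and Ensembl and are not handled here.

```python
def to_ucsc_name(name):
    """Hypothetical mapping: '18' -> 'chr18', 'MT' -> 'chrM'."""
    if name.startswith("chr"):
        return name
    return "chrM" if name == "MT" else "chr" + name


def rename_gtf_line(line):
    """Rename only the first column; every other byte is left untouched."""
    if not line.strip() or line.startswith("#"):
        return line
    name, tab, rest = line.partition("\t")
    return to_ucsc_name(name) + tab + rest


def rename_fasta_line(line):
    """Rename header lines, keeping only the sequence name."""
    if not line.startswith(">"):
        return line
    tokens = line[1:].split()
    if not tokens:
        return line
    return ">" + to_ucsc_name(tokens[0]) + "\n"


print(rename_gtf_line("18\tprotein_coding\texon\t49501\t49557\t.\t-\t.\tx\n"),
      end="")
```

Splitting on the first tab only and gluing the rest back unchanged avoids introducing extra separators, which is the kind of mistake described in post #10.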

#13
Senior Member
Location: Rockville, MD Join Date: Jan 2009
Posts: 126
Look here: http://cufflinks.cbcb.umd.edu/igenomes.html

#14
Senior Member
Location: Stockholm, Sweden Join Date: Feb 2008
Posts: 319
If you download both the genome FASTA and the annotation from Ensembl, you shouldn't need to rename anything.
Last edited by kopi-o; 10-24-2011 at 11:31 AM. Reason: clarity

#15
Member
Location: asia Join Date: Jul 2012
Posts: 38
Hi, gavin,
I have analysed my RNA-seq data with references (both the genome and the annotation) from UCSC and from Ensembl. I found that mapping against the UCSC reference gives many more results than mapping against the Ensembl reference. How is this possible? What confuses me even more is that the UCSC GTF is less than half the size of the Ensembl one. With a smaller reference I got more results! Do you have any idea? Thanks!
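To put numbers on how different the two annotations are, one could count the distinct genes, transcripts and feature types in each GTF. The Python sketch below does that; the function name `summarise_gtf` is made up, and the sketch only quantifies annotation size, so it does not by itself explain the difference in mapping results.

```python
import re
from collections import Counter

ATTRIBUTE = re.compile(r'(gene_id|transcript_id) "([^"]+)"')


def summarise_gtf(lines):
    """Count distinct gene_ids, transcript_ids and records per feature type."""
    genes, transcripts, features = set(), set(), Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        features[fields[2]] += 1
        for key, value in ATTRIBUTE.findall(fields[8]):
            if key == "gene_id":
                genes.add(value)
            else:
                transcripts.add(value)
    return {
        "genes": len(genes),
        "transcripts": len(transcripts),
        "features": dict(features),
    }
```

Running it on both files, for example `summarise_gtf(open(path))` with your own paths, gives a side-by-side view of what each annotation contains.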

#16
Senior Member
Location: Mexico Join Date: Mar 2011
Posts: 137
So you can supply TopHat with a GTF file of annotated transcripts using the --GTF option. Reads are then mapped to those transcripts first and to the whole genome second, with or without novel junction discovery in this second stage. As I understand it, this has been the behaviour since TopHat 1.4.
I'm curious to know how it was before 1.4. I think you could already give TopHat a GTF file, but it used it second. Am I right? If so, what is the difference between using it [the GTF file] first and using it second, after the genome?
Carmen

#17
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
The previous approach was slower (and possibly less accurate), because the junction discovery would be done on all reads, rather than just the ones that didn't map to the known transcriptome.
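For reference, here is a small Python sketch of how the corresponding TopHat command line could be assembled. The helper name `tophat_command` and all file names are placeholders. The flags used are `-G` (short for `--GTF`), `--no-novel-juncs` and `-o`; check them against the manual for your TopHat version before relying on them.

```python
def tophat_command(bowtie_index, reads, gtf=None, novel_junctions=True,
                   output_dir="tophat_out"):
    """Build an argument list for TopHat (placeholder paths, see lead-in)."""
    command = ["tophat", "-o", output_dir]
    if gtf is not None:
        # Supply the annotated transcripts.
        command += ["-G", gtf]
        if not novel_junctions:
            # Restrict junctions to those in the supplied annotation.
            command.append("--no-novel-juncs")
    command += [bowtie_index, ",".join(reads)]
    return command


print(" ".join(tophat_command("hg19", ["reads.fq"], gtf="genes.gtf",
                              novel_junctions=False)))
```

The returned list can be passed to `subprocess.run(..., check=True)` once the index, reads and GTF paths point at real files.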

#18
Senior Member
Location: Mexico Join Date: Mar 2011
Posts: 137
I see, but it could still use a GTF file at some point, right?