**gringer** · 07-13-2011, 05:06 AM

That's what I've been using, because I can understand how to use the UCSC table browser to output a GTF file (just change the output format). Just be mindful that it updates fairly frequently, and you might discover new annotations by looking again at something you mapped a few months ago.

**gavin.oliver** · 07-14-2011, 02:42 AM

Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

**sdarko** · 07-14-2011, 05:03 AM

Originally posted by gavin.oliver:

Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

Thanks for the advice. I will try it out today.

**sdarko** · 07-19-2011, 05:59 AM

Originally posted by gavin.oliver:

Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

Did you get your genome and gtf from the ensembl website?

Other than renaming the gtf and fasta header entries so that they matched, was there anything else that you had to do to make the gtf file work with tophat? I'm getting an error that my gtf doesn't contain junctions.

**gavin.oliver** · 07-19-2011, 06:16 AM

I didn't have to do anything else, no.

What command are you using to execute Tophat?

**sdarko** · 07-19-2011, 06:32 AM

Originally posted by gavin.oliver:

I didn't have to do anything else, no.

What command are you using to execute Tophat?

I want to make sure that I'm not messing up anything too basic first.

I grabbed the reference genome in fasta format from here --> ftp://ftp.ensembl.org/pub/release-63...o_sapiens/dna/

I grabbed the associated GTF file from here -->ftp://ftp.ensembl.org/pub/release-63/gtf/homo_sapiens/

I then process them so that the entry names match in both files.

Here are a few lines from my GTF from ensembl:

```text
chr18           protein_coding  exon    49501   49557   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  CDS     49501   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18           protein_coding  start_codon     49555   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  exon    49129   49237   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  CDS     49129   49237   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18           protein_coding  exon    48940   49050   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  CDS     48940   49050   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18           protein_coding  exon    47390   48447   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           protein_coding  CDS     47393   48447   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
chr18           protein_coding  stop_codon      47390   47392   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
chr18           miRNA   exon    48162   48272   .       +       .        gene_id "ENSG00000221441"; transcript_id "ENST00000408514"; exon_number "1"; gene_name "AP001005.1"; transcript_name "AP001005.1-201";
chr18           protein_coding  exon    158483  158714  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
chr18           protein_coding  CDS     158699  158714  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
chr18           protein_coding  start_codon     158699  158701  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
chr18           protein_coding  exon    163308  163453  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201";
chr18           protein_coding  CDS     163308  163453  .       +       2        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
chr18           protein_coding  exon    166787  166819  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201";
chr18           protein_coding  CDS     166787  166819  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
```

Here are the names of my chromosomes according to bowtie-inspect:

```text
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
chrM
```

In TopHat, I get the following error:

```text
[Mon Jul 18 15:15:21 2011] Reading known junctions from GTF file
        Warning: TopHat did not find any junctions in GTF file
```

In Cufflinks, I get the following error:

```text
[08:34:37] Loading reference annotation.
Error: duplicate GFF ID 'ENST00000445581' encountered!
```

**gavin.oliver** · 07-19-2011, 06:38 AM

The only difference I can see between my setup and yours is that I removed the 'chr' prefixes. There was a reason for this - but I can't remember what it was!

**gringer** · 07-19-2011, 06:43 AM

Well, if chromosome 22 is anything to go by, it looks like the chromosome labels in the fasta file don't include the 'chr' bit.
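A quick way to confirm this kind of mismatch is to compare the sequence names in the FASTA headers with the first column of the GTF. The following is only a sketch: the function names are made up for illustration, and it assumes the FASTA name is the text between `>` and the first whitespace and that the GTF is tab-delimited.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines (text after '>' up to first whitespace)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(lines):
    """Collect the first (seqname) column of a tab-delimited GTF, skipping blanks and comments."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_lines, gtf_lines):
    """Return (names only in the GTF, names only in the FASTA), each sorted."""
    fa = fasta_names(fasta_lines)
    gtf = gtf_names(gtf_lines)
    return sorted(gtf - fa), sorted(fa - gtf)
```

Both arguments can be open file handles, e.g. `compare_names(open("genome.fa"), open("genes.gtf"))` (hypothetical file names). Two empty lists mean every name appears in both files.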

**sdarko** · 07-19-2011, 07:09 AM

Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

I fixed it and all is well.
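For anyone hitting the same thing: a stray tab like this can be caught by checking that every GTF line splits into exactly nine tab-separated fields, none of them empty. A minimal sketch (the function name is hypothetical, and it assumes lines starting with `#` are comments):

```python
def check_gtf_columns(lines):
    """Return (line_number, message) pairs for lines that are not 9 non-empty tab-separated fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append((number, f"expected 9 fields, found {len(fields)}"))
        elif any(field == "" for field in fields):
            problems.append((number, "empty field"))
    return problems
```

An extra tab after the chromosome name makes the line split into ten fields with an empty second one, so it gets reported with its line number.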

**gavin.oliver** · 07-19-2011, 07:10 AM

Originally posted by sdarko:

Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

I fixed it and all is well.

Glad to hear it

**hbt** · 10-18-2011, 02:07 AM

@sdarko Could you explain a little about how you went about "renaming the gtf and fasta header entries so that they matched", please?
I'm keen to update the GTF I use with TopHat to the Ensembl version.

Many thanks for any advice you may be able to give.

**shurjo** · 10-24-2011, 08:46 AM

Look here: http://cufflinks.cbcb.umd.edu/igenomes.html

**kopi-o** · 10-24-2011, 10:18 AM

If you download both the genome FASTA and the annotation from ENSEMBL, you shouldn't need to rename anything.
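If the genome and annotation do come from different sources (for example an Ensembl GTF with a UCSC-style genome), one option is to rewrite the first column of the GTF. The sketch below is an illustration rather than the method anyone in this thread used: the function name and the `special` mapping are assumptions, and it only handles the `chr` prefix plus the mitochondrial name (`MT` in Ensembl, `chrM` in UCSC). Unplaced scaffolds are named differently in the two sources and would need their own mapping.

```python
def add_chr_prefix(line, special=None):
    """Rewrite the seqname column of one GTF line from Ensembl style to UCSC style."""
    if special is None:
        special = {"MT": "chrM"}  # names that need more than a plain prefix
    if not line.strip() or line.startswith("#"):
        return line  # leave blank lines and comments untouched
    name, sep, rest = line.partition("\t")
    if name in special:
        new_name = special[name]
    elif name.startswith("chr"):
        new_name = name  # already prefixed
    else:
        new_name = "chr" + name
    # Rejoin with the single original tab so no extra separator is introduced.
    return new_name + sep + rest
```

Applied line by line, e.g. `out.writelines(add_chr_prefix(l) for l in open("genes.gtf"))` with `out` an open output file, this changes only the first column and leaves the tab layout as it was.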

**HSV-1** · 10-11-2012, 07:57 PM

Hi gavin,
I have analysed my RNA-seq data with references (both genome and annotation) from UCSC and from Ensembl. I get many more mapping results with the UCSC reference than with the Ensembl one, and I don't understand how that is possible.
What confuses me even more is that the UCSC GTF is less than half the size of the Ensembl one. With the smaller reference I got more results!
Do you have any idea why?
Thanks!

Originally posted by gavin.oliver:

Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

Best source for GTF file for use with TopHat/Cufflinks