SEQanswers

Old 03-05-2010, 06:48 PM   #1
ChrisL
Member
 
Location: Sydney

Join Date: Nov 2009
Posts: 14
GTF file with gene name attribute for Cuffcompare

Sorry if this question has already been asked, but to get a good annotation with Cuffcompare I need a GTF file with the reference gene symbol name, such as the "Myog" example given in the Cufflinks manual. I can generate a GTF file using the UCSC Table Browser, but all genes and transcripts are named in UCSC format, e.g. "uc007cr1", which is not a helpful annotation for a biologist.

Is there an easy way to get a GTF file where the reference gene name is the gene symbol? I am working with hg19 and mm9.
Old 03-08-2010, 07:31 AM   #2
Wei-HD
Member
 
Location: Germany

Join Date: Oct 2009
Posts: 59

In the manual of Cufflinks:

"Cuffcompare Input

Cuffcompare takes Cufflinks' GTF output as input, and optionally can take a "reference" annotation (such as from Ensembl)"

Just follow the Ensembl link and you will get the GTF file for each species. Hope this helps.
Old 03-08-2010, 06:09 PM   #3
ChrisL
Member
 
Location: Sydney

Join Date: Nov 2009
Posts: 14

Yes, I have looked at Ensembl GTF files and they also lack the HGNC gene symbol attribute. Genes are identified by their Ensembl code, e.g. ENSG00000122180.
Old 03-08-2010, 09:53 PM   #4
thinkRNA
Member
 
Location: Carlsbad,CA

Join Date: Jan 2010
Posts: 94

Quote:
Originally Posted by ChrisL View Post
Yes, I have looked at Ensembl GTF files and they also lack the HGNC gene symbol attribute. Genes are identified by their Ensembl code, e.g. ENSG00000122180.
You will have to convert the Ensembl IDs to the corresponding gene symbols. Check out BioMart.
http://www.ensembl.org/biomart/martv...07ae4ce41db010
You can select the Ensembl gene ID and gene symbol attributes and download a file that will help you translate. This will require some programming.
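
The "some programming" can be quite small. Here is a minimal sketch, assuming the BioMart export was saved as a two-column, tab-separated file (Ensembl gene ID, then gene symbol) and that the GTF uses the usual `gene_id "..."` attribute syntax. The function names `load_symbol_map` and `add_gene_name` are made up for illustration and are not part of any Cufflinks tool.

```python
import re

# Matches the gene_id attribute in the GTF attribute column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def load_symbol_map(lines):
    """Build {gene_id: symbol} from tab-separated lines; rows without a symbol are skipped."""
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[0] and parts[1]:
            mapping[parts[0]] = parts[1]
    return mapping


def add_gene_name(gtf_line, mapping):
    """Append a gene_name attribute when the line's gene_id has a known symbol."""
    line = gtf_line.rstrip("\n")
    # Leave comments and lines that already carry a gene_name untouched.
    if line.startswith("#") or 'gene_name "' in line:
        return line
    match = GENE_ID_RE.search(line)
    if match is None or match.group(1) not in mapping:
        return line
    base = line.rstrip()
    sep = "" if base.endswith(";") else ";"
    symbol = mapping[match.group(1)]
    return f'{base}{sep} gene_name "{symbol}";'
```

A header row in the export simply becomes one extra, harmless entry in the mapping. To annotate a file, read the GTF line by line and write out `add_gene_name(line, mapping)` for each line.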
Old 03-08-2010, 10:26 PM   #5
Wei-HD
Member
 
Location: Germany

Join Date: Oct 2009
Posts: 59

I use MGI Batch Query to convert the Ensembl ID to gene name:
http://www.informatics.jax.org/javaw...h?page=batchQF
Old 03-11-2010, 01:20 PM   #6
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
GTF File for Cufflinks

Have you tried to download the RefSeq refFlat file in GTF format from the UCSC table browser? That might also work (and be a lot easier).
Old 03-13-2010, 07:19 PM   #7
ChrisL
Member
 
Location: Sydney

Join Date: Nov 2009
Posts: 14

Yes, I used UCSC to generate a GTF file based on RefSeq, but the RefSeq annotation is really no better. For example, the human gene "MYOG" is "NC_000001.10" in RefSeq.

If the GTF file had the gene id in a separate delimited column it would be easy to replace with the HGNC gene symbol using the UNIX join command and a lookup table. Luckily I have access to programmers as it looks like a job for a script.
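
Because `gene_id` sits inside the free-text attribute column rather than in a column of its own, `join` cannot reach it directly, but a short script can. The following is only a sketch under stated assumptions: the GTF has the standard nine tab-separated columns, and `symbols` is a lookup table (a plain dict) from the existing gene ID to the HGNC symbol. `replace_gene_id` is a hypothetical helper name.

```python
import re

# Matches the gene_id attribute inside column 9 of a GTF line.
GENE_ID_RE = re.compile(r'gene_id "([^"]*)"')


def replace_gene_id(gtf_line, symbols):
    """Swap the gene_id value for its symbol; unknown IDs are left as they are."""
    line = gtf_line.rstrip("\n")
    fields = line.split("\t")
    if len(fields) != 9:
        # Comment or malformed line: pass it through unchanged.
        return line

    def swap(match):
        old_id = match.group(1)
        return 'gene_id "%s"' % symbols.get(old_id, old_id)

    fields[8] = GENE_ID_RE.sub(swap, fields[8], count=1)
    return "\t".join(fields)
```

One caveat with this design: if two distinct genes share a symbol, overwriting `gene_id` merges them into one gene, so it may be safer to keep the original ID and add the symbol as a separate attribute instead.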
Old 03-14-2010, 07:41 AM   #8
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191

Quote:
Originally Posted by ChrisL View Post
Yes, I used UCSC to generate a GTF file based on RefSeq, but the RefSeq annotation is really no better. For example, the human gene "MYOG" is "NC_000001.10" in RefSeq.

If the GTF file had the gene id in a separate delimited column it would be easy to replace with the HGNC gene symbol using the UNIX join command and a lookup table. Luckily I have access to programmers as it looks like a job for a script.
Did you use the refGene table or the refFlat table?
Old 03-14-2010, 02:59 PM   #9
ChrisL
Member
 
Location: Sydney

Join Date: Nov 2009
Posts: 14

Brilliant! That worked.

Thanks RockChalkJayhawk.

Chris
Old 04-14-2011, 01:01 PM   #10
genbio64
Member
 
Location: New York

Join Date: Dec 2009
Posts: 42

@RockChalkJayhawk or ChrisL,
Can one of you elaborate on that workflow?
Old 04-15-2011, 09:28 AM   #11
jbrwn
Member
 
Location: Denver, CO

Join Date: Mar 2011
Posts: 37

Quote:
Originally Posted by genbio64 View Post
@RockChalkJayhawk or ChrisL,
Can one of you elaborate on that workflow?
In the UCSC Table Browser, choose RefSeq Genes for the track, then the refFlat table, and download it in GTF format.
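
If the refFlat table is downloaded as plain tab-separated text instead of GTF, the conversion could be sketched as below. This assumes the standard refFlat column order (geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds) with 0-based start coordinates. `refflat_to_gtf` is a hypothetical helper; it emits exon records only and is not how the Table Browser itself produces GTF.

```python
def refflat_to_gtf(row, source="refFlat"):
    """Convert one refFlat row into a list of GTF exon lines keyed by gene symbol."""
    fields = row.rstrip("\n").split("\t")
    if len(fields) < 11:
        # Header, comment, or truncated row: nothing to convert.
        return []
    gene_name, transcript, chrom, strand = fields[0], fields[1], fields[2], fields[3]
    # exonStarts/exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in fields[9].split(",") if x]
    ends = [int(x) for x in fields[10].split(",") if x]
    attrs = (
        f'gene_id "{gene_name}"; transcript_id "{transcript}"; '
        f'gene_name "{gene_name}";'
    )
    lines = []
    for start, end in zip(starts, ends):
        # refFlat starts are 0-based; GTF coordinates are 1-based and inclusive.
        lines.append(
            "\t".join([chrom, source, "exon", str(start + 1), str(end), ".", strand, ".", attrs])
        )
    return lines
```

A transcript that maps to more than one locus appears as several refFlat rows with the same name, so the resulting `transcript_id` values would need to be made unique before using the file.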
Old 04-15-2011, 09:53 AM   #12
filippos
Junior Member
 
Location: boston

Join Date: Feb 2011
Posts: 4

Hi everyone,
I've been searching for an answer to my question but could not find one, and this thread comes close to my problem. I downloaded one GTF file with the Ensembl annotation and the one you propose here. I used the same GTF in the TopHat, Cufflinks and Cuffcompare steps, but the final output from Cuffdiff does not contain either of the two annotations. I thought that I had to do another step to match the statistical analysis with the annotation, but I cannot find what that step is. As they are now, the data mean nothing unless I manually match the Cufflinks names with the Ensembl ones.
Could someone please explain what I am doing wrong?
Thank you very much,
Filippos
Old 04-15-2011, 01:49 PM   #13
jbrwn
Member
 
Location: Denver, CO

Join Date: Mar 2011
Posts: 37

Quote:
Originally Posted by filippos View Post
Hi everyone,
I've been searching for an answer to my question but could not find one, and this thread comes close to my problem. I downloaded one GTF file with the Ensembl annotation and the one you propose here. I used the same GTF in the TopHat, Cufflinks and Cuffcompare steps, but the final output from Cuffdiff does not contain either of the two annotations. I thought that I had to do another step to match the statistical analysis with the annotation, but I cannot find what that step is. As they are now, the data mean nothing unless I manually match the Cufflinks names with the Ensembl ones.
Could someone please explain what I am doing wrong?
Thank you very much,
Filippos
you may want other people to verify anything i say, but this is what i think.

make sure you add "chr" to column 1 of your ensembl reference. then use that reference to make your combined gtf in cuffcompare.
Code:
cuffcompare -r ensembl.gtf ensembl.gtf ensembl.gtf
run cufflinks with resultant stdout.combined.gtf
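
Adding the prefix can be scripted. A minimal sketch, assuming Ensembl-style names in column 1 (`1`, `2`, ..., `X`, `Y`, `MT`) and UCSC-style names in the alignments (`chr1`, `chrX`, `chrM`). `add_chr_prefix` is a hypothetical helper, and any other sequence name is passed through unchanged because its UCSC equivalent cannot be derived by simple prefixing.

```python
def add_chr_prefix(gtf_line):
    """Rename an Ensembl-style sequence name in column 1 to its UCSC-style name."""
    line = gtf_line.rstrip("\n")
    if not line or line.startswith("#"):
        return line
    seqname, sep, rest = line.partition("\t")
    if seqname == "MT":
        # UCSC calls the mitochondrial genome chrM rather than chrMT.
        seqname = "chrM"
    elif seqname.isdigit() or seqname in ("X", "Y"):
        seqname = "chr" + seqname
    # Anything else (scaffolds, names that already start with "chr") is left alone.
    return seqname + sep + rest
```

Apply it line by line while copying the GTF to a new file, then check that every name in column 1 of the result also occurs in the aligned reads.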
Old 04-15-2011, 02:28 PM   #14
filippos
Junior Member
 
Location: boston

Join Date: Feb 2011
Posts: 4

Thank you jbrwn for your answer.
The first lines of the ensembl GTF that I'm using are:

NT_166433 protein_coding exon 11955 12166 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
NT_166433 protein_coding CDS 12026 12166 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201"; protein_id "ENSMUSP00000100851";
NT_166433 protein_coding start_codon 12026 12028 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
NT_166433 protein_coding exon 16677 16841 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "2"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
NT_166433 protein_coding CDS 16677 16841 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "2"; gene_name "AC007307.1"; transcript_name "AC007307.1-201"; protein_id "ENSMUSP00000100851";
NT_166433 protein_coding exon 17745 17814 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "3"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";

At some point (around line 100) the thing changes to:

18 protein_coding exon 3122455 3123465 . - . gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
18 protein_coding CDS 3122495 3123412 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201"; protein_id "ENSMUSP00000129804";
18 protein_coding start_codon 3123410 3123412 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
18 protein_coding stop_codon 3122492 3122494 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
18 protein_coding exon 3327492 3327589 . - . gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020";
18 protein_coding CDS 3327492 3327535 . - 0 gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020"; protein_id "ENSMUSP00000118267";
18 protein_coding start_codon 3327533 3327535 . - 0 gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020";
18 protein_coding exon 3325359 3325476 . - . gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "2"; gene_name "Crem"; transcript_name "Crem-020";

The file came from the UCSC Table browser.
I guess that I should add the "chr" before the "18" in the above lines and probably delete the first 100 lines? The first time I tried to use this file, TopHat rejected it because it had some kind of duplicate entries. Is it possible that the first lines are problematic? Is there an easy way to add the "chr" to all the lines? I am really new to all this.
Thanks again for the quick reply and excuse me for asking such obvious questions.
Filippos
Old 04-15-2011, 02:44 PM   #15
jbrwn
Member
 
Location: Denver, CO

Join Date: Mar 2011
Posts: 37

oh, i should have specified that my instructions applied to human, as i'm not familiar with other organisms. my aligned reads come out of tophat as "chr1" and "chrX", which is why i treated my ensembl reference the way i did in my previous reply. i don't know what you'll need to do with NT_166433 or MT. take a look at your reads or wait till someone comes along who's worked with mice.
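
"Taking a look at your reads" can be reduced to comparing sequence names. The sketch below assumes you already have the list of reference names the reads were aligned to (for example from the `@SQ` lines printed by `samtools view -H`, or from the FASTA headers of the genome index). `seqname_counts` and `unmatched_seqnames` are hypothetical helper names.

```python
from collections import Counter


def seqname_counts(gtf_lines):
    """Count GTF records per sequence name (column 1), skipping comments and blanks."""
    counts = Counter()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            counts[line.split("\t", 1)[0]] += 1
    return counts


def unmatched_seqnames(gtf_lines, reference_names):
    """Return the sorted GTF sequence names that are absent from the reference."""
    known = set(reference_names)
    return sorted(name for name in seqname_counts(gtf_lines) if name not in known)
```

Any name this reports (such as a bare `18` or a scaffold ID) belongs to records that cannot line up with the alignments until the names are reconciled or the records are dropped.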
Old 04-15-2013, 07:21 AM   #16
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

Quote:
Originally Posted by thinkRNA View Post
You will have to convert the Ensembl IDs to the corresponding gene symbols. Check out BioMart.
http://www.ensembl.org/biomart/martv...07ae4ce41db010
You can select the Ensembl gene ID and gene symbol attributes and download a file that will help you translate. This will require some programming.

Thank you so much. This is very helpful.
All times are GMT -8.