SEQanswers

Old 11-23-2009, 10:08 AM   #1
genec
Member
 
Location: San Francisco, CA, USA

Join Date: Oct 2009
Posts: 12
Default Ensembl gtf to gff3 for tophat

I found a number of questions about finding a gff3 format file for use by tophat and couldn't find any good answers. I found a few gff3 converters but they were part of larger packages or online tools. Since I'd prefer something simpler, I wrote the attached gtf to gff converter for use with Ensembl's gtf file.

Feel free to use, modify, or distribute as you need.

Gene
Attached Files
File Type: pl ensembl_gtf_to_gff.pl (2.1 KB, 1499 views)
Old 11-23-2009, 01:16 PM   #2
HTS
Member
 
Location: Toronto

Join Date: Nov 2009
Posts: 24
Default

Thanks a lot for the coding effort and for sharing your script! But are you aware of this one <http://song.cvs.sourceforge.net/viewvc/song/software/scripts/gtf2gff3/>, which has been out there for quite a while? If yes, any improvements upon it? That tool works fine for me, although it does require a large amount of memory...

-- Leo
Old 11-23-2009, 01:23 PM   #3
genec
Member
 
Location: San Francisco, CA, USA

Join Date: Oct 2009
Posts: 12
Default

Yes, I had tried that gtf2gff3 script, but it wasn't working right for me. Maybe I didn't configure it correctly.

The script I posted has trivial memory requirements since it only holds one gene's worth of data in memory at once. All the exons for a gene are assumed to be located together in the gtf file, which seems to hold true for the Ensembl file. This script won't work for non-Ensembl gtf files without modification.
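The one-gene-at-a-time idea can be sketched in Python roughly as follows. This is a minimal illustration, not the attached Perl script: the names `gene_id_of` and `genes` are made up for this example, and it assumes tab-separated GTF lines in which all records for a gene are adjacent.

```python
import itertools
import re
from typing import Iterable, Iterator, List, Tuple

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_id_of(line: str) -> str:
    """Return the gene_id from the attribute column (column 9) of a GTF line."""
    attributes = line.rstrip("\n").split("\t")[8]
    match = GENE_ID.search(attributes)
    if match is None:
        raise ValueError(f"no gene_id in line: {line!r}")
    return match.group(1)


def genes(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (gene_id, records) for one gene at a time.

    Only the records of the current gene are held in memory. If a gene's
    lines are not adjacent, it is yielded more than once.
    """
    records = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    for gene_id, group in itertools.groupby(records, key=gene_id_of):
        yield gene_id, list(group)
```

Passing an open file handle as `lines` streams the file, so memory use depends on the largest gene rather than on the size of the annotation.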

Gene
Old 11-23-2009, 01:42 PM   #4
HTS
Member
 
Location: Toronto

Join Date: Nov 2009
Posts: 24
Default

I see. Thanks for the explanation! gtf2gff3 probably didn't work for you because the chromosome names were still in the Ensembl convention rather than the UCSC convention. I forgot that I also wrote a small script to convert them (and to filter the downloaded GTF file to suit my needs) before running gtf2gff3 with the default configuration. I guess the real difference is that gtf2gff3 doesn't assume any particular ordering of the lines, so it loads everything into memory and tries to figure out appropriate gene models from there. Since Ensembl GTF files do group lines by gene and transcript, it makes sense to exploit that property.
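A renaming step of this kind might look like the sketch below. It is not the script mentioned in the post; `ensembl_to_ucsc` and `rename_gtf_line` are hypothetical names, and the mapping only covers primary chromosomes (unplaced scaffolds and patches would need a lookup table).

```python
def ensembl_to_ucsc(name: str) -> str:
    """Map an Ensembl-style sequence name to a UCSC-style one.

    Assumption: primary chromosomes only, e.g. 1..22, X, Y, MT in human
    or roman numerals in C. elegans.
    """
    if name.startswith("chr"):
        return name        # already UCSC-style
    if name == "MT":
        return "chrM"      # UCSC names the mitochondrial genome chrM
    return "chr" + name


def rename_gtf_line(line: str) -> str:
    """Rewrite column 1 of a tab-separated GTF line; pass comments through."""
    if line.startswith("#") or not line.strip():
        return line
    seqname, rest = line.split("\t", 1)
    return ensembl_to_ucsc(seqname) + "\t" + rest
```

Whichever convention is chosen, the same names have to appear in the genome index, the annotation, and the alignments.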
Old 01-07-2010, 03:18 PM   #5
seqfast
Member
 
Location: SF Bay Area

Join Date: Aug 2008
Posts: 16
Default script looks great, need help for c elegans

Thanks for the script, it looks great and works well for the human GTF. I'm working with C. elegans GTF files (from Ensembl), and the ENSG* strings aren't there. I'm not a regex expert, so I figured I'd ask whether it would be an easy fix to support the C. elegans GTF files. I like this script for its simplicity, though I could use the other one mentioned in this thread if need be. Here is a snippet; I've also attached it in case of formatting issues. Thanks!

-sf

I snoRNA exon 3747 3909 . - . gene_id "Y74C9A.6"; transcript_id "Y74C9A.6"; exon_number "1"; gene_name "Y74C9A.6"; transcript_name "NR_001477.2";
I protein_coding exon 10095 10232 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
I protein_coding CDS 10095 10148 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
I protein_coding start_codon 10146 10148 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
I protein_coding exon 9727 9846 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "2"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
I protein_coding CDS 9727 9846 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "2"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
I protein_coding exon 6037 6327 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "3"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
I protein_coding CDS 6037 6327 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "3"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
I protein_coding exon 5195 5296 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "4"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
I protein_coding CDS 5195 5296 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "4"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
I protein_coding exon 4124 4358 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "5"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
I protein_coding CDS 4224 4358 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "5"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
Attached Files
File Type: txt c_elegans_WS190.54.test.gtf.txt (1.9 KB, 43 views)
Old 01-08-2010, 10:01 AM   #6
genec
Member
 
Location: San Francisco, CA, USA

Join Date: Oct 2009
Posts: 12
Default

See the attached updated script. I modified it to work with your C. elegans file. I believe it works, but check the output carefully to make sure that everything is processed correctly.
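For readers without the attachment, one way to avoid depending on an ENSG prefix is to parse the attribute column generically and read the identifiers by key. The Python sketch below is an illustration only, not the attached script; `parse_attributes` and `exon_to_gff3` are hypothetical names, and emitting just a `Parent=` attribute for each exon is an assumption about what a minimal GFF3 line needs.

```python
import re
from typing import Dict

# Matches pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(column: str) -> Dict[str, str]:
    """Turn the GTF attribute column into a dict.

    Works for any identifier style (ENSG..., Y74C9A.3, ...). If a key
    occurs more than once, the last value wins.
    """
    return dict(ATTRIBUTE.findall(column))


def exon_to_gff3(line: str) -> str:
    """Rewrite column 9 of a GTF exon line as a GFF3 Parent attribute."""
    cols = line.rstrip("\n").split("\t")
    attrs = parse_attributes(cols[8])
    cols[8] = f'Parent={attrs["transcript_id"]}'
    return "\t".join(cols)
```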

Gene
Attached Files
File Type: pl gtf_to_gff.pl (2.1 KB, 305 views)
Old 01-10-2010, 05:06 AM   #7
seqfast
Member
 
Location: SF Bay Area

Join Date: Aug 2008
Posts: 16
Default thank you!

Thanks very much, this works well. I had something similar but was getting hung up in the details. I much appreciate people making these useful scripts available. Thanks, Gene!

-sf
Old 01-15-2010, 09:58 AM   #8
mdimon
Member
 
Location: San Francisco

Join Date: Jan 2010
Posts: 10
Default thank you! (and a little bug?)

Thanks for the script! The C. elegans version is great for other GTF files downloaded from UCSC also.

I did notice what appears to be a little bug:
push @trs, [@exons];
should be added before the final
process(@trs);

(I am not a perl expert, I'm more of a python type, so I may be wrong, but until I added this line the last record from the GTF file didn't get printed to the GFF3 file.)
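Since Python came up: the same pattern, and the same pitfall, can be sketched as below. This is a generic illustration with made-up names (`group_adjacent`, `emit`), not a translation of the Perl script. A loop that emits a group whenever the key changes never sees a "change" after the final record, so the last group has to be flushed explicitly after the loop.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def group_adjacent(records: Iterable[Tuple[str, str]],
                   emit: Callable[[str, List[str]], None]) -> None:
    """Call emit(key, values) once per run of adjacent records sharing a key."""
    current_key: Optional[str] = None
    current: List[str] = []
    for key, value in records:
        if current and key != current_key:
            emit(current_key, current)  # key changed: flush the finished group
            current = []
        current_key = key
        current.append(value)
    if current:
        # Flush the last group. Without this step the final record(s)
        # would silently be dropped from the output.
        emit(current_key, current)
```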

-- Michelle
Old 01-20-2010, 09:13 AM   #9
genec
Member
 
Location: San Francisco, CA, USA

Join Date: Oct 2009
Posts: 12
Default Bug fix

That was a good catch, Michelle. I'm attaching a fixed gtf_to_gff.pl. The previous version dropped the very last gene in the gtf file.

Gene
Attached Files
File Type: pl gtf_to_gff.pl (2.1 KB, 948 views)
Old 01-20-2010, 11:45 AM   #10
telos
Member
 
Location: London

Join Date: Jan 2010
Posts: 11
Default MT -> chrM

The script should change MT in the Ensembl GTF to chrM, not chrMT, for compatibility with TopHat.
Old 01-20-2010, 12:05 PM   #11
genec
Member
 
Location: San Francisco, CA, USA

Join Date: Oct 2009
Posts: 12
Default

Yeah, the MT/M thing is always an issue. Both MT and M will work, so neither one is the right one; you just have to be consistent from the beginning.

Gene
Old 01-20-2010, 12:43 PM   #12
telos
Member
 
Location: London

Join Date: Jan 2010
Posts: 11
Default

OK, fair enough. I encountered the problem when comparing the SAM output with the GFF file from your script. It's nothing a regexp can't solve, but it would still be nice if the file produced by your script were entirely consistent with the TopHat SAM output.
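A quick consistency check of this kind can be sketched in Python. The function names below are hypothetical, and the sketch assumes the SAM text includes its header (the `@SQ` lines, whose `SN:` field carries the reference sequence name) and that the GFF is tab-separated.

```python
from typing import Iterable, Set


def sam_reference_names(sam_lines: Iterable[str]) -> Set[str]:
    """Collect the SN: values from the @SQ header lines of a SAM file."""
    names = set()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header ends at the first alignment record
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def gff_seqids(gff_lines: Iterable[str]) -> Set[str]:
    """Collect column 1 (the sequence name) of every GFF feature line."""
    return {ln.split("\t", 1)[0] for ln in gff_lines
            if ln.strip() and not ln.startswith("#")}


def unmatched_seqids(sam_lines: Iterable[str],
                     gff_lines: Iterable[str]) -> Set[str]:
    """Return GFF sequence names that the SAM header does not declare."""
    return gff_seqids(gff_lines) - sam_reference_names(sam_lines)
```

A non-empty result, such as a lone `chrMT`, points at exactly the kind of naming mismatch discussed above.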
Old 05-27-2011, 08:37 AM   #13
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi telos,

Do you know how to make TopHat produce accepted_hits.sam?
After I run TopHat, why does it only generate accepted_hits.bam?
Thanks for any advice.
Old 05-27-2011, 08:48 AM   #15
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

It's fairly simple to convert BAM to SAM using samtools.

$ samtools view -h -o accepted_hits.sam accepted_hits.bam
Old 05-27-2011, 04:20 PM   #16
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Thanks chadn737,
To run Cufflinks with its default settings, must I include or exclude the "-h" option?
e.g.
Code:
samtools view input.bam > output.sam
Thanks again.
Old 05-29-2011, 09:15 PM   #17
reut
Member
 
Location: Israel

Join Date: Oct 2010
Posts: 19
Default you can run cufflinks with the .bam file

Quote:
Originally Posted by edge View Post
Thanks chadn737,
To run Cufflinks with its default settings, must I include or exclude the "-h" option?
e.g.
Code:
samtools view input.bam > output.sam
Thanks again.
You don't have to convert accepted_hits.bam to .sam for Cufflinks; it works with the BAM file as well, which is better, since the BAM file is compressed and therefore a lot smaller than the SAM file.
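To tell which of the two formats a given file is in, the compression itself can be used. BAM is stored in BGZF, which standard gzip readers can decompress, and the decompressed stream starts with the magic bytes `BAM\1`. The Python sketch below relies only on that; `looks_like_bam` is a hypothetical helper name, not part of samtools, TopHat, or Cufflinks.

```python
import gzip


def looks_like_bam(path: str) -> bool:
    """Return True if the file decompresses to data starting with b"BAM\\x01".

    A plain-text SAM file is not gzip-compressed, so reading it through
    gzip fails and the function returns False.
    """
    with open(path, "rb") as raw:
        try:
            return gzip.GzipFile(fileobj=raw).read(4) == b"BAM\x01"
        except (OSError, EOFError):
            return False
```

This only inspects the first four decompressed bytes; it does not validate the rest of the file.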

Tags
converter, gff, gtf, tophat

All times are GMT -8.

