SEQanswers

Old 09-17-2009, 06:10 AM   #1
SOLiD_User
Junior Member
 
Location: New Jersey

Join Date: Aug 2009
Posts: 7
Default gtf file for arabidopsis

Does anyone know where I can find a gtf file for arabidopsis or a program that I can use to create one?

Thanks!
Old 09-17-2009, 09:47 AM   #2
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

A gtf file of what?
For gene annotation, maybe try TAIR or TIGR.
Old 09-17-2009, 10:06 AM   #3
SOLiD_User
Junior Member
 
Location: New Jersey

Join Date: Aug 2009
Posts: 7
Default

Yes, a gtf for gene annotation. I checked TAIR, TIGR, EMBL, etc and so far have been unable to locate a gtf file. I can only find gff files for arabidopsis. It seems EMBL has gtf files for everything except plants. I looked at gbrowse and the UCSC genome browser and I don't see a way to export as a gtf file. I have spent much time with google and I haven't found anything useful.
Old 09-17-2009, 10:18 AM   #4
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Why isn't gff ok? Are you looking for a specific field like "transcript_id:" or so?
Old 09-17-2009, 11:08 AM   #5
SOLiD_User
Junior Member
 
Location: New Jersey

Join Date: Aug 2009
Posts: 7
Default

I'm trying to get the AB whole transcriptome pipeline working and it requires the transcript_id and gene_id fields in the gtf file. I tried the gff and it didn't work. I don't have the programming skills to create a perl script so I was hoping I could download a gtf file or find an application that could create one.
Old 09-17-2009, 11:38 AM   #6
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

could you show me a few lines of the gff file(s) you have, just in case the information is available and easy to convert?
Old 09-17-2009, 12:12 PM   #7
SOLiD_User
Junior Member
 
Location: New Jersey

Join Date: Aug 2009
Posts: 7
Default

The gff files look like this. I think the major challenge in converting gff to gtf is counting the exons for each transcript.
--
Chr1 TAIR9 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR9 exon 3996 4276 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 3996 4276 . + 2 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR9 exon 4486 4605 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 4486 4605 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR9 exon 4706 5095 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 4706 5095 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR9 exon 5174 5326 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 5174 5326 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR9 exon 5439 5899 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 5439 5630 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;

the gtf file has to be in this format

supercont1.1 protein_coding CDS 2191663 2191958 . - 1 gene_id "AAEL000037"; transcript_id "AAEL000037-RA"; exon_number "4"; protein_id "AAEL000037-PA";
supercont1.1 protein_coding exon 2191201 2191600 . - . gene_id "AAEL000037"; transcript_id "AAEL000037-RA"; exon_number "5";
supercont1.1 protein_coding CDS 2191299 2191600 . - 2 gene_id "AAEL000037"; transcript_id "AAEL000037-RA"; exon_number "5"; protein_id "AAEL000037-PA";
supercont1.1 protein_coding stop_codon 2191296 2191298 . - 0 gene_id "AAEL000037"; transcript_id "AAEL000037-RA"; exon_number "5";
supercont1.1 protein_coding exon 2207362 2207580 . - . gene_id "AAEL000086"; transcript_id "AAEL000086-RA"; exon_number "1";
supercont1.1 protein_coding CDS 2207362 2207580 . - 0 gene_id "AAEL000086"; transcript_id "AAEL000086-RA"; exon_number "1"; protein_id "AAEL000086-PA";
supercont1.1 protein_coding start_codon 2207578 2207580 . - 0 gene_id "AAEL000086"; transcript_id "AAEL000086-RA"; exon_number "1";
supercont1.1 protein_coding exon 2207263 2207299 . - . gene_id "AAEL000086"; transcript_id "AAEL000086-RA"; exon_number "2";
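
The exon counting mentioned above could be sketched roughly as follows in Python. This is only an illustration, not the pipeline's own method: the function name `number_exons` and its tuple layout are made up for this example, and it assumes exons are numbered in transcription order, which is how the minus-strand AAEL000086-RA sample above is numbered (exon 1 has the higher coordinates). Numbers are relative to the exons passed in, so a truncated excerpt like the GFF sample above will not give the real exon numbers.

```python
from collections import defaultdict


def number_exons(exons):
    """Assign an exon number per transcript, in transcription order.

    `exons` is an iterable of (transcript_id, start, end, strand) tuples.
    Returns a dict mapping (transcript_id, start, end) -> exon number.
    (Hypothetical helper, written for illustration only.)
    """
    by_transcript = defaultdict(list)
    for transcript_id, start, end, strand in exons:
        by_transcript[transcript_id].append((start, end, strand))

    numbers = {}
    for transcript_id, items in by_transcript.items():
        # Minus-strand transcripts are read from high to low coordinates,
        # so their first exon is the one with the largest start.
        reverse = items[0][2] == "-"
        ordered = sorted(items, key=lambda e: e[0], reverse=reverse)
        for n, (start, end, _strand) in enumerate(ordered, start=1):
            numbers[(transcript_id, start, end)] = n
    return numbers


# Exon coordinates copied from the two samples above.
exons = [
    ("AT1G01010.1", 3996, 4276, "+"),
    ("AT1G01010.1", 4486, 4605, "+"),
    ("AAEL000086-RA", 2207263, 2207299, "-"),
    ("AAEL000086-RA", 2207362, 2207580, "-"),
]
numbers = number_exons(exons)
for key in sorted(numbers):
    print(key, numbers[key])
```

Grouping by transcript first means the input does not have to be sorted, at the cost of holding all exons in memory.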
Old 09-21-2009, 06:27 AM   #8
andreas.sjodin
Member
 
Location: Umeå, Sweden

Join Date: Apr 2009
Posts: 25
Default

I would recommend taking a look at the Python GFF parsers developed by Brad Chapman. They can be downloaded from GitHub (http://github.com/chapmanb/bcbb/tree/master/gff/). Those scripts can convert between different GFF versions. More information about his scripts can be found in some blog posts (http://bcbio.wordpress.com).
Old 09-22-2009, 04:40 AM   #9
SOLiD_User
Junior Member
 
Location: New Jersey

Join Date: Aug 2009
Posts: 7
Default

Thanks Andreas. I'll have a look at the GFF parser.
Old 12-03-2009, 09:37 AM   #10
knc
Junior Member
 
Location: SD

Join Date: Apr 2008
Posts: 3
Default gtf for arabidopsis

Hi,

Did you solve your gtf problem? You can use the gff for the first 8 fields, the last field needs to be changed to include the gene_id, transcript_id, and exon #.
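
A rough Python sketch of that idea: keep the first 8 fields and rebuild the 9th. It assumes TAIR-style IDs where the transcript is the gene ID plus a ".N" suffix (as in the Parent=AT1G01010.1 lines earlier in the thread). The function name `gff_to_gtf_line` is made up for this example, and the exon number is simply supplied by the caller here.

```python
import re

# Assumption: transcript IDs look like "<gene>.<n>", e.g. AT1G01010.1.
PARENT_RE = re.compile(r"^Parent=((\w+)\.\w+)")


def gff_to_gtf_line(line, exon_number=None):
    """Return a GTF line for one GFF line, or None if it cannot be converted.

    (Hypothetical helper, written for illustration only.)
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    match = PARENT_RE.match(fields[8])
    if match is None:
        return None
    transcript_id, gene_id = match.group(1), match.group(2)
    attributes = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    if exon_number is not None:
        attributes += f' exon_number "{exon_number}";'
    # Fields 1-8 are carried over unchanged; only the attribute column differs.
    return "\t".join(fields[:8] + [attributes])


gff_line = "Chr1\tTAIR9\texon\t3996\t4276\t.\t+\t.\tParent=AT1G01010.1"
gtf_line = gff_to_gtf_line(gff_line, exon_number=2)  # 2 is an illustrative value
print(gtf_line)
```

Lines without a Parent= attribute (gene and chromosome records, for example) come back as None, so the caller can just skip them.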
Old 12-03-2009, 10:58 AM   #11
dsidote
Member
 
Location: New Jersey

Join Date: Aug 2009
Posts: 23
Default

We were able to get what we needed. Thanks!
Old 12-03-2009, 01:27 PM   #12
knc
Junior Member
 
Location: SD

Join Date: Apr 2008
Posts: 3
Default

Quick question:
How did you deal with transcripts that have different stop codons?
Old 12-03-2009, 05:09 PM   #13
dsidote
Member
 
Location: New Jersey

Join Date: Aug 2009
Posts: 23
Default

I'm not sure I understand what you're asking. Are you referring to splice variants? If so, we wrote a perl script that reads the file line by line and counts exons for each transcript. I can send it to you if it would help. In its current form it only works for the gff file from TAIR.
Old 12-10-2009, 01:08 PM   #14
knc
Junior Member
 
Location: SD

Join Date: Apr 2008
Posts: 3
Default

Sure that would be great. We are having issues with our gtf file... The file format you refer to seems a bit different from the description on the cufflinks site (http://mblab.wustl.edu/GTF22.html). Did you validate your gtf file? We get errors when we do. Thanks for your help!
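
Before running a full validator, a quick sanity check along these lines can catch the most common problems. This is a minimal Python sketch and not a replacement for a real GTF validator: it only checks the field count, the coordinates, and that gene_id and transcript_id are present. The function name `check_gtf_line` is made up for this example.

```python
import re

# Matches attribute pairs written as: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)";')


def check_gtf_line(line):
    """Return a list of problems found in one GTF line (empty if none).

    (Hypothetical helper, written for illustration only.)
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, found {len(fields)}"]

    problems = []
    start, end = fields[3], fields[4]
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be integers")
    elif int(start) > int(end):
        problems.append("start is greater than end")

    attributes = dict(ATTR_RE.findall(fields[8]))
    for key in ("gene_id", "transcript_id"):
        if key not in attributes:
            problems.append(f"missing {key} attribute")
    return problems


good = ('supercont1.1\tprotein_coding\texon\t2207362\t2207580\t.\t-\t.\t'
        'gene_id "AAEL000086"; transcript_id "AAEL000086-RA"; exon_number "1";')
bad = "Chr1\tTAIR9\texon\t3996\t4276\t.\t+\t.\tParent=AT1G01010.1"

print(check_gtf_line(good))
print(check_gtf_line(bad))
```

An unconverted GFF line like the second one is reported as missing both attributes.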
Old 05-03-2010, 12:36 AM   #15
saha
Junior Member
 
Location: india

Join Date: Jan 2010
Posts: 5
Default

Dear all,

I am facing a similar problem. I badly need a gtf file for the TIGR rice genome v6.0 but have not been able to get one yet. I want to use this gtf file as a refgene list to upload to the Broad Institute's IGV browser.
Any help is appreciated.

regards,
Saha
Old 05-04-2010, 10:21 AM   #16
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 54
Default

I would also be very interested in generating a gtf file for arabidopsis. Would it be possible to post the code?
Old 09-06-2010, 09:40 PM   #17
celeste8
Junior Member
 
Location: China

Join Date: Aug 2010
Posts: 2
Default

Download TAIR10_GFF3_genes.gff from TAIR, and then use the Perl script below to convert this gff file to a gtf that can be used as the reference annotation for Cufflinks.

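#!/usr/bin/perl
# Convert the TAIR10 GFF3 gene annotation to a GTF usable by Cufflinks.
# Usage (the script name is arbitrary):
#   perl gff2gtf.pl TAIR10_GFF3_genes.gff > TAIR10_genes.gtf

use warnings;
use strict;

# Feature types to keep; the UTR types are renamed to their GTF names.
my %keep = (
    exon            => 'exon',
    CDS             => 'CDS',
    three_prime_UTR => '3UTR',
    five_prime_UTR  => '5UTR',
);

while (<>) {
    chomp;
    next if /^#/;    # skip comment and directive lines
    my @parts = split /\t/;
    # Skip malformed lines and feature types we do not convert.
    next unless @parts >= 9 && exists $keep{ $parts[2] };
    $parts[2] = $keep{ $parts[2] };
    # Parent=AT1G01010.1,... becomes gene_id "AT1G01010"; transcript_id "AT1G01010.1";
    $parts[8] =~ s/^Parent=((\w+)\.\w*).*/gene_id "$2"; transcript_id "$1";/s;
    print join( "\t", @parts ), "\n";
}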
Old 09-07-2010, 09:33 AM   #18
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 54
Default

celeste8, thank you very much !
Old 12-27-2013, 07:49 AM   #19
samhokin
Member
 
Location: Madison, WI

Join Date: Nov 2013
Posts: 11
Default

Thanks, celeste8!!
__________________
Sam Hokin
Computational Scientist, Department of Plant Biology, Carnegie Institution for Science, Stanford
All times are GMT -8. The time now is 12:44 PM.

