SEQanswers

Old 03-26-2010, 08:29 PM   #1
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88
Default Tab Delimited File Editors? (GFF to GTF)

Hello everyone,

I have a bunch of GFFs that I would like to convert into GTF format in order to provide annotation for use in Cufflinks.

Can anyone recommend a tab-delimited file editor I could use to do this? I'm not a programmer, so if coding is necessary it would have to be very basic. I've tried using Galaxy, but it changes the data I enter (mainly: "" ).

Thanks,
Brandon
Old 03-29-2010, 11:14 AM   #2
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 197
Default

How about this?

http://www.sequenceontology.org/cgi-bin/converter.cgi

Oh, sorry. I read it wrong. You want to go in the other direction.

Last edited by mgogol; 03-29-2010 at 01:34 PM.
Old 07-28-2010, 08:04 AM   #3
James
Member
 
Location: Cardiff

Join Date: Mar 2010
Posts: 23
Default

Hi Brandon,

Did you find a simple way to convert GFF to GTF? I want to do exactly the same thing, and I am also not a programmer.

Thanks, J
Old 07-28-2010, 08:26 AM   #4
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 197
Default

You can try my Perl script. I used it with a FlyBase GFF file. Note that if you want to represent ncRNAs, tRNAs, rRNAs, snRNAs or miRNAs, you'll have to manually change their type to "mRNA" in the GFF file or modify the script.

The script also expects every mRNA entry to come before the exons that belong to it. If your GFF file isn't ordered like that, you can grep the mRNA lines out, then the exon lines, and cat them together.


Code:
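#!/usr/bin/env perl
###############################
# gff2gtf.pl
#
# Parses mRNA and exon lines from a GFF3 file and prints GTF lines (for Cufflinks).
# Every mRNA line must appear before the exons that refer to it.
# 5/2010
###############################

use strict;
use warnings;
use Bio::Tools::GFF;

my $parser = Bio::Tools::GFF->new(-file => $ARGV[0], -gff_version => 3);

# transcript (mRNA) ID => gene (Parent) ID
my %gene_of;

while (my $feature = $parser->next_feature)
{
	my $type = $feature->primary_tag();
	next unless $type eq "mRNA" || $type eq "exon";

	# Both feature types point at their parent through the Parent tag.
	next unless $feature->has_tag("Parent");
	my ($parent) = $feature->get_tag_values("Parent");

	if ($type eq "mRNA")
	{
		# Exon lines often carry no ID tag, so ID is only read for mRNAs.
		next unless $feature->has_tag("ID");
		my ($id) = $feature->get_tag_values("ID");
		$gene_of{$id} = $parent;
		next;
	}

	# Exon: its parent is the transcript; look up the gene stored for it.
	my $transcript = $parent;
	my $gene = $gene_of{$transcript};
	if (!defined $gene)
	{
		warn "skipping exon: no mRNA line seen yet for $transcript\n";
		next;
	}

	# BioPerl reports strand as 1, -1 or 0; GTF wants +, - or .
	my $strand = $feature->strand() || 0;
	$strand = $strand > 0 ? "+" : $strand < 0 ? "-" : ".";

	my $seq_id = $feature->seq_id();
	my $start = $feature->start();
	my $end = $feature->end();

	# The source column is hard-coded to FlyBase; change it for other databases.
	print "$seq_id\tFlyBase\t$type\t$start\t$end\t.\t$strand\t.\tgene_id \"$gene\"; transcript_id \"$transcript\";\n";
}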
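
The reordering step mentioned above (all mRNA lines first, then the exons) could be sketched in the shell roughly as follows. This assumes a tab-delimited GFF with the feature type in column 3; the function name reorder_gff and the file names are placeholders, not part of the original script.

```shell
# reorder_gff FILE: print the mRNA lines of FILE first, then its exon lines.
# Every other feature type (gene, CDS, comment lines, ...) is dropped.
reorder_gff() {
    awk -F'\t' '$3 == "mRNA"' "$1"
    awk -F'\t' '$3 == "exon"' "$1"
}

# Hypothetical usage:
#   reorder_gff annotations.gff > mrna_then_exons.gff
#   perl gff2gtf.pl mrna_then_exons.gff > annotations.gtf
```

awk is used here rather than a plain grep so that only column 3 is tested; a grep for "exon" would also match lines that merely mention the word in their attributes.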
Old 07-31-2010, 12:42 PM   #5
James
Member
 
Location: Cardiff

Join Date: Mar 2010
Posts: 23
Default

Thanks for that. Sorry, I am a newbie to this and to Perl. How would I go about changing the script to suit my data?

This is my data:

Quote:
DDB0232428 Sequencing Center mRNA 1890 3287 . + . ID=DDB0216437;Parent=DDB_G0267178;Name=DDB0216437;description=JC1V2_0_00003: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73826.1,Inparanoid V. 5.1:DDB0216437,UniProt:Q55H43,Genome V. 2.0 ID:JC1V2_0_00003,Protein Accession Number:EAL73826.1,Protein GI Number:60475899
DDB0232428 Sequencing Center mRNA 3848 4855 . + . ID=DDB0216438;Parent=DDB_G0267180;Name=DDB0216438;description=JC1V2_0_00004: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73827.1,Inparanoid V. 5.1:DDB0216438,UniProt:Q55H42,Genome V. 2.0 ID:JC1V2_0_00004,Protein Accession Number:EAL73827.1,Protein GI Number:60475900
DDB0232428 Sequencing Center mRNA 5505 7769 . + . ID=DDB0216439;Parent=DDB_G0267182;Name=DDB0216439;description=JC1V2_0_00005: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73828.1,Inparanoid V. 5.1:DDB0216439,UniProt:Q55H60,Genome V. 2.0 ID:JC1V2_0_00005,Protein Accession Number:EAL73828.1,Protein GI Number:60475901
DDB0232428 Sequencing Center mRNA 8308 9522 . - . ID=DDB0216440;Parent=DDB_G0267184;Name=DDB0216440;description=JC1V2_0_00006: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73829.1,Inparanoid V. 5.1:DDB0216440,UniProt:Q55H61,Genome V. 2.0 ID:JC1V2_0_00006,Protein Accession Number:EAL73829.1,Protein GI Number:60475902
DDB0232428 Sequencing Center mRNA 9635 9889 . - . ID=DDB0216441;Parent=DDB_G0267186;Name=DDB0216441;description=JC1V2_0_00007: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73830.1,Inparanoid V. 5.1:DDB0216441,UniProt:Q55H59,Genome V. 2.0 ID:JC1V2_0_00007,Protein Accession Number:EAL73830.1,Protein GI Number:60475903
followed by exons after the mRNAs

Quote:
DDB0232428 Sequencing Center exon 1890 3287 . + . Parent=DDB0216437
DDB0232428 Sequencing Center exon 3848 4855 . + . Parent=DDB0216438
DDB0232428 Sequencing Center exon 5505 7769 . + . Parent=DDB0216439
DDB0232428 Sequencing Center exon 8308 9522 . - . Parent=DDB0216440
DDB0232428 Sequencing Center exon 9635 9889 . - . Parent=DDB0216441
I get this error:

James$ perl gff2gtf.pl chrm1_mRNA_exon.gff > chrm1.gtf

------------- EXCEPTION -------------
MSG: asking for tag value that does not exist ID
STACK Bio::SeqFeature::Generic::get_tag_values Bio/SeqFeature/Generic.pm:517
STACK toplevel gff2gtf.pl:16
-------------------------------------

Thanks a lot, James

Last edited by James; 07-31-2010 at 12:43 PM. Reason: edit details
Old 07-31-2010, 12:48 PM   #6
James
Member
 
Location: Cardiff

Join Date: Mar 2010
Posts: 23
Default

Oh, DDB0232428 is chrm1. I'll change it to chrm1 with sed.
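
A rough sketch of that rename, assuming the sequence ID is the first tab-delimited column. It uses awk in place of sed so that only column 1 is compared; a sed substitution anchored at the start of the line would do the same job. The function name rename_seqid and the output file name are placeholders.

```shell
# rename_seqid OLD NEW FILE: replace OLD with NEW in column 1 only,
# leaving every other column (including the attributes) untouched.
rename_seqid() {
    awk -F'\t' -v OFS='\t' -v old="$1" -v new="$2" \
        '$1 == old { $1 = new } { print }' "$3"
}

# Hypothetical usage:
#   rename_seqid DDB0232428 chrm1 chrm1_mRNA_exon.gff > chrm1_renamed.gff
```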
Old 09-13-2010, 05:06 PM   #7
BrittLF
Junior Member
 
Location: La Jolla, CA

Join Date: Sep 2010
Posts: 2
Default Similar Problem

Hi!
I'm experiencing a similar problem. I have a .gff file for my organism (Anabaena sp. strain 7120) and would like to convert it to a .gtf to use with Cufflinks.

My current format looks like this:
##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA BA000019.2
BA000019.2 DDBJ source 1 6413771 . + . organism=Nostoc sp. PCC 7120;mol_type=genomic DNA;strain=PCC 7120;db_xref=taxon:103690;note=synonym: Anabaena sp. PCC 7120
BA000019.2 DDBJ gene 1 918 . - . ID=BA000019.2:all0001
BA000019.2 DDBJ gene 6413460 6413771 . - . ID=BA000019.2:all0001
BA000019.2 DDBJ CDS 1 918 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=1
BA000019.2 DDBJ CDS 6413463 6413771 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=2
BA000019.2 DDBJ start_codon 916 918 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=1

and I need this:
AB000381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";

I tried a couple of GFF to GTF Perl converters like this one, but the ninth column never comes out right. Any help would be great.
Thanks!
Britt
Old 09-14-2010, 10:03 AM   #8
BrittLF
Junior Member
 
Location: La Jolla, CA

Join Date: Sep 2010
Posts: 2
Default bump ?

any help would be great!
Old 09-14-2010, 11:00 AM   #9
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by BrittLF
any help would be great!
Please do not bump your threads. Give it some more time and some people may answer your questions. Otherwise, keep searching.
Old 10-18-2010, 03:31 PM   #10
jbittner
Junior Member
 
Location: Torrance, CA

Join Date: Oct 2010
Posts: 3
Default

Hi,
I am also trying to convert a gff file to gtf, and am using the gff2gtf.pl script. However, I'm getting an error about the length of each line in the file:

Quote:
------------- EXCEPTION -------------
MSG: Each line of the fasta entry must be the same length except the last.
Line above #3 'LbrM01_V2_October Ge..' is 87 != 100 chars.
STACK Bio::DB::Fasta::calculate_offsets /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/Fasta.pm:770
STACK Bio::DB::Fasta::index_file /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/Fasta.pm:680
STACK Bio::DB::Fasta::new /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/Fasta.pm:491
STACK toplevel gff2gtf.pl:20
-------------------------------------

indexing was interrupted, so unlinking L_braziliensis.gff.index at /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/Fasta.pm line 1053.
The attribute column (the last column) differs for each line:
Quote:
LbrM01_V2_October GeneDB Contig 1 235333 . + . Sequence LbrM01_V2_October ; Alias LbrM01_V2_October
LbrM01_V2_October GeneDB source 1 235333 . + . source unknown_1 ; origid "Lbr.chr1" ;
LbrM01_V2_October GeneDB source 1 235333 . + . source unknown_2 ; origid "Lbr.chr1" ;
LbrM01_V2_October GeneDB CDS_parts 1272 4166 . - . mRNA LbrM01_V2.0010 ; temporary_systematic_id "LbrM01_V2.0010" ; colour "8" ; ortholog "GeneDB_Lmajor:LmjF01.0630 ||| GeneDB_Linfantum:LinJ01_V3.0650;predicted_by_orthomcl ||| GeneDB_Lmajor:LmjF01.0630;predicted_by_orthomcl" ; product "hypothetical protein, unknown function" ;
LbrM01_V2_October GeneDB CDS 1272 4166 . - . mRNA LbrM01_V2.0010 ; colour "8" ;
but I don't know how to fix this. Is there something I can use to cut down the length of the attributes to an equal number of characters?

thank you!

Last edited by jbittner; 10-18-2010 at 03:32 PM. Reason: :D made a smiley face when posted
Old 10-19-2010, 06:58 AM   #11
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 197
Default

Maybe you can get rid of some of the irrelevant lines? grep for mRNA and exon and make a new file only containing those lines? If you put your file up somewhere maybe I could take a look at it.

Same with other people having problems.

The errors come from BioPerl, so I'm having trouble figuring out what they mean. I'd have to do more testing with the script.
Old 10-19-2010, 11:12 AM   #12
jbittner
Junior Member
 
Location: Torrance, CA

Join Date: Oct 2010
Posts: 3
Default

Thank you for the idea. I am sort of new to this, so any advice really helps.

I got the GFF file off of the Sanger FTP site, and it's for the parasite Leishmania braziliensis. It's too big to upload to the forum even when I compress it. Is there another way I can get it to you?

Here is the link to where I got it: ftp://ftp.sanger.ac.uk/pub/pathogens/L_braziliensis/ (I connected as "guest", then found it through the folders Datasets/GFF)
Old 10-19-2010, 11:24 AM   #13
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 197
Default

That GFF file doesn't have exon entries and the last column doesn't have an ID tag... Do you have a source for exon level information?

If you don't, you could try running without a GTF file and just letting Cufflinks define its own transcripts.
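
One quick way to check which feature types a GFF file actually contains (for example, whether it has any exon lines at all) is to tally column 3. This is only a sketch that assumes a tab-delimited file; feature_types is a made-up helper name.

```shell
# feature_types FILE: count how often each feature type (column 3) occurs,
# skipping comment and directive lines that start with '#'.
feature_types() {
    grep -v '^#' "$1" | cut -f3 | sort | uniq -c
}

# Hypothetical usage:
#   feature_types L_braziliensis.gff
```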
Old 10-19-2010, 11:47 AM   #14
jbittner
Junior Member
 
Location: Torrance, CA

Join Date: Oct 2010
Posts: 3
Default

Unfortunately, the only exon-level information that we have found is in a .cds file, and I haven't found any way to convert this to GFF or GTF. I don't even know what that file extension means. (I found it on the same FTP site.)

Also, we are ultimately trying to get a refFlat file to use with DEGseq, so converting our GFF file to GTF was just an intermediate step in that process.

I really appreciate your help
Old 10-19-2010, 12:44 PM   #15
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 197
Default

Um. I don't know either. The cds file doesn't seem to have exon information. I've got to get back to my own work now... Good luck.
Old 11-25-2011, 12:39 AM   #16
msutada@gmail.com
Junior Member
 
Location: japan

Join Date: May 2011
Posts: 6
Default

Does anyone know of a script that really converts GFF3 to GTF2.2? I have tried some so far and none of them gave the correct format.

Any suggestion will be great.
Best,
MS
Old 11-30-2012, 05:52 AM   #17
BobFreemanMA
Junior Member
 
Location: Boston, MA

Join Date: Jan 2011
Posts: 8
Default

Check out the gffread utility, which is part of the Cufflinks suite. One of its options allows you to read GFF3 and convert it to GTF. See the info at http://cufflinks.cbcb.umd.edu/gff.html
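
A minimal sketch of that conversion: gffread's -T option switches the output to GTF, and -o names the output file. The wrapper name gff3_to_gtf and the file names are placeholders.

```shell
# gff3_to_gtf IN.gff3 OUT.gtf: convert a GFF3 file to GTF with gffread.
# Requires gffread (distributed with Cufflinks) to be on the PATH.
gff3_to_gtf() {
    gffread "$1" -T -o "$2"
}

# Hypothetical usage:
#   gff3_to_gtf annotations.gff3 annotations.gtf
```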

-Bob