
**Boel** · 12-09-2010, 02:38 PM

Hi Abhijit,

It depends a bit on how you want your GTF file to look. Below is a simple example, but it adds the attributes that are required in a GTF file.

Code:

import sys

# For GFF files with only one isoform per gene.
# Arguments: input GFF path, output GTF path.
# Note: if several transcripts share the same name, they would need
# transcript-specific names (as in Scripture's segment files); not handled here.


def get_transcript(attributes):
    # Column 9 is expected to look like "mRNA <name>;" or "exon <name>;".
    # Remove the feature-type prefix explicitly: str.strip('exon ') strips any
    # of those characters from both ends and can eat into the name itself.
    name = attributes.strip().rstrip(';')
    for prefix in ('mRNA ', 'exon '):
        if name.startswith(prefix):
            name = name[len(prefix):]
    return name.strip()


transcript = ''
nE = 0  # exon counter within the current transcript

with open(sys.argv[1]) as infile, open(sys.argv[2], 'w') as outfile:
    for line in infile:
        if not line.startswith('chr'):
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'exon':  # only take lines for exons
            continue
        name = get_transcript(fields[8])
        if name == transcript:  # another exon of the same gene
            nE += 1
        else:  # new gene
            transcript = name
            nE = 1
        attributes = 'gene_id "%s"; transcript_id "%s"; exon_number "%d";' % (
            transcript, transcript, nE)
        outfile.write('\t'.join(fields[:8] + [attributes]) + '\n')

Additionally, if you have several isoforms per gene (which is usually the case), you might want to add the gene annotation as well (not included in GFF files). The example above is how I did it in one specific case.

**gen2prot** · 12-13-2010, 11:35 AM

Hi Boel,

Thanks for the reply. So which features are essential to extract? From your post and others on the net, I found that the following features are essential:

gene
CDS
exon
mRNA

In addition, the 9th column should have gene_id, transcript_id, and exon_number. Is that OK? Moreover, I am confused about how Cufflinks will report gene expression given this GTF file. Will it report it per transcript or per gene? Do you have any idea or experience with this?

Thanks
Abhijit

**gen2prot** · 12-13-2010, 01:06 PM

Hi Boel,

I extracted the "gene", "mRNA", "exon", and "CDS" features from a GFF file for Drosophila. Here is what it looks like:

2L FlyBase gene 7529 9484 . + . ID=FBgn0031208;Name=CG11023;Ontology_term=SO:0000010,SO:0000087,GO:0008234,GO:0006508;Dbxref=FlyBase:FBan0011023,FlyBase_Annotation_IDs:CG11023,GB_protein:ACZ94128,GB_protein:AAO41164,GB:AI944728,GB:AJ564667,GB_protein:CAD92822,GB:BF495604,UniProt/TrEMBL:Q6KEV3,UniProt/TrEMBL:Q86BM6,INTERPRO:IPR003653,EntrezGene:33155,BIOGRID:59420,FlyAtlas:CG11023-RA,GenomeRNAi_gene:33155;gbunit=AE014134;derived_computed_cyto=21A5-21A5
2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=REFSEQ:NM_001169365,FlyBase_Annotation_IDs:CG11023-RB;score_text=Weakly Supported;score=3
2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300690;Name=CG11023-RC;Parent=FBgn0031208;Dbxref=REFSEQ:NM_175941,FlyBase_Annotation_IDs:CG11023-RC;score_text=Moderately Supported;score=7
2L FlyBase exon 7529 8116 . + . ID=FBgn0031208:1;Name=CG11023:1;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
2L FlyBase CDS 7680 8116 . + 0 ID=CDS_FBgn0031208:1_763;Name=CG11023-cds;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
2L FlyBase exon 8193 9484 . + . ID=FBgn0031208:3;Name=CG11023:3;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase CDS 8193 8610 . + 2 ID=CDS_FBgn0031208:3_763;Name=CG11023-cds;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase CDS 8193 8589 . + 2 ID=CDS_FBgn0031208:2_763;Name=CG11023-cds;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase exon 8193 8589 . + . ID=FBgn0031208:2;Name=CG11023:2;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase exon 8668 9484 . + . ID=FBgn0031208:4;Name=CG11023:4;Parent=FBtr0300690;parent_type=mRNA

Which identifiers need to be changed? For example I know that "ID=" should be changed to gene_id or sometimes transcript_id. When the "Parent" identifier lists two or more separate transcripts do I take that CDS/exon and duplicate it to reflect the separate transcripts? Also when writing the gene field, do I leave the transcript_id and transcript_name blank?

Thanks
Abhijit

**Boel** · 12-13-2010, 01:20 PM

In my GTF files I only keep the rows with exon information, and every row has gene_id, transcript_id and exon_number. This is what I run Cufflinks on.

If you give Cufflinks a GTF where multiple isoforms (transcripts) are given, then Cufflinks will try to build those isoforms.
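For a FlyBase-style GFF like the one posted above, where the Parent of an exon can list several transcripts, one way to get one exon row per transcript is sketched below. This is only a sketch under assumptions: the input is tab-separated GFF3, each mRNA row has ID= and Parent= attributes, and exons are numbered by ascending start coordinate within each transcript (you may prefer a different order for minus-strand genes). The helper names parse_attrs and gff_exons_to_gtf are made up for this illustration and are not part of any tool.

```python
from collections import defaultdict


def parse_attrs(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for item in column9.strip().split(';'):
        if '=' in item:
            key, value = item.split('=', 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff_exons_to_gtf(gff_lines):
    """Return GTF exon lines, one per (exon, parent transcript) pair."""
    gene_of = {}               # transcript ID -> gene ID, taken from mRNA rows
    exons = defaultdict(list)  # transcript ID -> list of first-8-column lists
    for line in gff_lines:
        if not line.strip() or line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9:
            continue
        attrs = parse_attrs(fields[8])
        if fields[2] == 'mRNA' and 'ID' in attrs:
            gene_of[attrs['ID']] = attrs.get('Parent', attrs['ID'])
        elif fields[2] == 'exon':
            # A shared exon is duplicated once for every transcript in Parent.
            for transcript in attrs.get('Parent', '').split(','):
                if transcript:
                    exons[transcript].append(fields[:8])
    out = []
    for transcript, rows in exons.items():
        # Assumption: number exons by ascending start coordinate.
        rows = sorted(rows, key=lambda f: int(f[3]))
        # Fall back to the transcript ID if no mRNA row named its gene.
        gene = gene_of.get(transcript, transcript)
        for number, row in enumerate(rows, 1):
            attributes = 'gene_id "%s"; transcript_id "%s"; exon_number "%d";' % (
                gene, transcript, number)
            out.append('\t'.join(row + [attributes]))
    return out
```

With the CG11023 rows above, the exon at 7529-8116 (Parent=FBtr0300689,FBtr0300690) would come out twice, once under each transcript_id, both with gene_id "FBgn0031208".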

**gen2prot** · 12-13-2010, 01:23 PM

I looked at the GTF website, and it also has entries such as start_codon, stop_codon, 3UTR and 5UTR. Are you saying that these won't be necessary?

**gen2prot** · 12-13-2010, 01:28 PM

May I chat with you on this? I can use gmail. My address is [email protected]

Thanks
Abhijit

**gen2prot** · 12-14-2010, 08:46 AM

Hello,

I have an entry in the GFF format that I am converting to GTF.

2L FlyBase exon 38535 38731 . - . ID=FBgn0051973:13;Name=CG31973:13;Parent=FBtr0078163,FBtr0078164,FBtr0113415,FBtr0113416;parent_type=mRNA

2L FlyBase exon 38535 38731 . - . gene_id "FBgn0051973"; transcript_id "FBtr0078163" "FBtr0078164" "FBtr0113415" "FBtr0113416"; exon_number "13"; gene_name "CG31973";

Is this a valid conversion?

Thanks
Abhijit
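As far as I understand the GTF layout, each key in the 9th column takes a single quoted value, so a transcript_id followed by four IDs would not be read as four transcripts. A quick way to spot such rows is sketched below. It assumes the attributes are separated by semicolons and that no value contains a semicolon itself; multi_valued_keys is a hypothetical helper name, not something from Cufflinks.

```python
import re


def multi_valued_keys(column9):
    """Return the attribute keys that carry more than one quoted value."""
    bad = []
    for item in column9.strip().split(';'):
        item = item.strip()
        if not item:
            continue
        key = item.split(None, 1)[0]
        # Count the double-quoted values that follow the key.
        values = re.findall(r'"([^"]*)"', item)
        if len(values) > 1:
            bad.append(key)
    return bad
```

For the converted row above, this flags transcript_id, which suggests the exon should instead be written once per transcript, each row with its own single transcript_id.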

**seqguy** · 12-14-2010, 10:49 AM

Dear Abhijit,

Why do you want to convert GFF to GTF when you can easily download the GTF file itself from the UCSC Genome Browser for Drosophila?

**gen2prot** · 12-14-2010, 11:07 AM

The newest release is 5.32. The UCSC GTF files are from 2006. A lot of IDs have changed.

GFF to GTF