  • gen2prot
    Member
    • Apr 2010
    • 68

    GFF to GTF

    Hello All,

    I am looking to convert a Drosophila GFF file to the GTF format, but I have not found a good program to do this. The one available script uses BioPerl, which is not installed on my computer, and according to its users it has bugs. However, the GFF-to-GTF conversion appears to be straightforward. I would like to know which fields must be added to the attributes column and which features are required for TopHat and Cufflinks to work with the GTF. Any input will be appreciated.

    Thanks
    Abhijit
  • Boel
    Member
    • Oct 2009
    • 62

    #2
    Hi Abhijit,

    It depends a bit on how you want your GTF file to look. Below is a simple example that adds the attributes required in a GTF file.

    Code:
    import sys
    
    # For GFF files with only one isoform per gene.
    # Assumes column 9 looks like "mRNA <name>;" or "exon <name>;".
    
    def transcript_name(attributes):
        # Drop the trailing ';' and a leading "mRNA " or "exon " label.
        # (str.strip('mRNA ') removes a set of characters, not a prefix,
        # so it could eat the start or end of the name itself.)
        name = attributes.strip().rstrip(';')
        for prefix in ('mRNA ', 'exon '):
            if name.startswith(prefix):
                name = name[len(prefix):]
        return name
    
    transcript = ''
    n_exon = 0
    
    with open(sys.argv[1]) as infile, open(sys.argv[2], 'w') as outfile:
        for line in infile:
            if not line.startswith('chr'):
                continue
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 9 or fields[2] != 'exon':  # only take lines for exons
                continue
            name = transcript_name(fields[8])
            if name == transcript:  # another exon of the same gene
                n_exon += 1
            else:  # new gene
                transcript = name
                n_exon = 1
            attributes = 'gene_id "%s"; transcript_id "%s"; exon_number "%d";' % (transcript, transcript, n_exon)
            outfile.write('\t'.join(fields[:8] + [attributes]) + '\n')
    Additionally, if you have several isoforms per gene (which is usually the case), you might want to add the gene annotation as well (it is not included in GFF files). The example above is how I did it in one specific case.
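With several isoforms per gene, a single running exon counter is no longer enough, because exons of different transcripts are interleaved in the file. A minimal sketch of per-transcript numbering follows. The function name `number_exons` and the tuple layout are hypothetical, and numbering by ascending start coordinate is an assumption (some annotations number minus-strand exons in reverse order).

```python
from collections import defaultdict


def number_exons(exons):
    """Yield (transcript_id, start, end, exon_number) tuples.

    exons: iterable of (transcript_id, start, end) tuples.
    Assumption: exons are numbered by ascending start coordinate
    within each transcript.
    """
    by_transcript = defaultdict(list)
    for transcript_id, start, end in exons:
        by_transcript[transcript_id].append((start, end))
    for transcript_id, spans in by_transcript.items():
        # sorted() orders the (start, end) pairs by start, then end
        for number, (start, end) in enumerate(sorted(spans), start=1):
            yield transcript_id, start, end, number
```

Each yielded tuple carries what is needed to write the exon_number attribute for one GTF exon line.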


    • gen2prot
      Member
      • Apr 2010
      • 68

      #3
      Hi Boel,

      Thanks for the reply. So which features are essential to extract? From your post and others on the net, I gather that the following features are essential:

      gene
      CDS
      exon
      mRNA

      In addition, the 9th column should have gene_id, transcript_id, and exon_number. Is that OK? Moreover, I am confused about how Cufflinks will report gene expression given this GTF file. Will it report expression per transcript or per gene? Do you have any idea or experience?

      Thanks
      Abhijit


      • gen2prot
        Member
        • Apr 2010
        • 68

        #4
        Hi Boel,

        I extracted the "gene", "mRNA", "exon", and "CDS" features from a GFF file for Drosophila. Here is what it looks like:


        2L FlyBase gene 7529 9484 . + . ID=FBgn0031208;Name=CG11023;Ontology_term=SO:0000010,SO:0000087,GO:0008234,GO:0006508;Dbxref=FlyBase:FBan0011023,FlyBase_Annotation_IDs:CG11023,GB_protein:ACZ94128,GB_protein:AAO41164,GB:AI944728,GB:AJ564667,GB_protein:CAD92822,GB:BF495604,UniProt/TrEMBL:Q6KEV3,UniProt/TrEMBL:Q86BM6,INTERPRO:IPR003653,EntrezGene:33155,BIOGRID:59420,FlyAtlas:CG11023-RA,GenomeRNAi_gene:33155;gbunit=AE014134;derived_computed_cyto=21A5-21A5
        2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=REFSEQ:NM_001169365,FlyBase_Annotation_IDs:CG11023-RB;score_text=Weakly Supported;score=3
        2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300690;Name=CG11023-RC;Parent=FBgn0031208;Dbxref=REFSEQ:NM_175941,FlyBase_Annotation_IDs:CG11023-RC;score_text=Moderately Supported;score=7
        2L FlyBase exon 7529 8116 . + . ID=FBgn0031208:1;Name=CG11023:1;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
        2L FlyBase CDS 7680 8116 . + 0 ID=CDS_FBgn0031208:1_763;Name=CG11023-cds;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
        2L FlyBase exon 8193 9484 . + . ID=FBgn0031208:3;Name=CG11023:3;Parent=FBtr0300689;parent_type=mRNA
        2L FlyBase CDS 8193 8610 . + 2 ID=CDS_FBgn0031208:3_763;Name=CG11023-cds;Parent=FBtr0300689;parent_type=mRNA
        2L FlyBase CDS 8193 8589 . + 2 ID=CDS_FBgn0031208:2_763;Name=CG11023-cds;Parent=FBtr0300690;parent_type=mRNA
        2L FlyBase exon 8193 8589 . + . ID=FBgn0031208:2;Name=CG11023:2;Parent=FBtr0300690;parent_type=mRNA
        2L FlyBase exon 8668 9484 . + . ID=FBgn0031208:4;Name=CG11023:4;Parent=FBtr0300690;parent_type=mRNA

        Which identifiers need to be changed? For example, I know that "ID=" should be changed to gene_id or sometimes transcript_id. When the "Parent" attribute lists two or more transcripts, do I duplicate that CDS/exon line so that each transcript gets its own copy? Also, when writing the gene line, do I leave transcript_id and transcript_name blank?

        Thanks
        Abhijit
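One way to see which identifiers map where: in an excerpt like the one above, the ID of each mRNA line is a natural transcript_id and its Parent is a natural gene_id. The sketch below builds that lookup. The function names are hypothetical, and it assumes tab-separated columns and exactly one Parent per mRNA line.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def transcript_to_gene(gff_lines):
    """Map each mRNA ID (transcript) to its Parent ID (gene).

    Assumption: every mRNA line has exactly one Parent.
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == "mRNA":
            attrs = parse_gff3_attributes(fields[8])
            mapping[attrs["ID"]] = attrs["Parent"]
    return mapping
```

With this mapping in hand, an exon line only needs its Parent transcript IDs to recover both transcript_id and gene_id.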


        • Boel
          Member
          • Oct 2009
          • 62

          #5
          In my GTF files I only keep the rows with exon information, and every row has gene_id, transcript_id and exon_number. This is what I run Cufflinks on.

          If you give Cufflinks a GTF where multiple isoforms (transcripts) are given, then Cufflinks will try to build those isoforms.
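A quick way to confirm that every exon row carries those three attributes is to parse the 9th column and list whatever is missing. This is a minimal sketch. The helper names are hypothetical, and it assumes the usual `key "value";` attribute syntax with tab-separated columns.

```python
import re

# Assumption: attributes follow the form  key "value"; key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

# The three attributes named in this thread.
REQUIRED = ("gene_id", "transcript_id", "exon_number")


def parse_gtf_attributes(column9):
    """Return the key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTR_RE.findall(column9))


def missing_attributes(gtf_line, required=REQUIRED):
    """List the required attribute keys that a GTF line lacks."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    attrs = parse_gtf_attributes(fields[8])
    return [key for key in required if key not in attrs]
```

Running `missing_attributes` over each line of a converted file and reporting any non-empty result catches malformed rows before the file is handed to Cufflinks.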


          • gen2prot
            Member
            • Apr 2010
            • 68

            #6
            I looked at the GTF website, and the format also has entries such as start_codon, stop_codon, 3UTR, and 5UTR. Are you saying that these won't be necessary?


            • gen2prot
              Member
              • Apr 2010
              • 68

              #7
              May I chat with you on this? I can use gmail. My address is [email protected]

              Thanks
              Abhijit


              • gen2prot
                Member
                • Apr 2010
                • 68

                #8
                Hello,

                I have an entry in the GFF format that I am converting to GTF.

                2L FlyBase exon 38535 38731 . - . ID=FBgn0051973:13;Name=CG31973:13;Parent=FBtr0078163,FBtr0078164,FBtr0113415,FBtr0113416;parent_type=mRNA

                2L FlyBase exon 38535 38731 . - . gene_id "FBgn0051973"; transcript_id "FBtr0078163" "FBtr0078164" "FBtr0113415" "FBtr0113416"; exon_number "13"; gene_name "CG31973";

                Is this a valid conversion?

                Thanks
                Abhijit
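As far as I know, GTF expects a single transcript_id value per line, so the usual approach for an exon shared by several transcripts is to write the line once per parent instead of listing four IDs in one attribute. A minimal sketch under that assumption follows. `expand_exon` and the `gene_of` lookup are hypothetical names, columns are assumed to be tab-separated, and exon_number is omitted because it depends on the other exons of each transcript.

```python
def expand_exon(gff_line, gene_of):
    """Return one GTF line per parent transcript of a GFF3 exon line.

    gene_of: dict mapping transcript ID -> gene ID (hypothetical input;
    it could be built from the mRNA lines of the same file).
    """
    fields = gff_line.rstrip("\n").split("\t")
    attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
    gtf_lines = []
    for transcript in attrs["Parent"].split(","):
        column9 = 'gene_id "%s"; transcript_id "%s";' % (gene_of[transcript], transcript)
        gtf_lines.append("\t".join(fields[:8] + [column9]))
    return gtf_lines


# Example with the exon line from this post; the gene ID is the one used above.
line = ("2L\tFlyBase\texon\t38535\t38731\t.\t-\t.\t"
        "ID=FBgn0051973:13;Name=CG31973:13;"
        "Parent=FBtr0078163,FBtr0078164,FBtr0113415,FBtr0113416;parent_type=mRNA")
gene_of = {t: "FBgn0051973" for t in
           ("FBtr0078163", "FBtr0078164", "FBtr0113415", "FBtr0113416")}
for row in expand_exon(line, gene_of):
    print(row)
```

The example prints four exon lines that differ only in transcript_id.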


                • seqguy
                  Junior Member
                  • Oct 2010
                  • 8

                  #9
                  Dear Abhijit,

                  Why do you want to convert GFF to GTF when you can easily download the GTF file itself from the UCSC Genome Browser for Drosophila?


                  • gen2prot
                    Member
                    • Apr 2010
                    • 68

                    #10
                    The newest release is 5.32. The UCSC GTF files are from 2006. A lot of IDs have changed.
