  • GFF to GTF

    Hello All,

    I am looking to convert a Drosophila GFF file to GTF format, but I have not found a good program to do this. The one script available uses BioPerl, which is not installed on my computer, and it seems to have bugs according to its users. However, the GFF to GTF conversion appears to be straightforward. I would like to know which fields must be added to the attributes column and which features are required for TopHat and Cufflinks to work with the GTF. Any input will be appreciated.

    Thanks
    Abhijit

  • #2
    Hi Abhijit,

    It depends a bit on how you want your GTF file to look. Below is a simple example that adds the attributes required in a GTF file.

    Code:
    import sys

    # For GFF files with only one isoform per gene: the transcript name is
    # reused as both gene_id and transcript_id.

    def transcript_name(attributes):
        # Column 9 is assumed to look like "mRNA NAME;" or "exon NAME;".
        # Remove the keyword as a prefix; str.strip('mRNA ') would instead
        # strip any of those characters from both ends of the name.
        name = attributes.strip().rstrip(';')
        for prefix in ('mRNA ', 'exon '):
            if name.startswith(prefix):
                name = name[len(prefix):]
        return name.strip()

    with open(sys.argv[1]) as infile, open(sys.argv[2], 'w') as outfile:
        transcript = ''
        n_exon = 0
        for line in infile:
            # Only sequence names starting with 'chr' are handled here.
            if not line.startswith('chr'):
                continue
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 9 or fields[2] != 'exon':  # only take lines for exons
                continue
            name = transcript_name(fields[8])
            if name == transcript:  # another exon of the same gene
                n_exon += 1
            else:  # new gene
                transcript = name
                n_exon = 1
            attributes = 'gene_id "%s"; transcript_id "%s"; exon_number "%d";' % (transcript, transcript, n_exon)
            outfile.write('\t'.join(fields[:8] + [attributes]) + '\n')
    Additionally, if you have several isoforms per gene (which is usually the case), you might want to add the gene annotation as well (not included in GFF files). The above example is how I did it in one specific case.



    • #3
      Hi Boel,

      Thanks for the reply. So which features are essential to extract? From your post and others on the net, I gather that the following features are essential:

      gene
      CDS
      exon
      mRNA

      In addition, the 9th column should have gene_id, transcript_id, and exon_number. Is that OK? Moreover, I am confused as to how Cufflinks will report gene expression given this GTF file. Will it report it per transcript or per gene? Do you have any idea or experience?

      Thanks
      Abhijit



      • #4
        Hi Boel,

        I extracted the "gene", "mRNA", "exon", and "CDS" features from a GFF file for Drosophila. Here is what it looks like:


        2L FlyBase gene 7529 9484 . + . ID=FBgn0031208;Name=CG11023;Ontology_term=SO:0000010,SO:0000087,GO:0008234,GO:0006508;Dbxref=FlyBase:FBan0011023,FlyBase_Annotation_IDs:CG11023,GB_protein:ACZ94128,GB_protein:AAO41164,GB:AI944728,GB:AJ564667,GB_protein:CAD92822,GB:BF495604,UniProt/TrEMBL:Q6KEV3,UniProt/TrEMBL:Q86BM6,INTERPRO:IPR003653,EntrezGene:33155,BIOGRID:59420,FlyAtlas:CG11023-RA,GenomeRNAi_gene:33155;gbunit=AE014134;derived_computed_cyto=21A5-21A5
        2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=REFSEQ:NM_001169365,FlyBase_Annotation_IDs:CG11023-RB;score_text=Weakly Supported;score=3
        2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300690;Name=CG11023-RC;Parent=FBgn0031208;Dbxref=REFSEQ:NM_175941,FlyBase_Annotation_IDs:CG11023-RC;score_text=Moderately Supported;score=7
        2L FlyBase exon 7529 8116 . + . ID=FBgn0031208:1;Name=CG11023:1;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
        2L FlyBase CDS 7680 8116 . + 0 ID=CDS_FBgn0031208:1_763;Name=CG11023-cds;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
        2L FlyBase exon 8193 9484 . + . ID=FBgn0031208:3;Name=CG11023:3;Parent=FBtr0300689;parent_type=mRNA
        2L FlyBase CDS 8193 8610 . + 2 ID=CDS_FBgn0031208:3_763;Name=CG11023-cds;Parent=FBtr0300689;parent_type=mRNA
        2L FlyBase CDS 8193 8589 . + 2 ID=CDS_FBgn0031208:2_763;Name=CG11023-cds;Parent=FBtr0300690;parent_type=mRNA
        2L FlyBase exon 8193 8589 . + . ID=FBgn0031208:2;Name=CG11023:2;Parent=FBtr0300690;parent_type=mRNA
        2L FlyBase exon 8668 9484 . + . ID=FBgn0031208:4;Name=CG11023:4;Parent=FBtr0300690;parent_type=mRNA

        Which identifiers need to be changed? For example, I know that "ID=" should be changed to gene_id or sometimes transcript_id. When the "Parent" identifier lists two or more separate transcripts, do I take that CDS/exon and duplicate it for each transcript? Also, when writing the gene line, do I leave transcript_id and transcript_name blank?

        Thanks
        Abhijit



        • #5
          In my GTF files I only keep the rows with exon information, and every row has gene_id, transcript_id and exon_number. This is what I run Cufflinks on.

          If you give Cufflinks a GTF where multiple isoforms (transcripts) are given, then Cufflinks will try to build those isoforms.
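For the FlyBase-style GFF shown earlier in the thread, an exon-only GTF like the one described above could be built roughly as follows. This is only a sketch under some assumptions: the columns are tab-separated, mRNA lines carry ID= and Parent= (the gene), and exon lines list their transcripts in Parent=. The function names are made up for this example, and exon_number simply counts exons in file order per transcript, which may not be the numbering you want on the minus strand.

```python
from collections import defaultdict

def parse_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(';'):
        if '=' in item:
            key, value = item.split('=', 1)
            attrs[key.strip()] = value.strip()
    return attrs

def gff3_exons_to_gtf(lines):
    """Return exon-only GTF lines, one per (exon, parent transcript) pair."""
    rows = [l.rstrip('\n').split('\t') for l in lines
            if l.strip() and not l.startswith('#')]
    rows = [r for r in rows if len(r) == 9]

    # Pass 1: remember which gene each transcript belongs to.
    gene_of = {}
    for r in rows:
        if r[2] == 'mRNA':
            attrs = parse_attributes(r[8])
            if 'ID' in attrs:
                # Assumption: an mRNA has one parent gene; take the first.
                gene_of[attrs['ID']] = attrs.get('Parent', attrs['ID']).split(',')[0]

    # Pass 2: write each exon once for every transcript it belongs to.
    exon_count = defaultdict(int)
    out = []
    for r in rows:
        if r[2] != 'exon':
            continue
        parents = parse_attributes(r[8]).get('Parent', '')
        for tid in parents.split(','):
            if tid not in gene_of:
                continue  # parent is not an mRNA seen in pass 1
            exon_count[tid] += 1
            attributes = 'gene_id "%s"; transcript_id "%s"; exon_number "%d";' % (
                gene_of[tid], tid, exon_count[tid])
            out.append('\t'.join(r[:8] + [attributes]))
    return out
```

Calling gff3_exons_to_gtf(open('input.gff')) and writing the returned lines to a file (the file name is a placeholder) gives one GTF row per exon per transcript, so an exon shared by two isoforms appears twice.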



          • #6
            I looked at the GTF website, and it also lists entries such as start_codon, stop_codon, 3UTR and 5UTR. Are you saying that these won't be necessary?



            • #7
              May I chat with you on this? I can use gmail. My address is [email protected]

              Thanks
              Abhijit



              • #8
                Hello,

                I have an entry in the GFF format that I am converting to GTF.

                2L FlyBase exon 38535 38731 . - . ID=FBgn0051973:13;Name=CG31973:13;Parent=FBtr0078163,FBtr0078164,FBtr0113415,FBtr0113416;parent_type=mRNA

                2L FlyBase exon 38535 38731 . - . gene_id "FBgn0051973"; transcript_id "FBtr0078163" "FBtr0078164" "FBtr0113415" "FBtr0113416"; exon_number "13"; gene_name "CG31973";

                Is this a valid conversion?

                Thanks
                Abhijit
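As far as I understand the GTF format, each tag in the attribute column takes a single value, so a line listing four transcript_id values is unlikely to be read as intended. The usual way around this is to repeat the exon line once per transcript. A rough check for this kind of problem might look like the sketch below. The function name and the list of required tags are assumptions for illustration, and the check does not handle values that themselves contain a semicolon.

```python
import re

# Assumption: these are the tags Cufflinks-style tools need on every line.
REQUIRED_TAGS = ('gene_id', 'transcript_id')

def gtf_attribute_problems(column9):
    """Return a list of problems found in one GTF attribute column."""
    problems = []
    seen = set()
    for item in column9.strip().split(';'):
        item = item.strip()
        if not item:
            continue
        tag, _, rest = item.partition(' ')
        # Prefer quoted values; fall back to bare tokens such as: exon_number 1
        values = re.findall(r'"[^"]*"', rest) or rest.split()
        if len(values) != 1:
            problems.append('%s has %d values, expected 1' % (tag, len(values)))
        seen.add(tag)
    for tag in REQUIRED_TAGS:
        if tag not in seen:
            problems.append('%s is missing' % tag)
    return problems
```

Run on the attribute column of the converted line above, this reports that transcript_id has 4 values. The same column with a single transcript_id per line passes.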



                • #9
                  Dear Abhijit,

                  Why do you want to convert GFF to GTF when you can easily download the GTF file itself from the UCSC Genome Browser for Drosophila?



                  • #10
                    The newest release is 5.32. The UCSC GTF files are from 2006. A lot of IDs have changed.
