  • Something wrong in FlyBase's gtf (gff to gtf conversion)

    Hi All,

    I wanted to re-create FlyBase's gtf (FB2015_04) from the gff file. It is a different matter why I want to do that.

    So I parsed the dmel-all-r6.07.gff to a gtf file using my own program. I found a few genes/transcripts that are not what I expected. Bear with me on that one. For the sake of simplicity I am giving an example for one gene, but there are 23 such cases.

    In the gff file, gene FBgn0031926 and transcript FBtr0335486 have the following lines (a few irrelevant ones are excluded):

    2L FlyBase CDS 7613405 7614199 . + 0 Parent=FBtr0079472,FBtr0335486
    2L FlyBase CDS 7614326 7614695 . + 0 Parent=FBtr0079472,FBtr0335486
    2L FlyBase CDS 7614843 7615444 . + 2 Parent=FBtr0335486
    2L FlyBase CDS 7615576 7615578 . + 0 Parent=FBtr0335486
    2L FlyBase three_prime_UTR 7615579 7615967 . + . Parent=FBtr0335486
    2L FlyBase three_prime_UTR 7616117 7616533 . + . Parent=FBtr0335486


    The start_codon and the first three CDS are not a problem. The problem comes when one tries to create the stop_codon. The last CDS (7615576-7615578) is exactly three bases long, so it is just the stop codon. From that, the stop_codon becomes:
    2L FlyBase stop_codon 7615576 7615578

    Then one has to delete the last CDS (7615576-7615578), because GTF 2.2 does not include the stop codon in the CDS. This is how I parse it:

    2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7613405 7614199 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614326 7614695 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614843 7615444 . + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7615579 7615967 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7616117 7616533 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
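
    A minimal Python sketch of this step, not the actual program used above. The function name `split_stop_codon` is made up for illustration. It assumes 1-based inclusive coordinates and that the gff CDS segments include the stop codon:

```python
def split_stop_codon(cds, strand="+"):
    """Split the last 3 coding bases (the stop codon) off gff-style CDS segments.

    cds: list of (start, end) tuples, 1-based inclusive, sorted by start.
    Returns (trimmed_cds, stop_codon_segments), both sorted by start.
    """
    # Walk the segments starting from the 3' end of the coding sequence.
    segments = list(cds) if strand == "-" else list(reversed(cds))
    remaining = 3  # stop codon bases still to be assigned
    kept, stop = [], []
    for start, end in segments:
        if remaining == 0:
            kept.append((start, end))
            continue
        take = min(remaining, end - start + 1)
        remaining -= take
        if strand == "+":
            stop.append((end - take + 1, end))
            if start <= end - take:  # something is left of this segment
                kept.append((start, end - take))
        else:
            stop.append((start, start + take - 1))
            if start + take <= end:
                kept.append((start + take, end))
    return sorted(kept), sorted(stop)


# CDS lines of FBtr0335486 from the gff above
gff_cds = [(7613405, 7614199), (7614326, 7614695),
           (7614843, 7615444), (7615576, 7615578)]
cds, stop = split_stop_codon(gff_cds, "+")
print(cds)   # the first three CDS segments, unchanged
print(stop)  # [(7615576, 7615578)]
```

    Because the last gff CDS is exactly three bases long, it is consumed entirely by the stop codon and no CDS line is emitted for it. A stop codon split across two segments is handled by the same loop.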


    Everything in my output is as it should be. Nevertheless, the FlyBase gtf file for this transcript has the following:

    2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7613405 7614199 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614326 7614695 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614843 7615444 7 + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7615575 7615575 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7615579 7615967 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7616117 7616533 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";


    Look at the last CDS (7615575-7615575): it is a single base from the intronic region, lying between the end of the third CDS (7615444) and the start of the stop_codon (7615576). Either I am misreading the GTF specification (http://mblab.wustl.edu/GTF22.html) or FlyBase generates this differently from how it should be done.
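
    Such a stray base can be flagged mechanically. The sketch below is only an illustration (the function name `outside_gff_cds` is hypothetical). It assumes that every CDS interval in a gtf should lie inside some CDS interval of the gff it was derived from:

```python
def outside_gff_cds(gtf_cds, gff_cds):
    """Return the gtf CDS intervals not contained in any gff CDS interval.

    Both arguments are lists of (start, end) tuples, 1-based inclusive.
    """
    return [
        (s, e)
        for s, e in gtf_cds
        if not any(gs <= s and e <= ge for gs, ge in gff_cds)
    ]


# FBtr0335486: CDS intervals from the gff and from the FlyBase gtf above
gff_cds = [(7613405, 7614199), (7614326, 7614695),
           (7614843, 7615444), (7615576, 7615578)]
flybase_gtf_cds = [(7613405, 7614199), (7614326, 7614695),
                   (7614843, 7615444), (7615575, 7615575)]
print(outside_gff_cds(flybase_gtf_cds, gff_cds))  # [(7615575, 7615575)]
```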

    I also looked at Ensembl's GTF file. There the stop_codon is removed completely, the last CDS is removed as well, and the 3UTR starts where the stop_codon should start. Ensembl's gtf is therefore also a bit suspicious, as there is no stop_codon for that particular gene or for the other 22 cases.

    I also looked at UCSC's gtf (dm3), downloaded from tophat, and there the stop_codon matches what I calculate.

    My question is: is this an error by FlyBase/Ensembl, and how should this be done correctly?

    Many thanks indeed for any insight into this one.

  • #2
    Additional frame inconsistencies

    Unfortunately no one has suggested a reasonable explanation for my previous problem.

    In addition, I found a few frame inconsistencies, i.e. in column 8 (counting from 1).

    For gene FBgn0033313 and transcript FBtr0305081, something is not quite right with the frame (column 8) of the start_codon lines.
    The first few CDS in the gff for this gene and transcript read:

    2R FlyBase CDS 8616078 8616078 . + 0 Parent=FBtr0305081
    2R FlyBase CDS 8616327 8616516 . + 2 Parent=FBtr0310448,FBtr0310449,FBtr0305081
    2R FlyBase CDS 8616700 8618171 . + 1 Parent=FBtr0290112,FBtr0301363,FBtr0310448,FBtr0310449,FBtr0305080,FBtr0305081,FBtr0305082
    2R FlyBase CDS 8618234 8618461 . + 2 Parent=FBtr0290112,FBtr0301363,FBtr0310448,FBtr0310449,FBtr0305080,FBtr0305081,FBtr0305082


    I parsed to:

    2R FlyBase start_codon 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase start_codon 8616327 8616328 . + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616327 8616516 . + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";


    Nevertheless, in FlyBase's gtf the second start_codon has a different frame:

    2R FlyBase start_codon 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase start_codon 8616327 8616328 . + 1 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616078 8616078 15 + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616327 8616516 15 + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";


    Note that the frame is 1 for start_codon 8616327-8616328. This start_codon segment holds the second and third bases of the split codon, so according to the GTF 2.2 guidelines its frame should be 2, i.e. there are two extra bases before the next whole codon begins. This is not the only case of such mis-framing; I count quite a few.
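
    A minimal sketch of how I read the frame rule, not FlyBase's or Ensembl's code. The function name `frames` is made up for illustration. It assumes the segments of one feature (CDS or a split start_codon) are given as 1-based inclusive intervals and that translation starts at the first base of the 5'-most segment:

```python
def frames(segments, strand="+"):
    """Return the frame (column 8) for each (start, end) segment, in input order."""
    # Process segments in 5' -> 3' order: ascending on "+", descending on "-".
    order = sorted(segments, reverse=(strand == "-"))
    consumed = 0  # coding bases seen before the current segment
    result = {}
    for start, end in order:
        # Number of bases to skip before the next whole codon starts.
        result[(start, end)] = (3 - consumed % 3) % 3
        consumed += end - start + 1
    return [result[seg] for seg in segments]


# Split start_codon of FBtr0305081
print(frames([(8616078, 8616078), (8616327, 8616328)]))  # [0, 2]

# First four CDS of FBtr0305081 from the gff above
print(frames([(8616078, 8616078), (8616327, 8616516),
              (8616700, 8618171), (8618234, 8618461)]))  # [0, 2, 1, 2]
```

    The second call reproduces the gff phases 0, 2, 1, 2 listed above, and the first gives frame 2 for the two-base start_codon segment rather than the 1 found in FlyBase's gtf.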

    I checked this in Ensembl's gtf and there the frame is 2, as I parsed it. Do you think I should contact FlyBase to inquire about these?

    Many thanks indeed for any help.



    • #3
      Originally posted by saskak
      I checked this in Ensembl's gtf and this appears to be 2 as I parsed it. Do you think I should contact FlyBase to inquire about these?
      Yes. They will know their dataset better than most of us on SeqAnswers. If there is a problem then they will appreciate knowing about it.



      • #4
        Solved

        I contacted FlyBase and it turned out they had one or more bugs in their annotation pipeline. This should be fixed in the 6.08 gtf file.



        • #5
          Thanks for the follow-up and for getting this corrected!
