  • Something wrong in FlyBase's gtf (gff to gtf conversion)

    Hi All,

    I want to re-create FlyBase's gtf (FB2015_04) from the gff file; why I want to do that is a separate matter.

    I converted dmel-all-r6.07.gff to a gtf file with my own program and found a few genes/transcripts that do not match what I expected. For simplicity I give one gene as an example, but there are 23 such cases.

    These are the lines in the gff file for gene FBgn0031926 and transcript FBtr0335486, with a few irrelevant ones omitted:

    2L FlyBase CDS 7613405 7614199 . + 0 Parent=FBtr0079472,FBtr0335486
    2L FlyBase CDS 7614326 7614695 . + 0 Parent=FBtr0079472,FBtr0335486
    2L FlyBase CDS 7614843 7615444 . + 2 Parent=FBtr0335486
    2L FlyBase CDS 7615576 7615578 . + 0 Parent=FBtr0335486
    2L FlyBase three_prime_UTR 7615579 7615967 . + . Parent=FBtr0335486
    2L FlyBase three_prime_UTR 7616117 7616533 . + . Parent=FBtr0335486


    The start_codon and the first 3 CDS are not a problem. The problem comes when one tries to create the stop_codon. The last CDS (7615576-7615578) is just the stop codon, so the stop_codon becomes:
    2L FlyBase stop_codon 7615576 7615578

    The last CDS (7615576-7615578) then has to be deleted, as it is just the stop_codon. This is my output:

    2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7613405 7614199 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614326 7614695 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614843 7615444 . + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7615579 7615967 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7616117 7616533 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
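The stop_codon step described above can be sketched in Python. This is only an illustration under stated assumptions, not the converter used in this post or FlyBase's pipeline: the helper name `split_stop_codon` is hypothetical, coordinates are 1-based and inclusive, and the gff CDS features are assumed to include the stop codon of a complete coding sequence.

```python
def split_stop_codon(cds, strand="+"):
    """Split the last 3 coding bases off a transcript's CDS segments.

    cds: list of (start, end) tuples, 1-based inclusive, as in gff/gtf.
    Returns (trimmed_cds, stop_codon), both sorted by start coordinate.
    """
    # Walk from the 3' end of the coding sequence: the highest coordinates
    # on '+', the lowest coordinates on '-'.
    ordered = sorted(cds, reverse=(strand == "+"))
    remaining = 3
    kept, stop = [], []
    for start, end in ordered:
        if remaining == 0:
            kept.append((start, end))
            continue
        take = min(remaining, end - start + 1)
        if strand == "+":
            stop.append((end - take + 1, end))
            rest = (start, end - take)
        else:
            stop.append((start, start + take - 1))
            rest = (start + take, end)
        remaining -= take
        # Drop the segment entirely if the stop codon used it up.
        if rest[0] <= rest[1]:
            kept.append(rest)
    return sorted(kept), sorted(stop)


# CDS features of FBtr0335486 as listed in the gff excerpt.
gff_cds = [(7613405, 7614199), (7614326, 7614695),
           (7614843, 7615444), (7615576, 7615578)]
trimmed, stop = split_stop_codon(gff_cds, "+")
print(trimmed)  # [(7613405, 7614199), (7614326, 7614695), (7614843, 7615444)]
print(stop)     # [(7615576, 7615578)]
```

Because the loop takes bases segment by segment, a stop codon that is split across two exons would come back as two stop_codon segments instead of one.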


    As far as I can tell, everything is as it should be. However, the FlyBase gtf file for this transcript has the following:

    2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7613405 7614199 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614326 7614695 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614843 7615444 7 + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7615575 7615575 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7615579 7615967 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7616117 7616533 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";


    Look at the last CDS (7615575-7615575): it is a single base from the intronic region. Either I am misreading the specification for GTF files (http://mblab.wustl.edu/GTF22.html) or FlyBase generates the file differently from how it should be done.
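One way to flag this kind of discrepancy automatically is to check that every CDS in the gtf lies inside some CDS of the gff, and that the coding length is a multiple of three. The sketch below is illustrative only; the helper names `cds_outside_source` and `coding_length` are hypothetical and the coordinates are copied from the excerpts above.

```python
def cds_outside_source(gtf_cds, gff_cds):
    """Return gtf CDS segments not contained in any gff CDS segment."""
    return [(s, e) for s, e in gtf_cds
            if not any(gs <= s and e <= ge for gs, ge in gff_cds)]


def coding_length(segments):
    """Total number of bases covered by 1-based inclusive segments."""
    return sum(e - s + 1 for s, e in segments)


gff_cds = [(7613405, 7614199), (7614326, 7614695),
           (7614843, 7615444), (7615576, 7615578)]
flybase_gtf_cds = [(7613405, 7614199), (7614326, 7614695),
                   (7614843, 7615444), (7615575, 7615575)]

print(cds_outside_source(flybase_gtf_cds, gff_cds))  # [(7615575, 7615575)]
print(coding_length(gff_cds) % 3)          # 0 (1770 bases including the stop codon)
print(coding_length(flybase_gtf_cds) % 3)  # 1 (1768 bases, not a whole number of codons)
```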

    I also looked at Ensembl's GTF file. There the stop_codon is removed completely, the last CDS is removed as well, and the 3UTR starts where the stop_codon should start. Ensembl's gtf is therefore also a bit suspicious, as there is no stop_codon for this gene or for the other 22 cases.

    I also looked at UCSC's gtf (dm3), downloaded from tophat, and there the stop_codon matches what I calculate.

    My question: is this an error by FlyBase/Ensembl, and how should it be done correctly?

    Many thanks indeed for any insight into this one.

  • #2
    Additional frame inconsistencies

    Unfortunately, no one has suggested a reasonable explanation for my previous problem.

    In addition, I found a few inconsistencies in the frame, i.e. column 8 (counting from 1).

    For gene FBgn0033313 and transcript FBtr0305081, something is not quite right with the frame (column 8) of the start_codons.
    The gff lists the following first few CDS for this gene and transcript:

    2R FlyBase CDS 8616078 8616078 . + 0 Parent=FBtr0305081
    2R FlyBase CDS 8616327 8616516 . + 2 Parent=FBtr0310448,FBtr0310449,FBtr0305081
    2R FlyBase CDS 8616700 8618171 . + 1 Parent=FBtr0290112,FBtr0301363,FBtr0310448,FBtr0310449,FBtr0305080,FBtr0305081,FBtr0305082
    2R FlyBase CDS 8618234 8618461 . + 2 Parent=FBtr0290112,FBtr0301363,FBtr0310448,FBtr0310449,FBtr0305080,FBtr0305081,FBtr0305082


    I parsed to:

    2R FlyBase start_codon 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase start_codon 8616327 8616328 . + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616327 8616516 . + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
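The split start_codon above follows from taking the first three coding bases across CDS segments. A minimal sketch, assuming 1-based inclusive coordinates and a coding sequence that is complete at its 5' end; the helper name `start_codon_segments` is hypothetical:

```python
def start_codon_segments(cds, strand="+"):
    """Return the segments covering the first 3 coding bases of a transcript."""
    # Walk from the 5' end of the coding sequence: the lowest coordinates
    # on '+', the highest coordinates on '-'.
    ordered = sorted(cds, reverse=(strand == "-"))
    remaining = 3
    codon = []
    for start, end in ordered:
        if remaining == 0:
            break
        take = min(remaining, end - start + 1)
        if strand == "+":
            codon.append((start, start + take - 1))
        else:
            codon.append((end - take + 1, end))
        remaining -= take
    return sorted(codon)


# First CDS features of FBtr0305081 from the gff excerpt.
cds = [(8616078, 8616078), (8616327, 8616516),
       (8616700, 8618171), (8618234, 8618461)]
print(start_codon_segments(cds, "+"))  # [(8616078, 8616078), (8616327, 8616328)]
```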


    However, FlyBase's gtf gives a different frame for the second start_codon:

    2R FlyBase start_codon 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase start_codon 8616327 8616328 . + 1 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616078 8616078 15 + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616327 8616516 15 + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";


    Note that the frame is 1 for start_codon 8616327-8616328. This feature holds the remaining two bases of the start codon, so according to the gtf2.2 guidelines the frame should be 2, i.e. two bases precede the start of the next whole codon. This is not the only case of such mis-framing; I count quite a few.

    I checked Ensembl's gtf and there the frame is 2, as in my output. Do you think I should contact FlyBase to inquire about these?
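The frame rule can be checked with a few lines of Python. This is a sketch under the assumption that the first segment begins on a codon boundary and that coordinates are 1-based and inclusive; the helper name `gtf_frames` is hypothetical. Each segment's frame is the number of bases that must be skipped to reach the next whole codon.

```python
def gtf_frames(segments, strand="+"):
    """Frame (column 8) for each segment, listed in 5'-to-3' order."""
    # 5'-to-3' order is ascending on '+' and descending on '-'.
    ordered = sorted(segments, reverse=(strand == "-"))
    frames = []
    consumed = 0  # coding bases seen in earlier segments
    for start, end in ordered:
        frames.append((3 - consumed % 3) % 3)
        consumed += end - start + 1
    return frames


# The split start_codon of FBtr0305081: 1 base, then 2 bases.
print(gtf_frames([(8616078, 8616078), (8616327, 8616328)]))  # [0, 2]

# The first four CDS from the gff excerpt; the gff lists phases 0, 2, 1, 2.
print(gtf_frames([(8616078, 8616078), (8616327, 8616516),
                  (8616700, 8618171), (8618234, 8618461)]))  # [0, 2, 1, 2]
```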

    Many thanks indeed for any help.



    • #3
      Originally posted by saskak
      I checked this in Ensembl's gtf and this appears to be 2 as I parsed it. Do you think I should contact FlyBase to inquire about these?
      Yes. They will know their dataset better than most of us on SeqAnswers. If there is a problem then they will appreciate knowing about it.



      • #4
        Solved

        I contacted FlyBase, and it turned out they had one or more bugs in their annotation pipeline. It should be fixed in the 6.08 gtf file.



        • #5
          Thanks for the follow up and getting this corrected!
