Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Augustus gene prediction CDS problem

    I am using Augustus to predict genes. It works well most of the time. However I am dealing with a problem of translation of the extracted codingsequences. I use getAnnoFasta.pl to extract the codingseq and the aminoacids also. If I am not wrong, the codingSeqs should translate exactly the aminoacids sequences. This is not the case in several instances. When I checked my coding seq and translate them many results in other protein totally different form the file containing the aa sequences. Is this a common problem?. Is this best way to extract sequences?. I tried to use the GFF and loaded in artemis to edit error and I noticed that some features lost the frame. Maybe I am missing something.

    Thanks


  • #2
    Could you provide (a) the command you used, (b) the version number of your Augustus installation and (c) a reproducible example where the translation fails? I have used Augustus some times but have never experienced anything like that.

    Comment


    • #3
      I am using augustus 3.0.3.
      It happens with the default prediction ( partial: allow prediction of incomplete genes at the sequence boundaries (default))

      I checked the gff file and only the incomplete genes have this problem.

      I used this command
      ./augustus --species=magnaporthe_grisea --codingseq=1 --protein=1 --exonnames=1 G_citricarpa_assembly_Scf4.fasta > Gcitri.gff

      Then

      perl getAnnoFasta.pl --seqfile=G_citricarpa_assembly_Scf4.fasta Gcitri.gff

      to obtain CodingSeq, and aa



      For example in my file: Gcitri.gff
      I have an "incomplete call"

      # ----- prediction on sequence number 2169 (length = 1527, name = NODE_2169_length_1527_cov_14.2353_ID_4337) -----
      #
      # Constraints/Hints:
      # (none)
      # Predicted genes for sequence number 2169 on both strands
      # start gene g9811
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS gene 1 870 0.3 + . g9811
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS transcript 1 870 0.3 + . g9811.t1
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS internal 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS terminal 442 870 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 1 216 0.58 + . transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 390 441 0.75 + . transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 442 867 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS stop_codon 868 870 . + 0 transcript_id "g9811.t1"; gene_id "g9811";
      # coding sequence = [ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
      # cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
      # aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
      # ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
      # tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
      # aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
      # tcacgagcctcctcgcaggataggctggactga]
      # protein sequence = [NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFV
      FEVPDCHARLPVASRASSQDRLD]
      # end gene g9811


      The file Gtcitri.aa obtained using getAnnoFasta.pl gives the same aa sequence
      NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH
      APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFVFEVPDCHARLPVASRASSQDRLD
      This is good

      the file Gcitri.CodingSeq obtained using getAnnoFasta.pl gives the whole "coding sequence"
      ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
      # cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
      # aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
      # ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
      # tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
      # aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
      # tcacgagcctcctcgcaggataggctggactga
      !!!

      If you translate that in frame 1
      You obtain this

      IICSASSSCCTDALTATPALAAPQSPSWAARCGRRFPTMRGRPWRAP*ATFPRSSQD*AK
      SSPFPRRTIRRCCWSSRTTRPASSSRRARATSATRGLRHSPRTCLDRRQEGEEVLSAGMP
      SGCGFSGTRTTRPTSR*MAPGS*VISAATFSQGS*G*RMEKMALGRTTSRCLEDDSCLKC
      QTAMPDCQSHHEPPRRIGWT

      Agains The gff file says that the first CDS starts at 217 in frame 2 , I believe that this is the right frame for the Codseq. It seems that I am missing something when I use getAnnoFasta.pl.

      Of course the start codon feature is missed too. When I loaded the gff file in artemis I also have features of the incomplete genes out of frame.


      Thanks
      Last edited by joxcargator73; 03-23-2015, 03:55 PM.

      Comment


      • #4
        Originally posted by joxcargator73 View Post
        I am using augustus 3.0.3.
        It happens with the default prediction ( partial: allow prediction of incomplete genes at the sequence boundaries (default))

        I checked the gff file and only the incomplete genes have this problem.
        ok, the prediction is fine then. I'll explain why:

        Originally posted by joxcargator73 View Post
        I used this command
        ./augustus --species=magnaporthe_grisea --codingseq=1 --protein=1 --exonnames=1 G_citricarpa_assembly_Scf4.fasta > Gcitri.gff

        Then

        perl getAnnoFasta.pl --seqfile=G_citricarpa_assembly_Scf4.fasta Gcitri.gff

        to obtain CodingSeq, and aa
        Just a comment: I assume that there are no closer related species than Magnaporthe (eg something from the Dothideomycetes) available? In that case it is important to keep in mind that some of your gene predictions will probably be different when using alternative gene models. But for an initial working set everything is fine and you don't have to bother.

        Originally posted by joxcargator73 View Post
        For example in my file: Gcitri.gff
        I have an "incomplete call"

        # ----- prediction on sequence number 2169 (length = 1527, name = NODE_2169_length_1527_cov_14.2353_ID_4337) -----
        #
        # Constraints/Hints:
        # (none)
        # Predicted genes for sequence number 2169 on both strands
        # start gene g9811
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS gene 1 870 0.3 + . g9811
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS transcript 1 870 0.3 + . g9811.t1
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS internal 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS terminal 442 870 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 1 216 0.58 + . transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 390 441 0.75 + . transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 442 867 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS stop_codon 868 870 . + 0 transcript_id "g9811.t1"; gene_id "g9811";
        # coding sequence = [ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
        # cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
        # aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
        # ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
        # tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
        # aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
        # tcacgagcctcctcgcaggataggctggactga]
        # protein sequence = [NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFV
        FEVPDCHARLPVASRASSQDRLD]
        # end gene g9811


        The file Gtcitri.aa obtained using getAnnoFasta.pl gives the same aa sequence
        NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH
        APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFVFEVPDCHARLPVASRASSQDRLD
        This is good

        the file Gcitri.CodingSeq obtained using getAnnoFasta.pl gives the whole "coding sequence"
        ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
        # cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
        # aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
        # ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
        # tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
        # aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
        # tcacgagcctcctcgcaggataggctggactga
        !!!

        If you translate that in frame 1
        You obtain this

        IICSASSSCCTDALTATPALAAPQSPSWAARCGRRFPTMRGRPWRAP*ATFPRSSQD*AK
        SSPFPRRTIRRCCWSSRTTRPASSSRRARATSATRGLRHSPRTCLDRRQEGEEVLSAGMP
        SGCGFSGTRTTRPTSR*MAPGS*VISAATFSQGS*G*RMEKMALGRTTSRCLEDDSCLKC
        QTAMPDCQSHHEPPRRIGWT

        Agains The gff file says that the first CDS starts at 217 in frame 2 , I believe that this is the right frame for the Codseq. It seems that I am missing something when I use getAnnoFasta.pl.

        Of course the start codon feature is missed too. When I loaded the gff file in artemis I also have features of the incomplete genes out of frame.

        Thanks
        1) What you get is what you see. The perl script does nothing more than extracting everything between "# coding sequence = [" and its respective closing bracket (Same thing for protein sequence).

        2) Augustus uses 0 to indicate 1st frame, 1 for 2nd and 2 for 3rd and if you translate the coding sequence in all frames, you'll also see that it is the 3rd frame translation.

        3) I highlighted the most important part. You have a partial gene which was detected at the start of a contig. Obivously, for such a prediction no start "ATG" can be found (as you have noticed). But your prediction starts with an intron and to see a "frame-shift" is absolutely fine in that case. Keep in might that you cannot necessarily translate every exon independently in 1st frame!

        4) It doesn't make sense to translate in a frame that contains premature stop codons as this only indicates that the prediction is somehow wrong. For your example, there seems to be evidence for a transcript of that certain length and there is 1 frame which completely covers this region with only 1 stop at the end. This frame is shown.

        I hope this helps

        Comment


        • #5
          Thanks, for your help.
          The problem that I still have , it is how to deal with the codingSeqs. If some of them (mostly the incomplete calls) do not start in frame and I need the nucleotide sequence to run a codon usage analysis , I will get a something else but not what I expect. Moreover if I compare with the aminoacid tarlatan that Ausguts shows. Should I avoid using truncate calls. Are the complete gene calls start in frame, are these safe to just take their "coding seqs" and translate them?

          Thanks for your help

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          24 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          25 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          21 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          52 views
          0 likes
          Last Post seqadmin  
          Working...
          X