  • CEGMA error

    Hi mates,

    I've been trying to run CEGMA on a very small number of contigs from a eukaryotic species.

    CEGMA is stopping and showing this message:

    Illegal division by zero at /usr/local/src/cegma/lib/geneid.pm line 28.


    What could that be? Any clue?


    Thanks,
    Condomitti.

  • #2
    I'm guessing that your genome assembly doesn't contain any core genes at all (CEGMA sort of assumes that something will be there, but that's not always the case). Can you show us the full output you have seen, and also run 'ls -l' and include that output too?



    • #3
      Originally posted by kbradnam:
      I'm guessing that your genome assembly doesn't contain any core genes at all (CEGMA sort of assumes that something will be there, but that's not always the case). Can you show us the full output you have seen, and also run 'ls -l' and include that output too?
      Thanks for your reply, kbradnam!

      These are the output files:
      Code:
      cegma.err
      cegma.log
      teste.cegma.errors
      This is what I'm using to launch CEGMA:

      Code:
      cegma --genome input_contigs.fa -threads 80 -o teste -v 1>cegma.log 2>cegma.err
      teste.cegma.errors contains only:

      Code:
      Illegal division by zero at /usr/local/src/cegma/lib/geneid.pm line 28.
      I see line 28 contains:

      Code:
      $mean = $sum / $total;

      Thanks a lot.
      Cheers,
      Condomitti.
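The Perl line quoted above divides a sum by a count, so it fails whenever that count is zero. The thread does not show what `$sum` and `$total` hold, so the following is only a minimal Python sketch of the general guard, using made-up values. `safe_mean` is a hypothetical helper for illustration and is not part of CEGMA.

```python
def safe_mean(values):
    """Return the mean of values, or None when there is nothing to average."""
    values = list(values)
    total = len(values)
    if total == 0:
        # An unguarded sum / total would raise ZeroDivisionError here,
        # the Python counterpart of Perl's "Illegal division by zero".
        return None
    return sum(values) / total


# Hypothetical feature lengths; the empty list stands in for a training set
# in which no features of the required kind were found.
print(safe_mean([120, 85, 240]))
print(safe_mean([]))
```

Returning `None` rather than `0` keeps "no data" distinguishable from a genuine mean of zero, so the caller can stop with a clearer message.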



      • #4
        Your log file will include details of each step in the CEGMA pipeline, along with the specific command run at each stage. The first command is 'genome_map'. You could try running each step separately. But I'm betting that the problem is that there are no core genes.

        How big is your input file (total number of bp) and what's the average length (N50 or mean)?
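The figures asked for here (total bp, mean contig length, N50) can be worked out directly from the FASTA file. Below is a minimal Python sketch, assuming a plain nucleotide FASTA; the function names and the tiny in-memory example are illustrative assumptions, not anything from CEGMA.

```python
def read_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def assembly_stats(lengths):
    """Return (total_bp, mean_length, n50) for an iterable of contig lengths."""
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        return 0, 0.0, 0
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50: length of the contig at which the running sum of the
        # longest contigs first reaches half of the total assembly size.
        if running * 2 >= total:
            n50 = length
            break
    return total, total / len(lengths), n50


# Made-up three-contig example; sequences may span several lines.
example = [">c1", "ACGTACGT", ">c2", "ACGT", "AC", ">c3", "AC"]
print(assembly_stats(read_lengths(example)))
```

For a real file, the same functions can be fed an open file handle, for example `with open("input_contigs.fa") as fh: print(assembly_stats(read_lengths(fh)))`.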



        • #5
          You are right... that's probably related to my contigs. I just ran tail -n5000 and made sure there weren't any truncated contigs. My intention with this small subset was to get an idea of the execution time and use it to estimate the time needed for the complete set of contigs.

          I'll try a new run with a bigger set.



          This is what I got from cegma.log:

          Code:
          Building a new DB, current time: 04/14/2014 12:04:43
          New DB name:   /tmp/genome66217.blastdb
          New DB title:  input_contigs.fa
          Sequence type: Nucleotide
          Keep Linkouts: T
          Keep MBits: T
          Maximum file size: 1000000000B
          Adding sequences from FASTA; added 2500 sequences in 0.24885 seconds.
          Processing KOG: KOG0002 
          Processing KOG: KOG0003 
          Processing KOG: KOG0018 
          Processing KOG: KOG0019 
          Processing KOG: KOG0025 
          Processing KOG: KOG0047 
          Processing KOG: KOG0062 
          Processing KOG: KOG0073 
          ...
          ...
          (KEEP GOING PROCESSING KOG...)
          ...
          Processing geneid prediction: KOG0209.8
          Processing geneid prediction: KOG0276.7
          Processing geneid prediction: KOG0402.3
          Processing geneid prediction: KOG0933.5
          Processing geneid prediction: KOG0933.2
          Processing geneid prediction: KOG0948.8
          Processing geneid prediction: KOG0969.7
          Processing geneid prediction: KOG0969.5
          Processing geneid prediction: KOG0985.13
          Processing geneid prediction: KOG0996.5
          Processing geneid prediction: KOG1062.6
          Processing geneid prediction: KOG1795.7
          Processing geneid prediction: KOG2004.7
          Processing geneid prediction: KOG2004.4
          DATA COLLECTED: 1 Coding sequences containing 0 introns
          and cegma.err:
          Code:
          ********************************************************************************
          **                    MAPPING PROTEINS TO GENOME (TBLASTN)                    **
          ********************************************************************************
          
          RUNNING: genome_map  -n genome -p 6 -o 5000 -c 2000 -t 80  -v  /usr/local/src/cegma/data/kogs.fa input_contigs.fa 2>teste.cegma.errors
          Found 2209 candidate regions in input_contigs.fa
          
          
          ********************************************************************************
          **     MAKING INITIAL GENE PREDICTIONS FOR CORE GENES (GENEWISE + GENEID)     **
          ********************************************************************************
          
          RUNNING: local_map -n local -f -h /usr/local/src/cegma/data/hmm_profiles -i KOG -v  genome.chunks.fa 2>teste.cegma.errors
          NOTE: created 14 geneid predictions
          
          
          ********************************************************************************
          **           FILTERING INITIAL PROTEINS PRODUCED BY GENEID (HMMER)            **
          ********************************************************************************
          
          RUNNING: hmm_select -i KOG -o local -t 80 -v  /usr/local/src/cegma/data/hmm_profiles local.geneid.fa /usr/local/src/cegma/data/profiles_cutoff.tbl 2>teste.cegma.errors
          NOTE: Found 1 geneid predictions with scores above threshold value
          
          
          ********************************************************************************
          **       CALCULATING GENEID PARAMETERS FROM SELECTED GENEID PREDICTIONS       **
          ********************************************************************************
          
          RUNNING: geneid-train -v local.geneid.selected.gff local.geneid.selected.dna geneid_params 2>teste.cegma.errors
          geneid-train did not work properly


          Cheers,
          Condomitti.



          • #6
            On an aging Mac Pro, I can run CEGMA against small assemblies in a few hours (using 8 threads); medium assemblies (maybe 100–1000 Mbp) take a day, and a large vertebrate genome can take 2–4 days.

            Things will be slower if you needlessly include thousands of tiny contigs (e.g., <1,000 bp), which are unlikely to contain any full-length core gene.
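Dropping the tiny contigs before running CEGMA can be done with a few lines of code. The following is a minimal Python sketch, assuming a plain FASTA input. The function name `filter_fasta`, the `min_length` parameter and its default of 1000 are illustrative choices; whether the cut-off is inclusive is not specified in the thread.

```python
def filter_fasta(lines, min_length=1000):
    """Yield (header, sequence) for records with at least min_length bases."""
    header, chunks = None, []

    def finished_record():
        # Join the collected sequence lines and apply the length filter.
        seq = "".join(chunks)
        if header is not None and len(seq) >= min_length:
            return [(header, seq)]
        return []

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            yield from finished_record()
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    yield from finished_record()


# Made-up records with a tiny cut-off so the effect is visible.
example = [">long", "ACGT", "AC", ">short", "AC"]
for name, seq in filter_fasta(example, min_length=5):
    print(name)
    print(seq)
```

Working on whole records, rather than on a fixed number of lines, also avoids cutting a multi-line sequence in half when taking a subset.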



            • #7
              Hi kbradnam,

              Running with more contigs didn't give me the error =) The issue was indeed related to contigs that had no core genes at all.


              I'll try a full run now using contigs >1,000bp.

              Thanks a lot for all your help!

              Cheers,
              Condomitti.



              • #8
                Hi kbradnam,

                I've run cegma quite a few times now and everything is working fine so far.

                My only confusion now is about the CEGs used by CEGMA: the .completeness_report file gives me statistics based on the 248 ultra-conserved CEGs.
                Some papers (Assemblathon 2, for instance) report statistics for the 458 CEGs, though.
                The command I'm using is the one given in the CEGMA README file (To run CEGMA using the 458 default proteins type:... )

                How can I get the numbers for the full set of CEGs?


                Thanks again mate!
                Condomitti.



                • #9
                  The total number of CEGs found from the 458 set is given by counting the lines in the output.cegma.id file (or by counting the number of sequences in the output.cegma.fa file). For the subset of 248, look in the output.completeness_report file.

                  Also see the relevant part of the CEGMA FAQ to understand the differences between the 458 and 248 sets of CEGs: http://korflab.ucdavis.edu/Datasets/...faq.html#link6
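Both counts mentioned in the post above are easy to script. Here is a minimal Python sketch, assuming the ID file holds one entry per line and that sequences are in FASTA format. The helper names and the in-memory stand-in data are made up for illustration; the real layout of the CEGMA output files should be checked before relying on this.

```python
def count_nonblank_lines(lines):
    """Count lines that contain something other than whitespace."""
    return sum(1 for line in lines if line.strip())


def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Hypothetical stand-ins for the contents of an ID list and a FASTA file.
id_lines = ["KOG0002\n", "KOG0003\n", "\n"]
fasta_lines = [">KOG0002\n", "MSTK\n", ">KOG0003\n", "MAAL\n"]

print(count_nonblank_lines(id_lines))
print(count_fasta_records(fasta_lines))
```

With real output, the same functions can be passed an open file handle, for example `with open("output.cegma.id") as fh: print(count_nonblank_lines(fh))`.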



                  • #10
                    Thanks kbradnam! I'll take a look.

                    Have a great day!

                    Cheers,
                    Condomitti.



                    • #11
                      Hi kbradnam,

                      Re-reading the CEGMA paper, I saw this:

                      Code:
                      To avoid predicting short genes, we required that
                      the proportion of the predicted protein that aligns to the
                      profile is at least 70%. Changing or removing this length
                      requirement can allow the mapping protocol to predict
                      either more fragmentary proteins, or fewer but more complete
                      proteins. All results in this article use the 70% length
                      cut-off.
                      I can't find how to change the length requirement though (I have looked at the README file, website and publication). Could you please give me a clue on that too?

                      Thanks mate!

                      Condomitti.
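The rule quoted from the paper amounts to a simple ratio test: the aligned part of a predicted protein divided by its full length must be at least 0.70. The following is a minimal Python sketch of that arithmetic only. It is not CEGMA's code, and the function and parameter names are hypothetical; the thread does not establish where or how CEGMA applies the cut-off internally.

```python
def passes_length_cutoff(aligned_length, protein_length, min_fraction=0.70):
    """Return True if the aligned share of the protein meets the cut-off."""
    if protein_length <= 0:
        # Nothing to compare against; treat as failing rather than dividing by zero.
        return False
    return aligned_length / protein_length >= min_fraction


# Made-up lengths in amino acids.
print(passes_length_cutoff(280, 400))
print(passes_length_cutoff(150, 400))
```

Lowering `min_fraction` admits more fragmentary predictions, while raising it keeps fewer but more complete ones, which is the trade-off the quoted passage describes.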



                      • #12
                        Hmm, I'm not sure if that is one of the command-line options supported by CEGMA. You'd have to dig around the code to see where that is configured, and I'm afraid I can't help you with that.



                        • #13
                          Ok, I'll take a look inside the source and try to change that.

                          Thanks anyway mate!

                          Cheers,
                          Condomitti.
