Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • CEGMA VM: questions about inputs and outputs

    CEGMA VM, by default, uses a 'kogs.fa' file as the protein sequence input to compare to the user's genome sequence input.

    kogs.fa contains ~2700 sequences, which I am guessing is the complete complement of KOGS from year 2003. CEGMA publications cite a much smaller, more highly curated KOG sets as being useful (458 CEGS, core eukaryotic genes, further winnowed to 248 most-conserved CEGS) . Does anyone know why kogs.fa is the default? Does it get 'curated' down to a smaller set during a CEGMA VM run?


    CEGMA VM output, for me,so far, includes many KOG IDs but no descripition of what protein name/function each KOG ID represents. This makes it not so useful for annotating new genomes. Is there a lookup table somewhere?


    (I just posted similar questions to the CEGMA mailing list so I hope the mods there and here don't mind)

  • #2
    Hi,

    The kogs.fa file contains 6 proteins per KOG. One for each species used to generate the subset of 458 KOGs. The main part of CEGMA is to determine which of these 458 core eukaryotic genes (CEGs) are present in your input file.

    This was all for the original purpose of CEGMA, to find a handful of genes which someone could then use to train a gene finder. After it has found which of the 458 CEGs are present, it then performs its secondary role of assessing the completeness of the gene space.

    To do this, it only wants to use the most conserved, and least paralogous of the 458 CEGs. So it uses information in the completeness_cutoff.tbl file (inside the CEGMA data directory) to narrow the results down to a subset of 248 CEGs.

    CEGMA itself is not meant to be used to annotate a new genome. If it is a genome devoid of any functional annotation, then use however many of the 458 CEGs are found to train a gene finder. If you want to get a feeling for how complete the gene space (and by proxy, the entire genome is), use the results from the subset of 248 CEGs.

    We include KOG IDs so that you can cross-reference them with the original KOGs database (now over a decade old): http://www.ncbi.nlm.nih.gov/COG/

    Comment


    • #3
      Thanks for explaining that. I now have my lookup table.

      Another thing I'd like to be clearer on: the relationship between the 'completeness' metrics and the rest of the output. For exmaple:

      I have one assembly (of a genome with many gene families) that returns on the 'complete' line in .completeness_report:
      #Proteins 156 (out of 248)
      #Total 262 (this being the previous number plus orthologs)

      However, the number of seqs in the corresponding .fa file is 205 -- more than #proteins but less than #total. Each KOG ID in .fa is unique, so I presume the reason .fa contains more sequences than #proteins is that the matches are to the set of 458 CEGS, not just the most conserved set of 248 CEGS.

      However, when I count the KOG IDs in .fa that match the 248 KOG IDs (which I gleaned from CEGMA/data/completeness_cutoff.tbl) , I get 105 matches, not 156. So 51 of the matches to the 248 CEGs did not end up in .fa.

      I am guessing this difference is because determining 'completeness' is a separate process and intent from determining the set to be used for gene finding (the .fa file). Can you clarify?
      Last edited by ssully; 09-19-2014, 02:51 PM.

      Comment


      • #4
        Basically, all output from CEGMA apart from the output.completeness_report file is referring to the larger set of 458 CEGs. The last file generated by CEGMA is output.completeness_report and that is the only place where results for the subset of 248 CEGs are calculated.

        Your last point is mostly correct: the 248 CEGs are based on different filtering criteria than the 458 CEGs, but I think that everything counted in the 248 results should be in the 458 results (but not vice versa).

        Are you trying to exact pattern matching? Note that KOG IDs get various numerical suffixes added during the CEGMA run. E.g. an ortholog of KOG0062 might be detected in your input file but CEGMA might give this a name of KOG0062.5 (I won't go into the reasons why this is the case). Might this explain your discrepancy in numbers?

        Comment


        • #5
          I've expanded in some more details on your questions in a blog post, which includes more information about functional annotations of KOGs in particular:

          I posted another answer about CEGMA on seqanswers.com last week. I thought I'd cover this in a little more detail here (note, questions edited from how they originally appeared): Question 1: CEGMA uses a 'kogs.fa' file — containing 2,748 proteins — to compare to a user's genome sequence. Thes

          Comment


          • #6
            Originally posted by kbradnam View Post
            Basically, all output from CEGMA apart from the output.completeness_report file is referring to the larger set of 458 CEGs. The last file generated by CEGMA is output.completeness_report and that is the only place where results for the subset of 248 CEGs are calculated.

            Your last point is mostly correct: the 248 CEGs are based on different filtering criteria than the 458 CEGs, but I think that everything counted in the 248 results should be in the 458 results (but not vice versa).

            Are you trying to exact pattern matching? Note that KOG IDs get various numerical suffixes added during the CEGMA run. E.g. an ortholog of KOG0062 might be detected in your input file but CEGMA might give this a name of KOG0062.5 (I won't go into the reasons why this is the case). Might this explain your discrepancy in numbers?

            I did remove all the KOG suffixes before pattern matching. ..so the disparity isn't due to that. It really appears that the .completeness_report KOGs are not fully included within the .fa file KOGS.

            Can you replicate that in your own results?

            Comment


            • #7
              I don't see this:

              $ cut -f 1 -d " " completeness_cutoff.tbl > 248_CEG_ids.txt
              $ grep ">" kogs.fa | sed 's/.*___//' | sort -u > 458_CEG_ids.txt
              $ grep -f 248_CEGs_ids.txt 458_CEG_ids.txt | wc -l

              The last step reports a count of 248. I.e. all CEG IDs present in completeness_cutoff.tbl are present in logs.fa.

              Comment


              • #8
                Sorry, I wasn't really referring to the two /data files (kogs.fa and completeness_cutoff.tbl) I was referring to my output files. Let's call them my.completeness_report and my.cegma.fa

                we've established that
                -my.completeness_report reports matches to a master set of 248 CEGs (listed in completeness_cutoff.tbl).
                -my.cegma.fa reports matches to a master set of 458 CEGs (listed in kogs.fa)
                -The smaller master set is wholly contained within the larger master set


                In my CEGMA run, I got these results
                156 CEGs reported in my.completeness_report
                205 CEGS reported in my.cegma.fa

                I assumed that all 156 CEGS in the first set would be in the second set of 205 too.

                But this is not the case.

                If I compare the 156 CEG IDs (which are not explicitly reported, but can be derived from the complementary list of NON-matches in my.compleness_report) to the 205 CEG IDs in my.cegma.fa, I find just 101 CEGs shared by both.

                If I compare the master set of 248 CEG IDs in /data/completeness_cutoff.tbl to the 205 CEG IDs in my.cegma.fa, I find 105 CEGs shared by both. (i.e., my.cegma.fa not only does not contain all 156 from the 248 masterset, it is also reporting 4 novel CEGs from that masterset!)


                So I conclude that the my.completeness_report number (156 hits to a master set of 248 CEGs) is derived very differently from, and independently of, the set reported to me in my.cegma.fa (205 hits to a master set of 458 CEGs).
                Last edited by ssully; 09-23-2014, 09:30 AM.

                Comment


                • #9
                  Hmm. That is odd. This is part of CEGMA that I don't fully understand myself (I didn't write the code) and this is not something I can easily check I'm afraid. We have no current funding to support or develop CEGMA (hopefully this might one day change).

                  Comment


                  • #10
                    Perhaps if you do one of your own CEGMA runs, try checking as I did and see if you get incompletely overlapping results in those two outputs.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 03-27-2024, 06:37 PM
                    0 responses
                    13 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-27-2024, 06:07 PM
                    0 responses
                    11 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    53 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    69 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X