Seqanswers Leaderboard Ad

**kbradnam** · 09-18-2014, 01:50 PM

Hi,

The kogs.fa file contains 6 proteins per KOG. One for each species used to generate the subset of 458 KOGs. The main part of CEGMA is to determine which of these 458 core eukaryotic genes (CEGs) are present in your input file.

This was all for the original purpose of CEGMA, to find a handful of genes which someone could then use to train a gene finder. After it has found which of the 458 CEGs are present, it then performs its secondary role of assessing the completeness of the gene space.

To do this, it only wants to use the most conserved, and least paralogous of the 458 CEGs. So it uses information in the completeness_cutoff.tbl file (inside the CEGMA data directory) to narrow the results down to a subset of 248 CEGs.

CEGMA itself is not meant to be used to annotate a new genome. If it is a genome devoid of any functional annotation, then use however many of the 458 CEGs are found to train a gene finder. If you want to get a feeling for how complete the gene space (and by proxy, the entire genome is), use the results from the subset of 248 CEGs.

We include KOG IDs so that you can cross-reference them with the original KOGs database (now over a decade old): http://www.ncbi.nlm.nih.gov/COG/

**ssully** · 09-19-2014, 02:46 PM

Thanks for explaining that. I now have my lookup table.

Another thing I'd like to be clearer on: the relationship between the 'completeness' metrics and the rest of the output. For exmaple:

I have one assembly (of a genome with many gene families) that returns on the 'complete' line in .completeness_report:
#Proteins 156 (out of 248)
#Total 262 (this being the previous number plus orthologs)

However, the number of seqs in the corresponding .fa file is 205 -- more than #proteins but less than #total. Each KOG ID in .fa is unique, so I presume the reason .fa contains more sequences than #proteins is that the matches are to the set of 458 CEGS, not just the most conserved set of 248 CEGS.

However, when I count the KOG IDs in .fa that match the 248 KOG IDs (which I gleaned from CEGMA/data/completeness_cutoff.tbl) , I get 105 matches, not 156. So 51 of the matches to the 248 CEGs did not end up in .fa.

I am guessing this difference is because determining 'completeness' is a separate process and intent from determining the set to be used for gene finding (the .fa file). Can you clarify?

**kbradnam** · 09-22-2014, 07:19 AM

Basically, all output from CEGMA apart from the output.completeness_report file is referring to the larger set of 458 CEGs. The last file generated by CEGMA is output.completeness_report and that is the only place where results for the subset of 248 CEGs are calculated.

Your last point is mostly correct: the 248 CEGs are based on different filtering criteria than the 458 CEGs, but I think that everything counted in the 248 results should be in the 458 results (but not vice versa).

Are you trying to exact pattern matching? Note that KOG IDs get various numerical suffixes added during the CEGMA run. E.g. an ortholog of KOG0062 might be detected in your input file but CEGMA might give this a name of KOG0062.5 (I won't go into the reasons why this is the case). Might this explain your discrepancy in numbers?

**kbradnam** · 09-22-2014, 08:21 AM

I've expanded in some more details on your questions in a blog post, which includes more information about functional annotations of KOGs in particular:

Another CEGMA post: KOGs vs CEGs and 458 vs 248 — ACGT

http://www.acgt.me/blog/2014/9/22/another-cegma-post-kogs-vs-cegs-and-458-vs-248

I posted another answer about CEGMA on seqanswers.com last week. I thought I'd cover this in a little more detail here (note, questions edited from how they originally appeared): Question 1: CEGMA uses a 'kogs.fa' file — containing 2,748 proteins — to compare to a user's genome sequence. Thes

**ssully** · 09-22-2014, 01:53 PM

Originally posted by kbradnam View Post

Basically, all output from CEGMA apart from the output.completeness_report file is referring to the larger set of 458 CEGs. The last file generated by CEGMA is output.completeness_report and that is the only place where results for the subset of 248 CEGs are calculated.

Your last point is mostly correct: the 248 CEGs are based on different filtering criteria than the 458 CEGs, but I think that everything counted in the 248 results should be in the 458 results (but not vice versa).

Are you trying to exact pattern matching? Note that KOG IDs get various numerical suffixes added during the CEGMA run. E.g. an ortholog of KOG0062 might be detected in your input file but CEGMA might give this a name of KOG0062.5 (I won't go into the reasons why this is the case). Might this explain your discrepancy in numbers?

I did remove all the KOG suffixes before pattern matching. ..so the disparity isn't due to that. It really appears that the .completeness_report KOGs are not fully included within the .fa file KOGS.

Can you replicate that in your own results?

**kbradnam** · 09-22-2014, 02:07 PM

I don't see this:

$ cut -f 1 -d " " completeness_cutoff.tbl > 248_CEG_ids.txt
$ grep ">" kogs.fa | sed 's/.*___//' | sort -u > 458_CEG_ids.txt
$ grep -f 248_CEGs_ids.txt 458_CEG_ids.txt | wc -l

The last step reports a count of 248. I.e. all CEG IDs present in completeness_cutoff.tbl are present in logs.fa.

**ssully** · 09-23-2014, 08:14 AM

Sorry, I wasn't really referring to the two /data files (kogs.fa and completeness_cutoff.tbl) I was referring to my output files. Let's call them my.completeness_report and my.cegma.fa

we've established that
-my.completeness_report reports matches to a master set of 248 CEGs (listed in completeness_cutoff.tbl).
-my.cegma.fa reports matches to a master set of 458 CEGs (listed in kogs.fa)
-The smaller master set is wholly contained within the larger master set

In my CEGMA run, I got these results
156 CEGs reported in my.completeness_report
205 CEGS reported in my.cegma.fa

I assumed that all 156 CEGS in the first set would be in the second set of 205 too.

But this is not the case.

If I compare the 156 CEG IDs (which are not explicitly reported, but can be derived from the complementary list of NON-matches in my.compleness_report) to the 205 CEG IDs in my.cegma.fa, I find just 101 CEGs shared by both.

If I compare the master set of 248 CEG IDs in /data/completeness_cutoff.tbl to the 205 CEG IDs in my.cegma.fa, I find 105 CEGs shared by both. (i.e., my.cegma.fa not only does not contain all 156 from the 248 masterset, it is also reporting 4 novel CEGs from that masterset!)

So I conclude that the my.completeness_report number (156 hits to a master set of 248 CEGs) is derived very differently from, and independently of, the set reported to me in my.cegma.fa (205 hits to a master set of 458 CEGs).

**kbradnam** · 09-23-2014, 08:19 AM

Hmm. That is odd. This is part of CEGMA that I don't fully understand myself (I didn't write the code) and this is not something I can easily check I'm afraid. We have no current funding to support or develop CEGMA (hopefully this might one day change).

**ssully** · 09-23-2014, 10:32 AM

Perhaps if you do one of your own CEGMA runs, try checking as I did and see if you get incompletely overlapping results in those two outputs.

Topics	Statistics	Last Post
Bacterial Timeline Study Suggests Oxygen Use Preceded Photosynthesis by seqadmin Started by seqadmin, Today, 12:59 PM	0 responses 6 views 0 reactions	Last Post by seqadmin Today, 12:59 PM
New Software Simplifies 3D Gene Expression Mapping by seqadmin Started by seqadmin, Yesterday, 10:17 AM	0 responses 8 views 0 reactions	Last Post by seqadmin Yesterday, 10:17 AM
AI Tool Creates High-Resolution 3D Maps of the Mouse Brain by seqadmin Started by seqadmin, 03-20-2025, 05:03 AM	0 responses 49 views 0 reactions	Last Post by seqadmin 03-20-2025, 05:03 AM
Studying Microbial Gene Transfer with RNA Barcoding by seqadmin Started by seqadmin, 03-19-2025, 07:27 AM	0 responses 60 views 0 reactions	Last Post by seqadmin 03-19-2025, 07:27 AM

Seqanswers Leaderboard Ad

CEGMA VM: questions about inputs and outputs

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News