By default, the CEGMA VM uses a 'kogs.fa' file as the protein sequence input that is compared against the user's genome sequence.
kogs.fa contains ~2700 sequences, which I am guessing is the complete complement of KOGs from 2003. The CEGMA publications cite much smaller, more highly curated KOG sets as being useful (458 CEGs, core eukaryotic genes, further winnowed to the 248 most-conserved CEGs). Does anyone know why kogs.fa is the default? Does it get 'curated' down to a smaller set during a CEGMA VM run?
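One way to check what kogs.fa actually holds is to count its sequences and the distinct KOG IDs in its headers. This is only a rough sketch: it assumes each FASTA header contains an ID of the form `KOG` followed by digits, which I have not confirmed for every header in the file, and the example headers below are made up.

```python
import re
from collections import Counter

# Assumption: each FASTA header carries an ID such as "KOG0001" somewhere in it.
KOG_ID = re.compile(r"KOG\d+")

def count_kogs(fasta_lines):
    """Return (number of sequences, Counter of KOG IDs found in the headers)."""
    n_sequences = 0
    kog_counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines, only headers are inspected
        n_sequences += 1
        match = KOG_ID.search(line)
        if match:
            kog_counts[match.group()] += 1
    return n_sequences, kog_counts

# Made-up headers, for illustration only.
example = [">protA___KOG0001", "MKV", ">protB___KOG0001", "MLL", ">protC___KOG0002", "MAA"]
n, kogs = count_kogs(example)
print(n, "sequences,", len(kogs), "distinct KOG IDs")
```

To run it on the real file, pass an open file handle instead of the list, for example `with open("kogs.fa") as handle: n, kogs = count_kogs(handle)`. If the number of distinct IDs turns out to be far smaller than the number of sequences, the file holds several proteins per KOG rather than ~2700 separate KOGs.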
So far, my CEGMA VM output includes many KOG IDs but no description of the protein name or function that each KOG ID represents. This makes it less useful for annotating new genomes. Is there a lookup table somewhere?
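If such a lookup table exists, attaching descriptions to the KOG IDs would be a simple join. The sketch below assumes a hypothetical tab-separated table with the KOG ID in the first column and a description in the second; the layout, the function names, and the placeholder descriptions are all my own assumptions, not something CEGMA provides.

```python
import csv
import io

def load_kog_descriptions(handle):
    """Read a tab-separated table (KOG ID, description) into a dict.

    The two-column layout is an assumption made for this sketch.
    """
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}

def annotate(kog_ids, descriptions):
    """Pair each KOG ID with its description, or 'unknown' if it is missing."""
    return [(kog_id, descriptions.get(kog_id, "unknown")) for kog_id in kog_ids]

# Placeholder table contents, for illustration only.
table = io.StringIO("KOG0001\tplaceholder description A\nKOG0002\tplaceholder description B\n")
descriptions = load_kog_descriptions(table)
annotated = annotate(["KOG0002", "KOG9999"], descriptions)
for kog_id, description in annotated:
    print(kog_id, description, sep="\t")
```

IDs that are missing from the table come back as "unknown" rather than raising an error, so gaps in the table are easy to spot.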
(I just posted similar questions to the CEGMA mailing list, so I hope the mods there and here don't mind.)