SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Question about inputs for ChIP-seq omy567 Epigenetics 8 03-20-2014 08:58 AM
pipe two inputs to a command (coverageBed) hari_iyer16 Bioinformatics 3 04-16-2013 02:51 AM
bwa question on file outputs/inputs ikrier Bioinformatics 12 10-12-2012 09:22 AM
Bowtie. Different outputs from equivalent(?) inputs. AEB Bioinformatics 1 08-18-2011 01:34 AM
gsAssembly (Newbler) de novo behaviour, inputs and outputs nicolallias 454 Pyrosequencing 6 10-29-2010 01:16 AM

Reply
 
Thread Tools
Old 09-16-2014, 09:24 AM   #1
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default CEGMA VM: questions about inputs and outputs

CEGMA VM, by default, uses a 'kogs.fa' file as the protein sequence input to compare to the user's genome sequence input.

kogs.fa contains ~2700 sequences, which I am guessing is the complete complement of KOGS from year 2003. CEGMA publications cite a much smaller, more highly curated KOG sets as being useful (458 CEGS, core eukaryotic genes, further winnowed to 248 most-conserved CEGS) . Does anyone know why kogs.fa is the default? Does it get 'curated' down to a smaller set during a CEGMA VM run?


CEGMA VM output, for me,so far, includes many KOG IDs but no descripition of what protein name/function each KOG ID represents. This makes it not so useful for annotating new genomes. Is there a lookup table somewhere?


(I just posted similar questions to the CEGMA mailing list so I hope the mods there and here don't mind)
ssully is offline   Reply With Quote
Old 09-18-2014, 02:50 PM   #2
kbradnam
Member
 
Location: Davis, CA

Join Date: May 2011
Posts: 53
Default

Hi,

The kogs.fa file contains 6 proteins per KOG. One for each species used to generate the subset of 458 KOGs. The main part of CEGMA is to determine which of these 458 core eukaryotic genes (CEGs) are present in your input file.

This was all for the original purpose of CEGMA, to find a handful of genes which someone could then use to train a gene finder. After it has found which of the 458 CEGs are present, it then performs its secondary role of assessing the completeness of the gene space.

To do this, it only wants to use the most conserved, and least paralogous of the 458 CEGs. So it uses information in the completeness_cutoff.tbl file (inside the CEGMA data directory) to narrow the results down to a subset of 248 CEGs.

CEGMA itself is not meant to be used to annotate a new genome. If it is a genome devoid of any functional annotation, then use however many of the 458 CEGs are found to train a gene finder. If you want to get a feeling for how complete the gene space (and by proxy, the entire genome is), use the results from the subset of 248 CEGs.

We include KOG IDs so that you can cross-reference them with the original KOGs database (now over a decade old): http://www.ncbi.nlm.nih.gov/COG/
kbradnam is offline   Reply With Quote
Old 09-19-2014, 03:46 PM   #3
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

Thanks for explaining that. I now have my lookup table.

Another thing I'd like to be clearer on: the relationship between the 'completeness' metrics and the rest of the output. For exmaple:

I have one assembly (of a genome with many gene families) that returns on the 'complete' line in .completeness_report:
#Proteins 156 (out of 248)
#Total 262 (this being the previous number plus orthologs)

However, the number of seqs in the corresponding .fa file is 205 -- more than #proteins but less than #total. Each KOG ID in .fa is unique, so I presume the reason .fa contains more sequences than #proteins is that the matches are to the set of 458 CEGS, not just the most conserved set of 248 CEGS.

However, when I count the KOG IDs in .fa that match the 248 KOG IDs (which I gleaned from CEGMA/data/completeness_cutoff.tbl) , I get 105 matches, not 156. So 51 of the matches to the 248 CEGs did not end up in .fa.

I am guessing this difference is because determining 'completeness' is a separate process and intent from determining the set to be used for gene finding (the .fa file). Can you clarify?

Last edited by ssully; 09-19-2014 at 03:51 PM.
ssully is offline   Reply With Quote
Old 09-22-2014, 08:19 AM   #4
kbradnam
Member
 
Location: Davis, CA

Join Date: May 2011
Posts: 53
Default

Basically, all output from CEGMA apart from the output.completeness_report file is referring to the larger set of 458 CEGs. The last file generated by CEGMA is output.completeness_report and that is the only place where results for the subset of 248 CEGs are calculated.

Your last point is mostly correct: the 248 CEGs are based on different filtering criteria than the 458 CEGs, but I think that everything counted in the 248 results should be in the 458 results (but not vice versa).

Are you trying to exact pattern matching? Note that KOG IDs get various numerical suffixes added during the CEGMA run. E.g. an ortholog of KOG0062 might be detected in your input file but CEGMA might give this a name of KOG0062.5 (I won't go into the reasons why this is the case). Might this explain your discrepancy in numbers?
kbradnam is offline   Reply With Quote
Old 09-22-2014, 09:21 AM   #5
kbradnam
Member
 
Location: Davis, CA

Join Date: May 2011
Posts: 53
Default

I've expanded in some more details on your questions in a blog post, which includes more information about functional annotations of KOGs in particular:

http://www.acgt.me/blog/2014/9/22/an...and-458-vs-248
kbradnam is offline   Reply With Quote
Old 09-22-2014, 02:53 PM   #6
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

Quote:
Originally Posted by kbradnam View Post
Basically, all output from CEGMA apart from the output.completeness_report file is referring to the larger set of 458 CEGs. The last file generated by CEGMA is output.completeness_report and that is the only place where results for the subset of 248 CEGs are calculated.

Your last point is mostly correct: the 248 CEGs are based on different filtering criteria than the 458 CEGs, but I think that everything counted in the 248 results should be in the 458 results (but not vice versa).

Are you trying to exact pattern matching? Note that KOG IDs get various numerical suffixes added during the CEGMA run. E.g. an ortholog of KOG0062 might be detected in your input file but CEGMA might give this a name of KOG0062.5 (I won't go into the reasons why this is the case). Might this explain your discrepancy in numbers?

I did remove all the KOG suffixes before pattern matching. ..so the disparity isn't due to that. It really appears that the .completeness_report KOGs are not fully included within the .fa file KOGS.

Can you replicate that in your own results?
ssully is offline   Reply With Quote
Old 09-22-2014, 03:07 PM   #7
kbradnam
Member
 
Location: Davis, CA

Join Date: May 2011
Posts: 53
Default

I don't see this:

$ cut -f 1 -d " " completeness_cutoff.tbl > 248_CEG_ids.txt
$ grep ">" kogs.fa | sed 's/.*___//' | sort -u > 458_CEG_ids.txt
$ grep -f 248_CEGs_ids.txt 458_CEG_ids.txt | wc -l

The last step reports a count of 248. I.e. all CEG IDs present in completeness_cutoff.tbl are present in logs.fa.
kbradnam is offline   Reply With Quote
Old 09-23-2014, 09:14 AM   #8
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

Sorry, I wasn't really referring to the two /data files (kogs.fa and completeness_cutoff.tbl) I was referring to my output files. Let's call them my.completeness_report and my.cegma.fa

we've established that
-my.completeness_report reports matches to a master set of 248 CEGs (listed in completeness_cutoff.tbl).
-my.cegma.fa reports matches to a master set of 458 CEGs (listed in kogs.fa)
-The smaller master set is wholly contained within the larger master set


In my CEGMA run, I got these results
156 CEGs reported in my.completeness_report
205 CEGS reported in my.cegma.fa

I assumed that all 156 CEGS in the first set would be in the second set of 205 too.

But this is not the case.

If I compare the 156 CEG IDs (which are not explicitly reported, but can be derived from the complementary list of NON-matches in my.compleness_report) to the 205 CEG IDs in my.cegma.fa, I find just 101 CEGs shared by both.

If I compare the master set of 248 CEG IDs in /data/completeness_cutoff.tbl to the 205 CEG IDs in my.cegma.fa, I find 105 CEGs shared by both. (i.e., my.cegma.fa not only does not contain all 156 from the 248 masterset, it is also reporting 4 novel CEGs from that masterset!)


So I conclude that the my.completeness_report number (156 hits to a master set of 248 CEGs) is derived very differently from, and independently of, the set reported to me in my.cegma.fa (205 hits to a master set of 458 CEGs).

Last edited by ssully; 09-23-2014 at 10:30 AM.
ssully is offline   Reply With Quote
Old 09-23-2014, 09:19 AM   #9
kbradnam
Member
 
Location: Davis, CA

Join Date: May 2011
Posts: 53
Default

Hmm. That is odd. This is part of CEGMA that I don't fully understand myself (I didn't write the code) and this is not something I can easily check I'm afraid. We have no current funding to support or develop CEGMA (hopefully this might one day change).
kbradnam is offline   Reply With Quote
Old 09-23-2014, 11:32 AM   #10
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

Perhaps if you do one of your own CEGMA runs, try checking as I did and see if you get incompletely overlapping results in those two outputs.
ssully is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:53 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO