SEQanswers gene ontology over-representation of differentially expressed genes

10-20-2011, 05:58 AM   #1
damiankao
Member

Location: UK

Join Date: Jan 2010
Posts: 49

gene ontology over-representation of differentially expressed genes

I am pretty new to the statistical methods used to calculate the probability of over-representation, so bear with me. I've been reading this for an intro on this topic: http://users.unimi.it/marray/2007/ma...4/Lecture7.pdf

My main question is: do we care about the probability of GO over-representation in differentially expressed genes?

The p-value after an over-representation analysis is the chance that the GO term appearing in my sub-list is due to random chance. But what does "random chance" mean in the context of differential expression lists? The chance that the gene is not differentially expressed? The chance that the GO assignment was wrong?

Example case:
-I have a total gene set of 10,000 genes.
-500 of those genes have the "cell cycle" GO term.
-I have a list of 200 differentially expressed genes and 10 of them are cell cycle.

If I do a simple hypergeometric test in R (according to the presentation I linked above) with:

phyper(9, 500, 10000-500, 200, lower.tail=FALSE)

I get a pretty bad p-value of: 0.55

So that p-value is telling me that the 10 cell cycle genes in my differentially expressed list are not very significant. If we randomly draw 200 genes from the pool of 10,000 genes, the chance of getting 10 or more cell cycle genes is 0.55.

What does the significance really mean in this context? Is there a good chance that the 10 cell cycle genes really weren't differentially expressed in the first place? What if the 10 cell cycle genes had a very high significance in my differential expression analysis?

10-20-2011, 07:40 AM   #2
lmc
Junior Member

Location: USA

Join Date: Jun 2010
Posts: 6

It means that there is no significant enrichment of cell cycle genes in your subset of genes. As you stated, the number of cell cycle genes in your subset is what you would expect if you randomly selected 200 genes from the set of 10,000. In other words, the proportion of genes with a "cell cycle" annotation in your subset of genes is similar to (actually, in this case it is exactly the same as) the proportion of genes with a "cell cycle" annotation in the whole gene set.

10/200 = 0.05
500/10000 = 0.05

Any type of over-representation analysis has no relation to whether an individual gene is differentially expressed. This type of analysis assumes that the genes in your differentially expressed gene list are actually differentially expressed and that the annotations are correct. The results indicate only whether your subset of genes has more of any given annotation than you would expect by chance.

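A minimal sketch of the comparison described in post #2, using the numbers from the example in post #1. The variable names are illustrative only, and the one-sided Fisher's exact test is included on the assumption that the reader wants to see the same upper-tail calculation expressed as a 2x2 table.

```r
# Numbers from the example in post #1
total_genes <- 10000   # all genes in the background set
cc_total    <- 500     # genes annotated "cell cycle"
de_genes    <- 200     # differentially expressed (DE) genes
cc_in_de    <- 10      # DE genes annotated "cell cycle"

# Proportion of cell cycle genes in the DE list vs. the background
prop_de         <- cc_in_de / de_genes
prop_background <- cc_total / total_genes

# Count expected in the DE list if annotation and DE status were unrelated
expected_cc <- de_genes * prop_background

# Upper-tail hypergeometric probability of seeing 10 or more
p_hyper <- phyper(cc_in_de - 1, cc_total, total_genes - cc_total,
                  de_genes, lower.tail = FALSE)

# The same question as a one-sided Fisher's exact test on the 2x2 table
# (rows: annotation, columns: DE status; matrix() fills column by column)
other_in_de <- de_genes - cc_in_de
tab <- matrix(c(cc_in_de,            other_in_de,
                cc_total - cc_in_de, total_genes - cc_total - other_in_de),
              nrow = 2,
              dimnames = list(c("cell cycle", "other"), c("DE", "not DE")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(prop_de = prop_de, prop_background = prop_background,
        expected = expected_cc, p_hyper = p_hyper, p_fisher = p_fisher))
```

Both proportions come out equal and the observed count matches the expected count, which is why the p-value is nowhere near significant.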
10-20-2011, 08:28 AM   #3
damiankao
Member

Location: UK

Join Date: Jan 2010
Posts: 49

Quote:
 Any type of over-representation analysis has no relation to whether an individual gene is differentially expressed.
Thanks. So there really is no point in using an over-representation analysis on differentially expressed genes. It doesn't tell you anything in relation to differential expression.

10-20-2011, 08:35 AM   #4
lmc
Junior Member

Location: USA

Join Date: Jun 2010
Posts: 6

Yup, that's correct.

10-20-2011, 08:37 AM   #5
lmc
Junior Member

Location: USA

Join Date: Jun 2010
Posts: 6

Although, if you want to know whether you have an enrichment of a group of genes with a specific function in your subset of differentially expressed genes, then it may be useful to you.

10-20-2011, 08:40 AM   #6
chadn737
Senior Member

Location: US

Join Date: Jan 2009
Posts: 392

Why is there no point? Looking at over-representation can tell you a lot about what is going on in the data. If a particular GO category is over-represented, then that process is likely to be particularly important under the conditions you are testing. The fact that cell cycle is not over-represented tells you something as well: maybe there is not much changing in relation to cell cycle under your conditions. Looking for enrichment has been valuable and informative in my own work.

10-20-2011, 08:50 AM   #7
lmc
Junior Member

Location: USA

Join Date: Jun 2010
Posts: 6

I think damiankao was trying to use enrichment analysis to determine the probability that an individual gene is differentially expressed. But, of course, enrichment analysis has no relation to whether or not an individual gene is differentially expressed. So, in this particular context, there is no point in performing this type of analysis.

10-20-2011, 09:13 AM   #8
damiankao
Member

Location: UK

Join Date: Jan 2010
Posts: 49

I guess I am trying to point out that over-representation analysis gives you significance relative to random chance. In the case of differential expression, there is no random chance because we are already assuming the list is correct. We are not getting significance values relative to all possible configurations of the differentially expressed list, because there is only one list.

In my example with cell cycle, my differentially expressed gene list has under-representation of cell cycle. What does that mean really? Under random conditions, the probability of seeing a cell cycle gene is 0.05. Are we assuming that between my two conditions, the probability of seeing a cell cycle gene differentially expressed is also 0.05?

10-20-2011, 02:42 PM   #10
damiankao
Member

Location: UK

Join Date: Jan 2010
Posts: 49

Quote:
I understand how the test works. I guess my question is: is the p-value you obtain from this test useful?

In my example, there are 500 genes out of 10,000 genes that have cell cycle GO term. So the probability of getting a cell cycle gene from randomly picking a gene is 500 / 10,000 = 0.05.

So if I pick 200 genes randomly, I should be able to get 10 just by chance. So anything significantly above or below that would tell me if the term is over or under represented.
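The "just by chance" baseline can be checked by simulation. This is only a sketch under the thread's example numbers; the seed, the number of repeated draws, and the variable names are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

# Background of 10,000 genes, 500 of them flagged as "cell cycle"
is_cell_cycle <- rep(c(TRUE, FALSE), times = c(500, 9500))

# Draw 200 genes at random (without replacement) many times and
# count the cell cycle genes in each draw
n_draws   <- 10000
cc_counts <- replicate(n_draws, sum(sample(is_cell_cycle, 200)))

mean_count  <- mean(cc_counts)         # average count per random draw
p_simulated <- mean(cc_counts >= 10)   # share of draws with 10 or more
p_exact     <- phyper(9, 500, 9500, 200, lower.tail = FALSE)

print(c(mean_count = mean_count, p_simulated = p_simulated, p_exact = p_exact))
```

The simulated share of draws with 10 or more cell cycle genes should land close to the `phyper` result, since the hypergeometric test is the exact version of this random-draw experiment.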

But with differential expression lists, I am not picking 200 genes randomly. I have 200 genes that I've established to be differentially expressed between two conditions by whatever test I've conducted previously. Can we really say the probability of getting a cell cycle gene in this differentially expressed gene list is 0.05 if we are not randomly choosing genes?

Let's say I am comparing two samples: normal sample vs irradiated sample. Irradiation usually screws up cell proliferation. So we expect a lot of genes involved in cell cycle to be down-regulated after irradiation.

Out of 500 possible cell cycle genes in a pool of 10,000, we picked up 300 in our differentially down-regulated list of 400 genes. The p-value for this hypergeometric test would be pretty good.
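For a sense of scale, the irradiation example can be run through the same `phyper` call. This is a rough sketch with illustrative variable names; it asks for the result on the log scale (`log.p = TRUE`) on the assumption that the tail probability is too small to print usefully otherwise.

```r
total_genes <- 10000
cc_total    <- 500   # cell cycle genes in the background
down_genes  <- 400   # genes in the down-regulated list
cc_in_down  <- 300   # cell cycle genes in that list

# Count expected if the list were a random draw from the background
expected_cc <- down_genes * cc_total / total_genes

# Upper-tail probability of 300 or more, as a natural log
log_p <- phyper(cc_in_down - 1, cc_total, total_genes - cc_total,
                down_genes, lower.tail = FALSE, log.p = TRUE)

# Convert to a base-10 exponent for easier reading
log10_p <- log_p / log(10)

print(c(expected = expected_cc, observed = cc_in_down, log10_p = log10_p))
```

With 300 observed against 20 expected, the base-10 exponent is a very large negative number, which is the "pretty good" p-value described above.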

Under the assumption that 0.05 (500 / 10,000) is the probability of getting a cell cycle gene by chance, we get a good p-value. But since we are introducing two conditions into our picking of the list, we are not picking by chance. The probability of getting a cell cycle gene under our two conditions should probably be higher, making the p-value worse than it actually is.

I have no idea how to adjust the probability based on the conditions picked. I am just not sure whether using a random probability on a non-randomly picked list will tell us anything meaningful.

Sorry if it's a naive thought. Perhaps I am just over-thinking it.

10-20-2011, 02:57 PM   #11
chadn737
Senior Member

Location: US

Join Date: Jan 2009
Posts: 392

Quote:
 Originally Posted by damiankao But since we are introducing two conditions into our picking of the list, we are not picking by chance. The probability of getting a cell cycle gene under our two conditions should probably be higher, making the p-value worse than it actually is. I have no idea how to adjust the probability based on the conditions picked. I am just not sure whether using a random probability on a non-randomly picked list will tell us anything meaningful.
I'm not certain this fully addresses your question, but consider more carefully what anc327 said:
Quote:
 Originally Posted by anc327 there are two sets of p-values to consider carefully. the first is often a t-test p-value cutoff used to define the differentially exp genes (the alpha). and whether or not you use a multiple test correction on top of the raw p-value
There is still some error in picking the genes that are differentially expressed. While that error may be low (<5%), let's just assume that 5% of your differentially expressed genes are false positives. I think one thing the p-value for the term enrichment addresses is the error that will be introduced by false positives in your differentially expressed genes.

That's me speaking as a non-statistician, so I have no idea if I am right in this or not.
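To illustrate the point quoted from anc327 about the first set of p-values, here is a small sketch of how the differentially expressed list can change once a multiple-test correction is applied on top of the raw p-values. The per-gene p-values and gene names are made up for illustration, and Benjamini-Hochberg is just one possible correction, not necessarily the one anyone in the thread used.

```r
# Made-up raw per-gene p-values (hypothetical, for illustration only)
raw_p <- c(gene_a = 0.0004, gene_b = 0.012, gene_c = 0.04,
           gene_d = 0.20,   gene_e = 0.55)

alpha <- 0.05  # cutoff used to define the DE list

# DE list from the raw p-values
de_raw <- names(raw_p)[raw_p < alpha]

# Benjamini-Hochberg adjustment, then the same cutoff
adj_p  <- p.adjust(raw_p, method = "BH")
de_adj <- names(raw_p)[adj_p < alpha]

print(de_raw)
print(de_adj)
```

The list that goes into the enrichment test depends on this choice, so the enrichment p-value is always conditional on how the differentially expressed genes were defined.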