SEQanswers

Old 10-20-2011, 05:58 AM   #1
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49
gene ontology over-representation of differentially expressed genes

I am pretty new to the statistical methods used in calculating the probability of over-representation, so bear with me. I've been reading this for an intro on this topic:
http://users.unimi.it/marray/2007/ma...4/Lecture7.pdf

My main question is do we care about the probability of GO over-representation in differentially expressed genes?

The p-value after an over-representation analysis is the chance that the GO term appearing in my sub-list is due to random chance. But what does "random chance" mean in the context of differential expression lists? The chance that the gene is not differentially expressed? The chance that the GO assignment was wrong?

Example case:
-I have a total gene set of 10,000 genes.
-500 of those genes have "cell cycle" GO term.
-I have a list of 200 differentially expressed genes and 10 of them are cell cycle.

If I do a simple hypergeometric test in R (according to the presentation I linked above) with:
phyper(9, 500, 10000-500, 200, lower.tail=FALSE)
I get a pretty bad p-value of: 0.55

So that p-value is telling me the 10 genes that are cell cycle in my differentially expressed list are not very significant. If we randomly draw 200 genes from the pool of 10,000 genes, the chance of getting 10 or more cell cycle genes is 0.55.

What does the significance really mean in this context? There is a good chance that the 10 cell cycle genes really weren't differentially expressed in the first place? What if the 10 cell cycle genes had a very high significance in my differential expression analysis?
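
For reference, a minimal sketch of the same test with the four numbers named. The variable names are illustrative and not taken from the linked lecture. One detail worth spelling out: with lower.tail = FALSE, phyper returns P(X > q), which is why 9 rather than 10 is passed when asking about 10 or more.

```r
# Numbers from the example case above; variable names are illustrative.
n_total <- 10000  # all genes in the background set
n_cc    <- 500    # genes annotated "cell cycle"
n_de    <- 200    # differentially expressed (DE) genes
n_de_cc <- 10     # DE genes that are also "cell cycle"

# Expected number of cell cycle genes in a random draw of 200 genes
expected <- n_de * n_cc / n_total

# P(X >= 10): lower.tail = FALSE gives P(X > q), so pass q = observed - 1
p_over <- phyper(n_de_cc - 1, n_cc, n_total - n_cc, n_de, lower.tail = FALSE)

print(expected)
print(p_over)
```

The observed count equals the expected count here, so a tail probability near one half is what this kind of test should return.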
Old 10-20-2011, 07:40 AM   #2
lmc
Junior Member
 
Location: USA

Join Date: Jun 2010
Posts: 6

It means that there is no significant enrichment of cell cycle genes in your subset of genes. As you stated, the number of cell cycle genes in your subset is what you would expect if you randomly selected 200 genes from the set of 10000.
In other words, the proportion of genes with a "cell cycle" annotation in your subset of genes is similar to (actually, in this case it is exactly the same as) the proportion of genes with a "cell cycle" annotation in the whole gene set.

10/200=0.05
500/10000=0.05

Any type of over-representation analysis has no relation to whether an individual gene is differentially expressed. This type of analysis assumes that the genes in your differentially expressed gene list are actually differentially expressed and that the annotations are correct. The results indicate only whether your subset of genes has more of any given annotation than you would expect by chance.
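
The proportion comparison above can be sketched in a few lines of R. The variable names are illustrative; "fold enrichment" is simply the ratio of the two proportions and is one common way to summarize them, not something the original post computed.

```r
# Counts from the example in this thread; names are illustrative.
de_cc <- 10;  de_total <- 200     # cell cycle genes in the DE list, DE list size
bg_cc <- 500; bg_total <- 10000   # cell cycle genes in the background, background size

prop_de <- de_cc / de_total   # share of DE genes annotated "cell cycle"
prop_bg <- bg_cc / bg_total   # share of all genes annotated "cell cycle"

# Fold enrichment: a value of 1 means the subset matches the background
fold_enrichment <- prop_de / prop_bg

print(c(prop_de = prop_de, prop_bg = prop_bg, fold = fold_enrichment))
```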
Old 10-20-2011, 08:28 AM   #3
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49

Quote:
Any type of over-representation analysis has no relation to whether an individual gene is differentially expressed.
Thanks. So there really is no point in using an over-representation analysis on differentially expressed genes. It doesn't tell you anything in relation to differential expression.
Old 10-20-2011, 08:35 AM   #4
lmc
Junior Member
 
Location: USA

Join Date: Jun 2010
Posts: 6

Yup, that's correct.
Old 10-20-2011, 08:37 AM   #5
lmc
Junior Member
 
Location: USA

Join Date: Jun 2010
Posts: 6

Although, if you want to know whether you have an enrichment of a group of genes with a specific function in your subset of differentially expressed genes, then it may be useful to you.
Old 10-20-2011, 08:40 AM   #6
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392

Why is there no point? Looking at over-representation can tell you a lot about what is going on in the data. If a particular GO category is over-represented then that process is likely to be particularly important under the conditions you are testing for.

The fact that cell cycle is not over-represented tells you something as well, that maybe there is not much changing in relation to cell cycle under your conditions.

Looking for enrichment has been valuable and informative in my own work.
Old 10-20-2011, 08:50 AM   #7
lmc
Junior Member
 
Location: USA

Join Date: Jun 2010
Posts: 6

I think damiankao was trying to use enrichment analysis to determine the probability that an individual gene is differentially expressed. But, of course, enrichment analysis has no relation to whether or not an individual gene is differentially expressed. So, in this particular context there is no point to performing this type of analysis.
Old 10-20-2011, 09:13 AM   #8
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49

I guess I am trying to point out that over-representation analysis gives you significance relative to random chance.

In the case of differential expression, there is no random chance because we are already assuming the list is correct. We are not getting significance values relative to all possible configurations of the differentially expressed list, because there is only one list.

In my example, the differentially expressed gene list shows no over-representation of cell cycle. What does that mean really? Under random conditions, the probability is 0.05 to see a cell cycle gene. Are we assuming that between my two conditions, it is also 0.05 to see a cell cycle gene differentially expressed?
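
One way to make "random chance" concrete is a small simulation. The usual null hypothesis is that membership in the differentially expressed list is unrelated to carrying the GO term, so with respect to that term the list behaves like a random draw of 200 genes. A minimal sketch under that assumption follows; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# TRUE marks a "cell cycle" gene: 500 of 10,000
is_cc <- c(rep(TRUE, 500), rep(FALSE, 9500))

# Draw 200 genes at random, many times, and count cell cycle genes each time
n_sim  <- 10000
counts <- replicate(n_sim, sum(sample(is_cc, 200)))

# Fraction of random lists with at least 10 cell cycle genes
p_sim <- mean(counts >= 10)

# Exact hypergeometric tail probability for comparison
p_exact <- phyper(9, 500, 9500, 200, lower.tail = FALSE)

print(c(simulated = p_sim, exact = p_exact))
```

Read this way, the p-value does not question whether the list itself is correct. It asks how surprising the observed count would be if the list had nothing to do with the term.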
Old 10-20-2011, 01:41 PM   #9
anc327
Junior Member
 
Location: Here and there

Join Date: Oct 2011
Posts: 1

i would agree with others here who say that functional annotation (enrichment testing) of signature/diff exp gene lists is indeed useful and is actually a basic tool built into almost every exp analysis package out there (e.g. DAVID, Ingenuity, various R packages, etc.).

there are two sets of p-values to consider carefully. the first is often a t-test p-value cutoff used to define the differentially exp genes (the alpha), along with whether or not you apply a multiple-test correction on top of the raw p-value. note that some people will split the diff exp genes into up-reg'd and down-reg'd sets.

the next p-value is the Fisher's/Hypergeometric enrichment test p-value. depending on the gene list set sizes (on average 100~1000 genes) you will get a set of GO category rankings. with this hypothesis test, you will need fairly stringent cutoffs to "trust" the enrichments. and this is heavily dependent on having reasonable gene set sizes, otherwise you will get misleading enrichment annotations. multiple-test corrections are also useful for these tests.
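
A hedged sketch of ranking several GO terms and applying a multiple-test correction. The term names and counts below are made up for illustration; only the "cell cycle" row comes from this thread. p.adjust with method = "BH" (Benjamini-Hochberg) is one common choice in base R, not necessarily what any particular enrichment package uses.

```r
# Hypothetical counts: only the "cell cycle" row is from this thread.
go <- data.frame(
  term  = c("cell cycle", "DNA repair", "translation"),
  in_bg = c(500, 150, 300),  # genes with the term among the 10,000 background genes
  in_de = c(10, 12, 4),      # genes with the term among the 200 DE genes
  stringsAsFactors = FALSE
)
n_total <- 10000
n_de    <- 200

# One-sided over-representation p-value per term (phyper is vectorized)
go$p <- phyper(go$in_de - 1, go$in_bg, n_total - go$in_bg, n_de,
               lower.tail = FALSE)

# Benjamini-Hochberg adjustment across all tested terms
go$p_adj <- p.adjust(go$p, method = "BH")

# Rank terms by adjusted p-value, most significant first
go <- go[order(go$p_adj), ]
print(go)
```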

bottom line, the first task of getting diff exp genes requires one type of test and the second task of enrichment requires another. these are independent of one another but are often used in this sequence to identify potential pathways or GO lists in expression data. you may need to test various cutoffs to see how robust your enrichments are.

in your example, you are asking a basic question about what Fisher's testing is all about. there are obviously plenty of places to read up on what this is. but an intuitive interpretation is that if you randomly grabbed 200 genes out of 10000 and only got 10 that were cell cycle (the 9 in your phyper call, since lower.tail=FALSE gives P(X > 9)), by random chance you could easily get that many so you end up with an uninteresting p-value. but if you run your R test again with say 100 instead of 9, you'll get a much smaller p-value (far below 0.05) indicating that the chance you could randomly get that many cell cycle genes when grabbing 200 at a time is very small - suggesting this is statistically significant. hope that helps.
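
The comparison described above can be sketched as follows, using the thread's background of 500 cell cycle genes among 10,000. The 2x2 table layout for fisher.test is my own arrangement of the same counts; a one-sided Fisher's exact test on that table is the same calculation as the hypergeometric tail.

```r
# P(X >= observed) for 10 of 200 and for 100 of 200 cell cycle genes
p_10  <- phyper(10 - 1,  500, 9500, 200, lower.tail = FALSE)
p_100 <- phyper(100 - 1, 500, 9500, 200, lower.tail = FALSE)

# Same question as a one-sided Fisher's exact test on a 2x2 table.
# matrix() fills column by column:
#   column 1 = cell cycle genes     (in DE list, not in DE list)
#   column 2 = non-cell-cycle genes (in DE list, not in DE list)
tab <- matrix(c(10, 490,
                190, 9310),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(p_10 = p_10, p_100 = p_100, p_fisher = p_fisher))
```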

Last edited by anc327; 10-20-2011 at 03:48 PM.
Old 10-20-2011, 02:42 PM   #10
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49

Quote:
Originally Posted by anc327 View Post
i would agree with others here who say that functional annotation (enrichment testing) of signature/diff exp gene lists is indeed useful and is actually a basic analysis built into almost every exp analysis package out there (e.g. DAVID, Ingenuity, various R packages, etc.).

there are two sets of p-values to consider carefully. the first is often a t-test p-value cutoff used to define the differentially exp genes (the alpha). and whether or not you use a multiple test correction on top of the raw p-value, in addition. note that some people will split the diff exp genes as the up-reg'd or the down-reg'd sets.

the next p-value is the Fisher's/Hypergeometric enrichment test p-value. depending on the gene list set sizes (on average 100~1000 genes) you will get a set of GO category rankings. with this hypothesis test, you will need fairly stringent cutoffs to "trust" the enrichments. and this is heavily dependent on having reasonable gene set sizes, otherwise you will get misleading enrichment annotations. multiple-test corrections are also useful for these tests.

bottom line, the first task of getting diff exp genes requires one type of test and the second task of enrichment requires another. these are independent of one another but are often used in this sequence to identify potential pathways or GO lists in expression data. you may need to test various cutoffs to see how robust your enrichments are.

in your example, you are asking a basic question about what Fisher's testing is all about. there are obviously plenty of places to read up on what this is. but an intuitive interpretation is that if you randomly grabbed 200 genes out of 10000 and only got 9 that were cell cycle, by random chance you could easily get that many. but if you run your R test again with say 100 instead of 9, you'll get a much smaller p-value (< 0.05) indicating that the chance you could get 100 cell cycle genes when grabbing 200 at a time is a small probability not due to chance. hope that helps.
I understand how the test works. I guess my question is: is the p-value you obtain from this test useful?

In my example, there are 500 genes out of 10,000 genes that have cell cycle GO term. So the probability of getting a cell cycle gene from randomly picking a gene is 500 / 10,000 = 0.05.

So if I pick 200 genes randomly, I should be able to get 10 just by chance. So anything significantly above or below that would tell me if the term is over or under represented.

But with differential expression lists, I am not picking 200 genes randomly. I have 200 genes that I've established to be differentially expressed between two conditions by whatever test I've conducted previously. Can we really say the probability of getting a cell cycle gene in this differentially expressed gene list is 0.05 if we are not randomly choosing genes?

Let's say I am comparing two samples: normal sample vs irradiated sample. Irradiation usually screws up cell proliferation. So we expect a lot of genes involved in cell cycle to be down-regulated after irradiation.

Out of 500 possible cell cycle genes in a pool of 10,000, we picked up 300 in our differentially down-regulated list of 400 genes. The p-value for this hypergeometric test would be pretty good.

Under the assumption that 0.05 (500 / 10,000) is the probability of getting a cell cycle gene by chance, we get a good p-value. But since we are introducing two conditions into our picking of the list, we are not picking by chance. The probability of getting a cell cycle gene under our two conditions should probably be higher, making the p-value worse than it actually is.

I have no idea how to adjust the probability based on the conditions picked. I am just not sure whether using a random probability on a non-randomly picked list will tell us anything meaningful.

Sorry if it's a naive thought. Perhaps I am just over-thinking it.
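
For what it's worth, the irradiation example above can be run directly. A minimal sketch, assuming the same hypergeometric setup as earlier in the thread; the tail probability is so small that asking for it on the log scale (log.p = TRUE) helps avoid underflow.

```r
# Irradiation example from this post: 300 of 400 down-regulated genes are
# "cell cycle", against 500 of 10,000 genes in the background.
expected <- 400 * 500 / 10000  # cell cycle genes expected in a random list of 400

# Natural log of P(X >= 300) under random drawing
log_p <- phyper(300 - 1, 500, 9500, 400, lower.tail = FALSE, log.p = TRUE)

print(expected)
print(log_p)
```

The 0.05 here is only the baseline for a list that has nothing to do with cell cycle. A tiny p-value is evidence against that baseline, which is the outcome one would hope for if irradiation really does hit cell cycle genes.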
Old 10-20-2011, 02:57 PM   #11
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392

Quote:
Originally Posted by damiankao
But since we are introducing two conditions into our picking of the list, we are not picking by chance. The probability of getting a cell cycle gene under our two conditions should probably be higher, making the p-value worse than it actually is.

I have no idea how to adjust the probability based on the conditions picked. I am just not sure whether using a random probability on a non-randomly picked list will tell us anything meaningful.
I'm not certain this fully addresses your question, but consider more carefully what anc327 said:
Quote:
Originally Posted by anc327
there are two sets of p-values to consider carefully. the first is often a t-test p-value cutoff used to define the differentially exp genes (the alpha). and whether or not you use a multiple test correction on top of the raw p-value
There is still some error in picking the genes that are differentially expressed. While that error may be low (<5%), let's just assume that 5% of your differentially expressed genes are false positives. I think one thing the p-value for the term enrichment addresses is the error that will be introduced by false positives in your differentially expressed genes.

That's me speaking as a non-statistician, so I have no idea if I am right in this or not.

All times are GMT -8.