SEQanswers

Old 01-20-2012, 01:01 PM   #1
polsum
Member
 
Location: Texas

Join Date: May 2009
Posts: 32
edgeR: How important is the FDR value?

Hi,

I was wondering if anyone can answer my question about FDR.

How exactly should we determine the cutoff for the FDR value? Is 0.1 acceptable, or 0.2? The number of significantly differentially expressed genes changes dramatically with even the slightest change in the FDR cutoff. For a publication, what is a good FDR?

If I select P.value < 0.05 and ignore FDR, I get around 200 differentially expressed genes. But if I use FDR < 0.085 along with P.value < 0.05, the number drops to 65. Can we publish without FDR?

Thanks in advance.
Old 01-20-2012, 01:30 PM   #2
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699

So, you mean "if ((FDR < 0.085) OR (p.value < 0.05))"?
Sounds like you're fishing for the cutoff that includes the really cool thing you want to show.

You'd probably want FDR < 0.05.
Old 01-20-2012, 01:36 PM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

I feel that this is an appropriate contribution:

http://xkcd.com/882/
Old 01-20-2012, 01:57 PM   #4
polsum
Member
 
Location: Texas

Join Date: May 2009
Posts: 32

Quote:
Originally Posted by Richard Finney
So, you mean "if ((FDR < 0.085) OR (p.value < 0.05))"?
Sounds like you're fishing for the cutoff that includes the really cool thing you want to show.

You'd probably want FDR < 0.05.
No, I meant "if ((FDR < 0.085) AND (p.value < 0.05))". No "fishing" or "fishy" business here. I am genuinely curious: how would anyone set the cutoff? There is no consensus in the published literature either.


@swbarnes2 - thanks for the link.
Old 01-20-2012, 02:28 PM   #5
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699

You'll want to Bonferroni-adjust your p-values or use FDR.
Stick with < 0.05 for FDR.
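
The difference between the two adjustments can be seen in a minimal Python sketch. The p-values are made up for illustration, and the helper names `bonferroni` and `benjamini_hochberg` are hypothetical; the second is meant to mirror what `p.adjust(p, method = "BH")` computes in R.

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [1.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for eight hypothetical genes.
raw = [0.001, 0.008, 0.012, 0.02, 0.04, 0.3, 0.6, 0.9]

n_raw = sum(p < 0.05 for p in raw)
n_bonferroni = sum(p < 0.05 for p in bonferroni(raw))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(raw))

print("raw p < 0.05:      ", n_raw)
print("Bonferroni < 0.05: ", n_bonferroni)
print("BH-adjusted < 0.05:", n_bh)
```

With these numbers, five genes pass the raw cutoff, four pass the BH cutoff and one passes the Bonferroni cutoff. An adjusted value is never smaller than the raw p-value, so any gene that passes an adjusted cutoff of 0.05 also has a raw p-value below 0.05.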
Old 01-21-2012, 07:52 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

The good thing about the false discovery rate (FDR) is that it has a clear, easily understandable meaning. If you cut at an FDR value of 0.1 (10%), your list of significant hits has (in expectation) at most 10% false positives. So, if you get 60 genes with an FDR-adjusted p value below 10%, you should expect this list to contain at most around 6 false ones.

The reason that there is no consensus on which FDR level to choose is that it is not too much to ask to make an informed case-by-case decision about what FDR might be acceptable for a given experiment, depending on the kind of conclusions one wishes to draw.

And just as a reminder: Don't even think about thresholding the raw p values in genomic experiments. This is nearly always nonsense, and I wish editors would make it a rule to simply reject papers doing that immediately instead of waiting for the referees to spot it.
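
What "at most 10% false positives in expectation" looks like can be checked with a small simulation. This is only a sketch under stated assumptions: the split into 9,000 null and 1,000 truly changed genes, the way the non-null p-values are generated, and the helper `benjamini_hochberg` are all invented for the demonstration.

```python
import random


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [1.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


random.seed(1)  # fixed seed so the sketch is repeatable

n_null, n_true = 9000, 1000
# Null genes: p-values are uniform on [0, 1].
pvals = [random.random() for _ in range(n_null)]
# Truly changed genes: p-values pushed towards 0 (an arbitrary choice for the demo).
pvals += [random.random() ** 6 for _ in range(n_true)]
is_null = [True] * n_null + [False] * n_true

padj = benjamini_hochberg(pvals)
hits = [i for i, q in enumerate(padj) if q < 0.1]
false_hits = sum(is_null[i] for i in hits)
fdp = false_hits / len(hits) if hits else 0.0

print(f"{len(hits)} hits at FDR 10%, {false_hits} of them false ({fdp:.1%})")
```

The fraction of false hits in any single run fluctuates; the FDR guarantee is about its average over many such experiments.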
Old 01-21-2012, 08:21 AM   #7
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 310

Quote:
Originally Posted by Simon Anders
And just as a reminder: Don't even think about thresholding the raw p values in genomic experiments. This is nearly always nonsense, and I wish editors would make it a rule to simply reject papers doing that immediately instead of waiting for the referees to spot it.
First off, excuse my ignorance of statistics... but I'm trying to get better. So here's the stupid question:

So why is it bad to threshold raw p-values? I always threshold FDR just because it makes more sense from my simplistic viewpoint.

Off topic: agreed, editors should have a list of stuff that is just not allowed. My personal favorites: quantitating Western blots with no standard curve, and ChIPs where IgG is the only negative control.
__________________
--------------
Ethan
Old 02-09-2012, 01:19 PM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

Quote:
Originally Posted by ETHANol
So why is it bad to threshold raw p-values?
Didn't see the question until now, but to not leave it unanswered:

Imagine your genome has 10,000 genes. You think that some of them are differentially expressed, but, in reality, none of them is. You cut your p values at 0.05.

Now remember the definition of a p value: If a test result is assigned the p value p, the probability of seeing a result this strong or stronger only due to noise (i.e., with there being no real effect) is p.

Hence, even if no genes are differentially expressed, you should expect about 5% of the genes to have a p value below 5%. For 10,000 genes, that is about 500.

Now, let's assume there are truly differentially expressed genes in your study. Let's say you find 1,000 of your 10,000 genes to have a raw p value below 5%. From the argument above, you should still expect this list of 1,000 genes to contain up to about 500 false positives, i.e., your false discovery rate may be as high as 500/1,000 = 50%. This is clearly unacceptably large.

The Benjamini-Hochberg adjustment, which formalizes this argument, will hence adjust a raw p value of 0.05 to an adjusted p value of 0.5. In practice, you use the logic the other way round: you decide on a false discovery rate that you deem acceptable, and look up which genes got an adjusted value below this.
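
The 10,000-gene thought experiment is easy to reproduce. The following Python sketch simulates an experiment with no real signal and repeats the arithmetic from the example; the seed, the variable names and the helper `benjamini_hochberg` are assumptions made for illustration.

```python
import random


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [1.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


random.seed(0)
n_genes = 10_000
# No gene is truly differentially expressed, so every p-value is uniform on [0, 1].
null_pvals = [random.random() for _ in range(n_genes)]

n_raw_hits = sum(p < 0.05 for p in null_pvals)
n_bh_hits = sum(q < 0.05 for q in benjamini_hochberg(null_pvals))
print("raw p < 0.05:      ", n_raw_hits)  # close to 5% of 10,000
print("BH-adjusted < 0.05:", n_bh_hits)

# Worked numbers from the example: 1,000 of 10,000 genes have raw p < 0.05.
n_called = 1_000
expected_false = 0.05 * n_genes          # about 500 expected by chance alone
implied_fdr = expected_false / n_called  # about 0.5
print("implied FDR:", implied_fdr)
```

The raw threshold reports hundreds of "hits" from pure noise, while the adjusted threshold reports few or none. The last lines are the same ratio the BH formula uses: p times the number of genes, divided by the rank of the gene in the sorted list.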

Last edited by Simon Anders; 02-09-2012 at 01:21 PM.
Old 02-13-2012, 05:56 AM   #9
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 310

Simon, thank you very much for the lesson. I'm trying to become more statistically literate at the moment. I didn't appreciate the difference between p-value and FDR.

BTW, the DESeq Bioconductor vignette is one of the few Bioconductor vignettes that make any sense whatsoever.
__________________
--------------
Ethan
Old 01-27-2014, 03:23 AM   #10
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266

That was indeed clarifying!

I agree on the DESeq2 manual, but my personal favourite is still the edgeR manual; there can never be too many examples or simple explanations, for my taste.
Old 01-27-2014, 04:34 AM   #11
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266

"make an informed case-by-case decision what FDR might be acceptable for a given experiment"

Could you please elaborate?

How about this case:

You run DESeq2 and pick out 10 genes you want to look at, including their p values.
Say 6 genes have p < 0.05.
You then use p.adjust in R.
What FDR do you choose and why?
Which n do you set?

Last edited by sindrle; 01-27-2014 at 06:03 AM.
Old 01-27-2014, 09:55 AM   #12
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250

Quote:
Originally Posted by polsum
Hi,

I was wondering if anyone can answer my question about FDR.

How exactly should we determine the cutoff for the FDR value? Is 0.1 acceptable, or 0.2? The number of significantly differentially expressed genes changes dramatically with even the slightest change in the FDR cutoff. For a publication, what is a good FDR?

If I select P.value < 0.05 and ignore FDR, I get around 200 differentially expressed genes. But if I use FDR < 0.085 along with P.value < 0.05, the number drops to 65. Can we publish without FDR?

Thanks in advance.
I don't think FDR is very important for RNA-seq. For multiple hypothesis tests where each test has uniform variance and is sufficiently powered, FDR might be OK. For count data, however, FDR doesn't take into account that many of the tests were negative due to insufficient coverage rather than because there was no real difference to detect, so FDR is confounded by the sampling methodology. For example, suppose you had sampled 1,000 genes, the null hypothesis was rejected for 50, and of the other 950, 80% had low coverage. In theory, you could sequence more from the same samples, and then some of that 80% could become significant, which doesn't make sense from an FDR standpoint. I think this also applies to Bonferroni. On the flip side, you could get a fabulous FDR, by simply not sequencing very much.
Old 01-27-2014, 10:47 AM   #13
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479

@rskr: Given that this is in the context of DESeq2 (I realize that the thread is titled with edgeR...), low-count genes are automatically dropped and power maximized (I have to admit that it's handy to not have to do this myself anymore). So the critique about low-coverage genes screwing up the p-values doesn't apply.

@sindrle: The informed decision is basically short-hand for what you want to do downstream (at least that's what I would mean had I written that...perhaps Simon means something else). If you're just interested in generally describing broad changes (e.g. in enriched GO terms) then you can be a bit more lax with the adjusted p-value cutoff. If, on the other hand, you're going to generate a bunch of transgenic mice or start a large-scale drug screen (i.e., your next step involves large amounts of time/money), then you really really need to be positive that you're not following up a spurious result. In those cases, you'd use a much lower adjusted p-value threshold. A bit of understanding of the underlying biology can also help make an informed decision here.

Other considerations could be:
1) How many hits did you find at a given threshold and how many did you expect (given preliminary data or published literature)?
2) If there are known changes, how many of those did you get at a given threshold?
3) Do you lack ethics and just want to make a nice, but likely false, story to publish in Science/Cell/Nature? Then just use raw p-values (or "better" yet, fold-changes!) and request reviewers who only understand Western blots.
Old 01-27-2014, 10:51 AM   #14
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266

Edited, look below.

Last edited by sindrle; 01-27-2014 at 11:05 AM. Reason: Wrong quote..
Old 01-27-2014, 11:05 AM   #15
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266

Quote:
Originally Posted by dpryan
@rskr: Given that this is in the context of DESeq2 (I realize that the thread is titled with edgeR...), low-count genes are automatically dropped and power maximized (I have to admit that it's handy to not have to do this myself anymore). So the critique about low-coverage genes screwing up the p-values doesn't apply.
Do you know how to do this in edgeR?


Quote:
Originally Posted by dpryan
3) Do you lack ethics and just want to make a nice, but likely false, story to publish in Science/Cell/Nature? Then just use raw p-values (or "better" yet, fold-changes!) and request reviewers who only understand Western blots.
Thanks for the tip, I'll go for this one.
Old 01-27-2014, 11:07 AM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

Quote:
Originally Posted by sindrle
You run DESeq2 and pick out 10 genes you want to look at, including their p values.
Say 6 genes have p < 0.05.
You then use p.adjust in R.
What FDR do you choose and why?
Which n do you set?
If you pick the ten genes a priori, i.e., in a manner that is independent of the outcome, then you can run p.adjust only on the p values from these 10 genes.

By a choice "a priori", I mean that you knew before doing the analysis that these genes are worth looking at and others are not. If, however, you have chosen these ten genes precisely because their expression data in this very experiment looked so interesting that you want them to be in your result list, then you need to run p.adjust on all genes.

In the former case, you only wanted to look at these genes, so your test only has to reject the null hypothesis that precisely these genes seem to have a signal that looks interesting but arose only due to chance. In the latter case, you have to reject the null hypothesis that somewhere in your data with its many genes, some of which will show strong signals merely due to chance fluctuations, there will be ten genes which look so far out as to appear interesting. As this is much more likely to happen if it may be any 10 genes rather than a fixed set of 10 genes, the signal has to be stronger to convince us that it is not mere chance. Hence the more stringent multiple-testing adjustment.
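
The difference between the two cases shows up clearly on data that contain no real signal. The Python sketch below is an illustration only: the gene indices, the seed and the helper `benjamini_hochberg` (a stand-in for R's `p.adjust(p, method = "BH")`) are assumptions, and "the first ten genes" simply stands in for a list fixed before looking at the data.

```python
import random


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [1.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


random.seed(2)
n_genes = 10_000
# Simulated experiment with no real signal: all p-values are uniform noise.
pvals = [random.random() for _ in range(n_genes)]

# Case 1: ten genes chosen a priori, so adjusting over these ten is valid.
a_priori = list(range(10))
padj_a_priori = benjamini_hochberg([pvals[i] for i in a_priori])

# Case 2: ten genes picked because they look best in this very experiment.
cherry_picked = sorted(range(n_genes), key=lambda i: pvals[i])[:10]
# Wrong: adjusting only over the ten picked genes ignores the selection step.
padj_wrong = benjamini_hochberg([pvals[i] for i in cherry_picked])
# Right: adjust over all genes, then look up the ten of interest.
padj_all = benjamini_hochberg(pvals)
padj_right = [padj_all[i] for i in cherry_picked]

n_wrong = sum(q < 0.05 for q in padj_wrong)
n_right = sum(q < 0.05 for q in padj_right)
print("a priori genes passing:", sum(q < 0.05 for q in padj_a_priori))
print("cherry-picked, adjusted over 10 genes: ", n_wrong)
print("cherry-picked, adjusted over all genes:", n_right)
```

Adjusting over only the ten cherry-picked genes makes pure noise look significant, because the ten smallest of 10,000 uniform p-values are all tiny. Adjusting over all genes removes most or all of these spurious hits.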
Old 01-27-2014, 11:09 AM   #17
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479

Quote:
Originally Posted by sindrle
Do you know how to do this in edgeR?
See the "genefilter" package for some useful functions.
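
The general idea of filtering before adjusting can be sketched in Python. This is not the genefilter or DESeq2 implementation: the cutoff `MIN_MEAN_COUNT`, the simulated mean counts and p-values, and the helper `benjamini_hochberg` are all assumptions for illustration. The key requirement is that the filter statistic (here the overall mean count, ignoring group labels) does not depend on the test outcome for genes without a real change.

```python
import random


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [1.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


random.seed(3)
MIN_MEAN_COUNT = 10  # hypothetical cutoff, chosen only for this demo

genes = []  # (mean_count, p_value) pairs
# 18,000 low-count genes: too few reads to detect anything, so p is pure noise.
for _ in range(18_000):
    genes.append((random.uniform(0, 9), random.random()))
# 1,600 well-covered genes without a real change.
for _ in range(1_600):
    genes.append((random.uniform(10, 1000), random.random()))
# 400 well-covered genes with a real change: p-values pushed towards 0.
for _ in range(400):
    genes.append((random.uniform(10, 1000), random.random() ** 6))

# Without filtering: adjust over all 20,000 genes.
all_p = [p for _, p in genes]
hits_unfiltered = sum(q < 0.1 for q in benjamini_hochberg(all_p))

# With filtering: drop low-count genes first, then adjust over the rest.
kept_p = [p for mean_count, p in genes if mean_count >= MIN_MEAN_COUNT]
hits_filtered = sum(q < 0.1 for q in benjamini_hochberg(kept_p))

print("genes kept after filtering:", len(kept_p))
print("hits at FDR 10% without filtering:", hits_unfiltered)
print("hits at FDR 10% with filtering:   ", hits_filtered)
```

Dropping genes that cannot reach significance shrinks the multiple-testing burden, so the same FDR cutoff typically yields more hits among the genes that remain.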
Old 01-27-2014, 11:11 AM   #18
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

Quote:
Originally Posted by rskr
I don't think FDR is very important for RNA-seq.
Sorry, I cannot let this stand like this, as it might be misunderstood to mean that accounting for multiple hypothesis testing is optional in RNA-Seq data analysis. Of course, you always need to account for multiple hypothesis testing when you test many hypotheses (here: many genes).

Quote:
On the flip side, you could get a fabulous FDR, by simply not sequencing very much.
Um, no, you don't. Why should you?
Old 01-27-2014, 11:15 AM   #19
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by dpryan
@rskr: Given that this is in the context of DESeq2 (I realize that the thread is titled with edgeR...), low-count genes are automatically dropped and power maximized (I have to admit that it's handy to not have to do this myself anymore). So the critique about low-coverage genes screwing up the p-values doesn't apply.
Low-coverage genes can still be significant, just not at the same rate as higher-coverage genes. It may be possible to filter out certain genes which have zero chance of being significant; however, the power to tell depends on the proportion as well as the coverage. So, as I said, FDR isn't so important.
Old 01-27-2014, 11:17 AM   #20
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

Quote:
Originally Posted by rskr
[...], so as I said FDR isn't so important.
So, what do you suggest to do instead?

All times are GMT -8.