SEQanswers

Old 11-04-2013, 07:07 AM   #1
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default R mateys! Which is Best FDR ?

This is sort of a general question: what is the best FDR (false discovery rate) tool?

I'm not sure which algorithm or "flavor" or whatever is considered "best".

I note this page: http://www.strimmerlab.org/notes/fdr.html which lists fdrtool, mixfdr, BUM/SPLOSH, SAGx, qvalue, fdrci, nFDR, multtest, LBE, FDR-AME, locfdr, nomi, localFDR, kerfdr, and twilight. WTHeck?

So, more specifically, which one is most widely used? Which is easiest? Which is fastest? Which is considered the bestest guesstimator or whatever for "false discovery rate".

BACKGROUND: I have a HUGE data set; we're talking terabytes. I can crank out the p-values, billions of them. I can torture our "friend" Bonferroni but he's not going to give me the answer I want, so I'm going to employ his alter ego, FDR ("false discovery rate").

I'm guessing R is going to choke on the input size, so I'll have to swipe the code, re-implement it in C, and pull tricks to fit it into memory and make it run in less than a month. If I'm going to hijack some code [keeping the GPL, of course], I need to know which is the best.
Old 11-04-2013, 10:22 AM   #2
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

That depends on what your data looks like. Your "background" doesn't really help in establishing this. All it tells me is that you are probably working with measured or simulated data, rather than survey response data.

If your data is aligned sequences from actual biological data, you're going to have a hard time finding out the true false discovery rate due to the behaviour of sequenced data. If your data is non-normal, then there are a bunch of commonly-used parametric tests that won't work. If your data is multidimensional, then you're probably going to need to reduce the dimensionality prior to analysis, which again has issues regarding the assumptions used for that reduction. If your data comprises a single quantitative value for each datapoint with a threshold chosen for what "positive" means, then the false discovery rate will change depending on the threshold, as well as what you consider greyzone (neither negative nor positive)...

I don't think this question can be appropriately answered in the general sense.
Old 11-04-2013, 11:28 AM   #3
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

A previous seqanswers thread suggested using p.adjust in R [ http://seqanswers.com/forums/showthread.php?t=14114 ].
The implementation of p.adjust is in this code [ in src/library/stats/R/p.adjust.R ] ...
Code:
p.adjust <- function(p, method = p.adjust.methods, n = length(p))
{
    ## Methods 'Hommel', 'BH', 'BY' and speed improvements
    ## contributed by Gordon Smyth
    method <- match.arg(method)
    if(method == "fdr") method <- "BH"  # back compatibility
    nm <- names(p)
    p <- as.numeric(p)
    p0 <- setNames(p, nm)
    if(all(nna <- !is.na(p))) nna <- TRUE
    p <- p[nna]
    lp <- length(p)
    stopifnot(n >= lp)
    if (n <= 1) return(p0)
    if (n == 2 && method == "hommel") method <- "hochberg"

    p0[nna] <-
        switch(method,
               bonferroni = pmin(1, n * p),
               holm = {
                   i <- seq_len(lp)
                   o <- order(p)
                   ro <- order(o)
                   pmin(1, cummax( (n - i + 1L) * p[o] ))[ro]
               },
               hommel = { ## needs n-1 >= 2 in for() below
                   if(n > lp) p <- c(p, rep.int(1, n-lp))
                   i <- seq_len(n)
                   o <- order(p)
                   p <- p[o]
                   ro <- order(o)
                   q <- pa <- rep.int( min(n*p/i), n)
                   for (j in (n-1):2) {
                       ij <- seq_len(n-j+1)
                       i2 <- (n-j+2):n
                       q1 <- min(j*p[i2]/(2:j))
                       q[ij] <- pmin(j*p[ij], q1)
                       q[i2] <- q[n-j+1]
                       pa <- pmax(pa,q)
                   }
                   pmax(pa,p)[if(lp < n) ro[1:lp] else ro]
               },
               hochberg = {
                   i <- lp:1L
                   o <- order(p, decreasing = TRUE)
                   ro <- order(o)
                   pmin(1, cummin( (n - i + 1L) * p[o] ))[ro]
               },
               BH = {
                   i <- lp:1L
                   o <- order(p, decreasing = TRUE)
                   ro <- order(o)
                   pmin(1, cummin( n / i * p[o] ))[ro]
               },
               BY = {
                   i <- lp:1L
                   o <- order(p, decreasing = TRUE)
                   ro <- order(o)
                   q <- sum(1L/(1L:n))
                   pmin(1, cummin(q * n / i * p[o]))[ro]
               },
               none = p)
    p0
}
This doesn't look bad at all. This looks quite hackable.
So I guess the real question is which one is best ...

"holm", "hochberg", "hommel", "BH", "BY", or "fdr" ? (not "bonferroni" ) The BH is Benjamini & Hochberg and BY is Benjamini & Yekutiel.

The data is p-values from a t-test, one dimension. It asks whether there are differences between two measurements at a position on the genome, across many samples; many positions are interrogated. It is indeed measured data, not survey or simulated.

The observations are largely independent at a distance; nearby observations may not be (examples: linkage disequilibrium and nearby gene co-expression phenomena).

Last edited by Richard Finney; 11-04-2013 at 11:56 AM.
Old 11-04-2013, 11:52 PM   #4
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by Richard Finney View Post
So, more specifically, which one is most widely used?
Hi- In microarray, RNA-Seq and other cases where multiple testing is involved, the most widely used correction is probably Benjamini-Hochberg, which is the default in limma, edgeR, and DESeq among other packages. Obviously this doesn't mean that it will work well in your case.

I think in general one should consider how many tests are expected to be true positives and how "bad" it would be to carry on some false positives to further analyses. In gene expression studies usually you expect several genes to be affected by your treatment(s) and for gene ontology or pathway analyses it is not too bad to have some false positives, so BH is fine. On the other hand, if you expect just one or few true positives (e.g. a Mendelian trait controlled by a single gene in a mapping exercise), maybe a more conservative correction is preferred.

Dario
Old 11-05-2013, 01:14 AM   #5
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

What's the input size that you think is going to choke R?! Can you give us an idea of the number of samples and the number of 'genes' (or whatever proxy you will use if unannotated)? Is this an annotated genome? Is it just a straightforward DE analysis?

To answer the question, I have always used p.adjust with BH, testing cutoffs of 0.1, 0.05 and 0.01.
Old 11-05-2013, 01:41 AM   #6
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
What's the input size that you think is going to choke R?!
As an additional point of information regarding this, R can probably read terabyte-sized files as long as the input is streamed and processed per chunk. It might take a long time, but I doubt it would choke R. Built into the core of R is a file-handle (connection) facility which allows you to read only a portion of a file at a time. If you're not doing pairwise tests (and with your dataset it's unlikely you can), then presumably there will be a way to process the dataset and generate the results you desire without loading the entire file into memory.
Old 11-05-2013, 02:19 AM   #7
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by gringer View Post
As an additional point of information regarding this, R can probably read terabyte-sized files
By the way, the ff package might also be useful here; from the description:

Quote:
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory
I think ff has been used extensively for big projects, though to apply one of the adjustment methods you probably have to read the whole vector of p-values into memory at some point.
Old 11-05-2013, 03:17 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Irrespective of what solution is chosen this is an interesting problem from the infrastructure side.

I am interested in hearing about what sort of hardware you are planning to run this on.
Old 11-05-2013, 04:26 AM   #9
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

I want to generate measurements for every location in the human genome, so 3 billion inputs; I'll have 3 billion p-values. I have a 64GB machine. 3 billion × 8 bytes per double = 24GB, so the input might fit; it looks like p.adjust will need a couple of support arrays, so .... I'll try it and watch it bomb just to see. The alternative is to use floats (12GB) and just implement it in C; the actual algorithm looks pretty straightforward.

Indeed R might not choke on a 256GB memory machine, so moving it to a more powerful machine might be the thing to do.
Old 11-05-2013, 06:56 AM   #10
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by Richard Finney View Post
I want to generate measurements for every location in the human genome, so 3 billion inputs; I'll have 3 billion p-values. I have a 64GB machine. 3 billion × 8 bytes per double = 24GB, so the input might fit; it looks like p.adjust will need a couple of support arrays, so .... I'll try it and watch it bomb just to see. The alternative is to use floats (12GB) and just implement it in C; the actual algorithm looks pretty straightforward.

Indeed R might not choke on a 256GB memory machine, so moving it to a more powerful machine might be the thing to do.
I would use command line tools, and add the 20 grand in memory to my bonus:

n=$(wc -l < file)

sort -g -k1,1 file | awk -F"\t" -v n="$n" '{alpha=0.005; if ($1 > alpha*NR/n) exit; print}'

where n is the line count from wc -l, assuming that each line has a p-value in the first column. (sort -g sorts numerically rather than lexically, and awk has no break outside loops, so exit stops the scan at the first p-value over the BH threshold; strictly, BH rejects up to the largest rank passing the threshold, so stopping at the first failure is slightly conservative.) Should run in n*log(n) time.