  • R mateys! Which is Best FDR?

    This is sort of a general question: what is the best FDR (false discovery rate) tool?

    I'm not sure which algorithm or "flavor" or whatever is considered "best".

    I note this page: http://www.strimmerlab.org/notes/fdr.html which lists fdrtool, mixfdr, BUM/SPLOSH, SAGx, qvalue, fdrci, nFDR, multtest, LBE, fdr-AME, locfdr, nomi, localFDR, kerfdr, twilight, localfdr. WTHeck?

    So, more specifically, which one is most widely used? Which is easiest? Which is fastest? Which is considered the bestest guesstimator or whatever for "false discovery rate"?

    BACKGROUND: I have a HUGE data set; we're talking terabytes. I can crank out the p-values, billions of them. I can torture our "friend" Bonferroni, but he's not going to give me the answer I want, so I'm going to employ his alter ego, FDR ("false discovery rate").

    I'm guessing R is going to choke on the input size, so I'll have to swipe the code, re-implement it in C, and pull tricks to fit it into memory and make it run in less than a month. If I'm going to hijack some code [keeping the GPL, of course], I need to know which one is best.

  • #2
    That depends on what your data looks like. Your "background" doesn't really help in establishing this. All it tells me is that you are probably working with measured or simulated data, rather than survey response data.

    If your data is aligned sequences from actual biological data, you're going to have a hard time finding out the true false discovery rate due to the behaviour of sequenced data. If your data is non-normal, then there are a bunch of commonly-used parametric tests that won't work. If your data is multidimensional, then you're probably going to need to reduce the dimensionality prior to analysis, which again has issues regarding the assumptions used for that reduction. If your data comprises a single quantitative value for each datapoint with a threshold chosen for what "positive" means, then the false discovery rate will change depending on the threshold, as well as on what you consider greyzone (neither negative nor positive)...

    I don't think this question can be appropriately answered in the general sense.

    • #3
      A previous seqanswers thread suggested using p.adjust in R [http://seqanswers.com/forums/showthread.php?t=14114].
      The implementation of p.adjust is in src/library/stats/R/p.adjust.R:
      Code:
      p.adjust <- function(p, method = p.adjust.methods, n = length(p))
      {
          ## Methods 'Hommel', 'BH', 'BY' and speed improvements
          ## contributed by Gordon Smyth
          method <- match.arg(method)
          if(method == "fdr") method <- "BH"  # back compatibility
          nm <- names(p)
          p <- as.numeric(p)
          p0 <- setNames(p, nm)
          if(all(nna <- !is.na(p))) nna <- TRUE
          p <- p[nna]
          lp <- length(p)
          stopifnot(n >= lp)
          if (n <= 1) return(p0)
          if (n == 2 && method == "hommel") method <- "hochberg"
      
          p0[nna] <-
              switch(method,
                     bonferroni = pmin(1, n * p),
                     holm = {
                         i <- seq_len(lp)
                         o <- order(p)
                         ro <- order(o)
                         pmin(1, cummax( (n - i + 1L) * p[o] ))[ro]
                     },
                     hommel = { ## needs n-1 >= 2 in for() below
                         if(n > lp) p <- c(p, rep.int(1, n-lp))
                         i <- seq_len(n)
                         o <- order(p)
                         p <- p[o]
                         ro <- order(o)
                         q <- pa <- rep.int( min(n*p/i), n)
                         for (j in (n-1):2) {
                             ij <- seq_len(n-j+1)
                             i2 <- (n-j+2):n
                             q1 <- min(j*p[i2]/(2:j))
                             q[ij] <- pmin(j*p[ij], q1)
                             q[i2] <- q[n-j+1]
                             pa <- pmax(pa,q)
                         }
                         pmax(pa,p)[if(lp < n) ro[1:lp] else ro]
                     },
                     hochberg = {
                         i <- lp:1L
                         o <- order(p, decreasing = TRUE)
                         ro <- order(o)
                         pmin(1, cummin( (n - i + 1L) * p[o] ))[ro]
                     },
                     BH = {
                         i <- lp:1L
                         o <- order(p, decreasing = TRUE)
                         ro <- order(o)
                         pmin(1, cummin( n / i * p[o] ))[ro]
                     },
                     BY = {
                         i <- lp:1L
                         o <- order(p, decreasing = TRUE)
                         ro <- order(o)
                         q <- sum(1L/(1L:n))
                         pmin(1, cummin(q * n / i * p[o]))[ro]
                     },
                     none = p)
          p0
      }
      This doesn't look bad at all. This looks quite hackable.
      So I guess the real question is which one is best ...

      "holm", "hochberg", "hommel", "BH", "BY", or "fdr" ? (not "bonferroni" ) The BH is Benjamini & Hochberg and BY is Benjamini & Yekutiel.

      The data is p-values from a t-test, one dimension. This is asking are there differences between two measurements for a position on the genome for many samples. Many positions are interrogated. It is indeed measured. It is not survey or simulated.

      The data are observations are largely independent at a distance; observations nearby may not be (examples: linkage disequilibrium and near by gene co-expression phenomena).
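
      For a quick feel for how these differ, here's a sketch on toy p-values (the vector below is made up, not my data); note that "fdr" is just an alias for "BH" inside p.adjust:
      Code:
      ## toy p-values, purely illustrative
      p <- c(1e-6, 1e-4, 0.003, 0.02, 0.04, 0.2, 0.5, 0.9)
      ## one column of adjusted p-values per method
      sapply(c("holm", "hochberg", "hommel", "BH", "BY"),
             function(m) p.adjust(p, method = m))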
      Last edited by Richard Finney; 11-04-2013, 12:56 PM.

      • #4
        Originally posted by Richard Finney View Post
        So, more specifically, which one is most widely used?
        Hi- In microarray, RNA-Seq, and other cases where multiple testing is involved, the most widely used correction is probably Benjamini-Hochberg, which is the default in limma, edgeR, and DESeq, among other packages. Obviously this doesn't mean it will work well in your case.

        I think in general one should consider how many tests are expected to be true positives and how "bad" it would be to carry some false positives over into further analyses. In gene expression studies you usually expect several genes to be affected by your treatment(s), and for gene ontology or pathway analyses it is not too bad to have some false positives, so BH is fine. On the other hand, if you expect just one or a few true positives (e.g. a Mendelian trait controlled by a single gene in a mapping exercise), a more conservative correction may be preferable.

        Dario

        • #5
          What's the input size that you think is going to choke R?! Can you give us an idea of the number of samples and the number of 'genes', or whatever proxy you will use if unannotated? Is this an annotated genome? Is it just a straightforward DE analysis?

          To answer the question, I have always used p.adjust with BH, testing at 0.1, 0.05, and 0.01.
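
          For example (a sketch on a hypothetical vector pvals of raw p-values, not any particular dataset):
          Code:
          padj <- p.adjust(pvals, method = "BH")
          ## how many calls survive each cutoff
          sapply(c(0.1, 0.05, 0.01), function(a) sum(padj <= a))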

          • #6
            What's the input size that you think is going to choke R?!
            As an additional point of information regarding this, R can probably read terabyte-sized files as long as the input is streamed and processed per chunk. It might take a long time, but I doubt it would choke R. Built into the core of R is a file-connection facility that lets you read only a portion of a file at a time. If you're not doing pairwise tests (and with your dataset it's unlikely you can), then presumably there will be a way to process the dataset and generate the results you want without loading the entire file into memory.
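
            Something along these lines, for instance (a sketch only; "pvals.txt" is a made-up file name with one p-value per line, and the per-chunk work shown is just counting raw hits, since a real FDR adjustment still needs a global view of all the p-values):
            Code:
            con <- file("pvals.txt", open = "r")
            n_below <- 0
            repeat {
                chunk <- scan(con, what = double(), n = 1e6, quiet = TRUE)  # read up to 1e6 values per pass
                if (length(chunk) == 0) break
                n_below <- n_below + sum(chunk < 0.05)                      # per-chunk work goes here
            }
            close(con)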

            • #7
              Originally posted by gringer View Post
              As an additional point of information regarding this, R can probably read terabyte-sized files
              By the way, the ff package might also be useful here; from its description:

              The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory
              I think ff has been used extensively for big projects, although to apply one of the adjustment methods you probably have to read the whole vector of p-values into memory at some point.
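
              A minimal sketch of the idea, assuming ff is installed (the length and fill values below are placeholders, not real data):
              Code:
              library(ff)
              p_ff <- ff(vmode = "double", length = 1e8)        # file-backed vector, not held in RAM
              for (start in seq(1, length(p_ff), by = 1e6)) {
                  idx <- start:min(start + 1e6 - 1, length(p_ff))
                  p_ff[idx] <- runif(length(idx))               # placeholder values; real p-values would go here
              }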

              • #8
                Irrespective of what solution is chosen, this is an interesting problem from the infrastructure side.

                I am interested in hearing about what sort of hardware you are planning to run this on.

                • #9
                  I want to generate measurements for every location in the human genome, so 3 billion inputs; I'll have 3 billion p-values. I have a 64 GB machine. 3 billion * 8 bytes per double = 24 GB, so the input might fit; it looks like p.adjust will need a couple of support arrays, so ... I'll try it and watch it bomb just to see. The alternative is to use floats and just implement it in C; the actual algorithm looks pretty straightforward.

                  Indeed, R might not choke on a 256 GB machine, so moving to a more powerful machine might be the thing to do.
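
                  For what it's worth, the core of BH itself is small. A sketch of the "straightforward algorithm" part, assuming only the number of rejections at a given alpha is needed (it still allocates full-length temporaries alongside p, which is exactly the support-array cost above):
                  Code:
                  bh_count <- function(p, alpha = 0.05) {
                      m <- length(p)
                      p_sorted <- sort(p)                              # one full-length copy
                      k <- which(p_sorted <= alpha * seq_len(m) / m)   # more full-length temporaries
                      if (length(k) == 0) 0L else max(k)               # largest k with p_(k) <= alpha*k/m
                  }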

                  • #10
                    Originally posted by Richard Finney View Post
                    I want to generate measurements for every location in the human genome, so 3 billion inputs; I'll have 3 billion p-values. I have a 64 GB machine. 3 billion * 8 bytes per double = 24 GB, so the input might fit; it looks like p.adjust will need a couple of support arrays, so ... I'll try it and watch it bomb just to see. The alternative is to use floats and just implement it in C; the actual algorithm looks pretty straightforward.

                    Indeed, R might not choke on a 256 GB machine, so moving to a more powerful machine might be the thing to do.
                    I would use command-line tools, and add the 20 grand in memory to my bonus:

                    wc -l file

                    sort -g -k1,1 file | awk -F"\t" -v alpha=0.005 -v m=line_number '$1 > alpha * NR / m { exit } { print }'

                    where line_number is the value returned by wc -l (the total number of tests), assuming that each line has a p-value in the first column. This prints the sorted lines until the first p-value exceeds its BH threshold alpha*rank/m (strict BH rejects up to the largest such rank, so this early-exit scan is slightly conservative). Should run in n*log(n) time.
