Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Testing/benchmarking RNASeq DE tools

    I am performing DE analysis of RNAseq datasets using DESeq, edgeR, and CuffDiff. While DESeq and edger (based on very similar models) agree for the most part, I have found that CuffDiff gives very different results.

    In the absence of a synthetic "truth dataset" for RNAseq (a simulated dataset where I know a priori the set of differentially expressed genes), I am unable to decide which package is more sensitive/accurate. I am actually quite amazed that tools which are extensively used by the RNAseq community have not been tested in this manner.

    Does anybody know where I can find such a dataset or a tool that can generate one???

  • #2
    I've commented on this subject several times before, but as this is scattered through the forums, I'll try a comprehensive answer. My apologies to advanced readers for repeating many basic facts.

    The reason you see such strong differences between DESeq and edgeR on the one hand and cuffdiff on the other hand is not because of the different methods used. Rather, there are two fundamentally different questions you may ask when analysing expression data:

    (A) Is the concentration of a given transcripts different in two samples?

    (B) May this difference be attributed to the difference in experimental conditions, i.e., may we be confident that it is due to the experimental treatment and not due to fluctuations not under the experimenter's control (not due to "biological variation")?


    Allow me to discuss these in details:

    (A) Given two samples, you measure the concentration of your transcript of interest by some technique (here RNA-Seq, but could also be microarray and RNA-Seq). You know that your measurement process has an inherent uncertainty ("technical noise"), and hence, even if the two samples are actually aliquots of the same processed library you do not expect to get two exactly equal measurements. If you know how strong the technical noise is, you can calculate how much difference you need to see between the samples before you can reject the null hypothesis that the transcript has the same concentration in both samples.

    To find out about the strength of technical noise, you could compare technical replicates, i.e., two measurements of aliquots from the same sample. However, specifically for RNA-Seq, it is established that technical noise does not depend much on the experimental details. If all goes well, it is hardly above the theoretical limit imposed by the so-called Poisson or shot noise. This is the core result of the 2008 Genome Res. paper by Marioni et al. Hence, you do not need to compare technical replicates; even with a single sample, you can calculate the measurement uncertainty from theoretical considerations. If you compare whole genes, this calculation is simple: we expect the technical variance to be equal to the mean. Whenever a paper mentions Fisher's exact test, the hypergeometric test or a likelihood-ratio test, they are likely to refer to a variant of doing this. If you want to not compare just whole genes but distinguish isoforms, things get more complicated because the isoform deconvolution intrdouces further uncertainty. Taking this into account is what cuffdiff does.


    (B) In RNA-Seq, we are typically interested in the effect of a treatment on a cell. (Treatment can be anything, from a drug to knocking down a gene, transfecting in a mutation, or just looking at another tissue type.) If you have a control and a treatment sample, you may use one of the tests just outlined to establish that your favorite gene has different expression in the two samples. This, however, does not mean that it is reacting to your treatment. Maybe the gene reacted to some environmental fluctuation, such as minor differences in temperature or nutritient concentration.

    This is why you need biological replicates. There are countless small differences that the experimenter cannot control, and these will cause changes in gene expression that could be mistaken for effects of the treatment. Hence, the standard recipe is this: Do your experiment several times without changing the treatment ("biological replicates"), compare them to find out how strong genes vary even though you tried to keep everything the same, and then consider only those genes as effected by the treatment, for which the difference between differently treated samples ("between-group variation" in ANOVA terminology) is significantly larger thane the difference between samples treated identically ("within-group variation").

    This is what DESeq, edgeR and BaySeq focus on. We try to get a good estimate of the biological variability, and to keep things simple, we do not try isoform deconvolution.


    A good way to summerize this issue is this: Given two replicates, if you compare one against the other, do you want your tool to report significant differences? Cuffdiff addresses question (A) and hence will strive to not report differences between techical replicates but will report differences between biological ones. EdgeR, DESeq and BaySeq address question (B), i.e., they will strive to report no differences even when comparing two biological replicates with each other. (See this post, where marcora did the comparison for his data.)


    This distinction was not that crucial in the beginning. This is because for low count rates, technical noise dominates and is so much stronger than biological noise that the answers to both questions are typically the same. However, once you have more than, say, around a hundred counts for a gene, technical noise becomes smaller than biological noise, and it will be essential to test for question (B), as you will find that the answer to (A) is nearly always yes for strongly expressed genes. (Given enough counts, you see the weak differences that are always there.)


    Lior Pachter (lpachter) and I had a debate here on SeqAnswers a while ago, in which Lior strongly disagreed with my claim, that I just reiterated, that the assessment of biological variation and hence this distinction and is crucial. (I tried to link to the old thread with the debate but couldn't find it, sorry.)

    I stick to my points, though: Performing experiments without replications will not and cannot generate results with a valid biological interpretation, and the fact that so many papers without replicates have been published recently does not make this any better. And any method which does not attempt to assess biological variability (and hence does not need replicates) should not be used, unless one has a good reason to be explicitly interested in question (A), not (B).


    To come back to the original question about benchmarking methods for testing for differential expression: A test data set with gold standard should hence use biological replicates, not technical one. A comparison of technical replicates will miss one of the crucial difficulties, namely how to accurately estimate variance from few replicates. This is why I consider the paper by Bullard et al. (BMC Bioinf, 2010, 11:94) that did this as not helpful.

    The main difficulty in DE analysis for RNA-Seq is to get the variance estimate rate right even when working with few replicates. If you had many replicates, you can use well established techniques from traditional inferential statistics. Hence, we would need a large data set with two conditions and maybe around 20 or so biological replicates for each condition. This would allow to establish a ground truth, and one could then challenge the tools to recover this ground truth if provided with only a subset, say, only two or three samples per condition.
    Last edited by Simon Anders; 04-18-2011, 12:18 AM.

    Comment


    • #3
      Dear Simon,

      thank you for your answer. I really appreciate your contributions to this forum, which make bench biologists like me understand better the statistical ground of the tools we use for RNASeq.

      Specifically, I see your point about A vs B kind of scenarios. Recently CuffDiff implemented some changes that make it able to handle replicates... but it is not clear to me whether their algorithm addresses the B scenario like deseq/edger/bayseq do. This is why I was inquiring about a "truth" dataset to be able to test the output of these tools. Until somebody comes up with a test dataset including 20 biological replicates, do you know of any tool (similar to fluxsimulator) that can generate one "in silico"???

      Thanx!!!

      Comment


      • #4
        I would say that, no, the new cuffdiff still addresses scenario A. I conclude this from the fact that there is no mention of an estimation of biological variance in the new cufflinks paper (Roberts et al., Genome Biology 2011, 12:R22). Rather, the paper explicitly states that their aim is to get an accurate figure for the actual abundance of a transcript in sample (i.e., they work in the framework of what I called question A), and they argue that biases that distort the estimate and should be corrected for may be better estimated if the algorithm is informed about which sample belongs to which treatment group. Their changes seem to be only about cufflinks, not cuffdiff.

        Comment


        • #5
          Originally posted by Simon Anders View Post
          I would say that, no, the new cuffdiff still addresses scenario A. I conclude this from the fact that there is no mention of an estimation of biological variance in the new cufflinks paper (Roberts et al., Genome Biology 2011, 12:R22). Rather, the paper explicitly states that their aim is to get an accurate figure for the actual abundance of a transcript in sample (i.e., they work in the framework of what I called question A), and they argue that biases that distort the estimate and should be corrected for may be better estimated if the algorithm is informed about which sample belongs to which treatment group. Their changes seem to be only about cufflinks, not cuffdiff.
          Have you read this http://cufflinks.cbcb.umd.edu/howitworks.html#hdif, I did but I cannot quite understand it (I am not a biostatistician).

          Going back to the original question... do you know of something like fluxsimulator that can generate rnaseq dataset where the differentially expressed species are known in advance???

          Comment


          • #6
            Concerning the in-silico simulation: I have a simple script that produces a count table of simulated data and if you use it to benchmark methods, DESeq comes out best. This is not at all surprising. When developing a method, I propose a statistical model of which I think that it captures well the salient features of what is happening in real data, and if I write a simulation, I will, of course, use the same model. If another tool assumes another model it will perform worse than my tool when tested with data from my simulation.

            By this I mean: The difference between methods can be either about how to implement a test for a given model, or about the question for what model to implement a test. Simulation only helps to compare two different tests developed for the same model assumption. To test which model is more appropriate you need real data.

            I see shortcomings in both our (DESeq's) and edgeR's assumption about how real biological variance looks like. Whether these matter in practice can only be checked with (lots of) real data.
            Last edited by Simon Anders; 04-18-2011, 01:49 AM. Reason: clarified wording

            Comment


            • #7
              Originally posted by marcora View Post
              Isoform deconvolution causes extra uncertainty because if a read maps to an exon shared by several isoforms, you don't know to which to assign it. This counts as technical noise, not biological noise. (You want to know how close you are to the actual concentration of the isoforms in the specific sample under consideration, not how close you are to the population average over all samples from the same condition.) The Jensen-Shannon divergence is a way to assess this uncertainty. This is all clearly "question A".

              Comment


              • #8
                Originally posted by Simon Anders View Post
                Isoform deconvolution causes extra uncertainty because if a read maps to an exon shared by several isoforms, you don't know to which to assign it. This counts as technical noise, not biological noise. (You want to know how close you are to the actual concentration of the isoforms in the specific sample under consideration, not how close you are to the population average over all samples from the same condition.) The Jensen-Shannon divergence is a way to assess this uncertainty. This is all clearly "question A".
                Got it! Thanx a lot for your valuable help.

                Comment


                • #9
                  Originally posted by Simon Anders View Post
                  Concerning the in-silico simulation: I have a simple script that produces a count table of simulated data and if you use it to benchmark methods, DESeq comes out best. This is not at all surprising. When developing a method, I propose a statistical model of which I think that it captures well the salient features of what is happening in real data, and if I write a simulation, I will, of course, use the same model. If another tool assumes another model it will perform worse than my tool when tested with data from my simulation.

                  By this I mean: The difference between methods can be either about how to implement a test for a given model, or about the question for what model to implement a test. Simulation only helps to compare two different tests developed for the same model assumption. To test which model is more appropriate you need real data.

                  I see shortcomings in both our (DESeq's) and edgeR's assumption about how real biological variance looks like. Whether these matter in practice can only be checked with (lots of) real data.
                  I was thinking more in the line of a tool that can generate a transcriptome of known composition (with, e.g., 100 defined genes differentially expressed at various levels), and then simulate the sequencing of this transcriptome to generate actual reads. This could be then fed into something like tophat > htseq-count > deseq or tophat > cuffdiff and then compare the output list of DEGs with the one used as input for the transcriptome generation algorithm.

                  Comment


                  • #10
                    I analysed my data with edgeR and I got 490 differentially expressed genes(300 upregulated and 190 down regulated). I am little bit confused. I am not sure the cut off value used to categorise up and down regulation. When I checked the fold changes, I could not find. If I want to do a pathway analysis, I am not sure which value I have to use ie fold change or p value. Can any body help me? I am not good is statistics.

                    Comment


                    • #11
                      Originally posted by marcora View Post
                      I am performing DE analysis of RNAseq datasets using DESeq, edgeR, and CuffDiff. While DESeq and edger (based on very similar models) agree for the most part, I have found that CuffDiff gives very different results.

                      In the absence of a synthetic "truth dataset" for RNAseq (a simulated dataset where I know a priori the set of differentially expressed genes), I am unable to decide which package is more sensitive/accurate. I am actually quite amazed that tools which are extensively used by the RNAseq community have not been tested in this manner.

                      Does anybody know where I can find such a dataset or a tool that can generate one???
                      Hello marcora.

                      It is a pretty old thread but relevant for evaluating many RNA-seq statistical methods. I was wondering if you were successful in your searches for benchmark dataset to evaluate any methods for differential expression detection?

                      Precisely: we need some artificial, BAM files in which we decide which hypothetical genes will turn out differentially expressed, after processing the BAM files. Those files do not need to reflect any real biological behavior.
                      Sergiusz Wesolowski

                      Comment


                      • #12
                        DESeq provides a test data set "pasilla," described in the vignette. Not sure if that's what you are looking for.

                        Comment


                        • #13
                          @greigite

                          Thank you for your response, but unfortunately, if I understand correctly what "pasilla" dataset is, it is not what I am looking for.

                          I was going through it now (didn't know the dataset existed before) and the problem is that: we do not know which genes in this dataset are truly differentially expressed between conditions - this is the main property of a dataset we want to have.

                          Secondly the dataset contains "count data" which means we will not be able to use it for any comparisons with RPKM/FPKM based methods, particularly cuffdiff
                          Sergiusz Wesolowski

                          Comment


                          • #14
                            Originally posted by wesserg View Post
                            Hello marcora.

                            It is a pretty old thread but relevant for evaluating many RNA-seq statistical methods. I was wondering if you were successful in your searches for benchmark dataset to evaluate any methods for differential expression detection?

                            Precisely: we need some artificial, BAM files in which we decide which hypothetical genes will turn out differentially expressed, after processing the BAM files. Those files do not need to reflect any real biological behavior.
                            The best I could find is in here http://bioserver.hci.utah.edu/TestDatasets/

                            Comment


                            • #15
                              i want add

                              -SAMseq,
                              -PoissonSeq, and
                              -npseq
                              (all from the entourage of robert tibshirani)



                              they all work with permutations (non-parametric) and not with a statistical model (i.e. poisson, neg. binomial) and are therefore not sensitive to co-regulated genes.
                              in my hands they worked best for a experimental design with 12 biological replicates.

                              and some of them are able to use mached pair designs, more than two groups, right-censored outcomes (OS, DFS, PFS), and quantitative outcomes.

                              i don't understand why these packages are so unknown...

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Recent Innovations in Spatial Biology
                                by seqadmin


                                Spatial biology is an exciting field that encompasses a wide range of techniques and technologies aimed at mapping the organization and interactions of various biomolecules in their native environments. As this area of research progresses, new tools and methodologies are being introduced, accompanied by efforts to establish benchmarking standards and drive technological innovation.

                                3D Genomics
                                While spatial biology often involves studying proteins and RNAs in their...
                                01-01-2025, 07:30 PM
                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 01-09-2025, 04:04 PM
                              0 responses
                              433 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 01-09-2025, 09:42 AM
                              0 responses
                              441 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 01-08-2025, 03:17 PM
                              0 responses
                              457 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 01-03-2025, 11:18 AM
                              1 response
                              50 views
                              1 like
                              Last Post Tonia
                              by Tonia
                               
                              Working...
                              X