SEQanswers

Old 04-16-2011, 03:36 AM   #1
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Testing/benchmarking RNASeq DE tools

I am performing DE analysis of RNAseq datasets using DESeq, edgeR, and CuffDiff. While DESeq and edgeR (based on very similar models) agree for the most part, I have found that CuffDiff gives very different results.

In the absence of a synthetic "truth dataset" for RNAseq (a simulated dataset where I know a priori the set of differentially expressed genes), I am unable to decide which package is more sensitive/accurate. I am actually quite amazed that tools which are extensively used by the RNAseq community have not been tested in this manner.

Does anybody know where I can find such a dataset or a tool that can generate one???
Old 04-18-2011, 01:10 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

I've commented on this subject several times before, but as this is scattered through the forums, I'll try a comprehensive answer. My apologies to advanced readers for repeating many basic facts.

The reason you see such strong differences between DESeq and edgeR on the one hand and cuffdiff on the other hand is not because of the different methods used. Rather, there are two fundamentally different questions you may ask when analysing expression data:

(A) Is the concentration of a given transcript different in two samples?

(B) May this difference be attributed to the difference in experimental conditions, i.e., may we be confident that it is due to the experimental treatment and not due to fluctuations that are not under the experimenter's control ("biological variation")?


Allow me to discuss these in detail:

(A) Given two samples, you measure the concentration of your transcript of interest by some technique (here RNA-Seq, but it could also be, e.g., a microarray). You know that your measurement process has an inherent uncertainty ("technical noise"), and hence, even if the two samples are actually aliquots of the same processed library, you do not expect to get two exactly equal measurements. If you know how strong the technical noise is, you can calculate how much difference you need to see between the samples before you can reject the null hypothesis that the transcript has the same concentration in both samples.

To find out about the strength of technical noise, you could compare technical replicates, i.e., two measurements of aliquots from the same sample. However, specifically for RNA-Seq, it is established that technical noise does not depend much on the experimental details. If all goes well, it is hardly above the theoretical limit imposed by the so-called Poisson or shot noise. This is the core result of the 2008 Genome Res. paper by Marioni et al. Hence, you do not need to compare technical replicates; even with a single sample, you can calculate the measurement uncertainty from theoretical considerations. If you compare whole genes, this calculation is simple: we expect the technical variance to be equal to the mean. Whenever a paper mentions Fisher's exact test, the hypergeometric test or a likelihood-ratio test, it is likely referring to a variant of this approach. If you want not just to compare whole genes but to distinguish isoforms, things get more complicated, because the isoform deconvolution introduces further uncertainty. Taking this into account is what cuffdiff does.
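Since the technical variance equals the mean, a two-sample comparison of counts reduces to an exact test. A minimal sketch in plain Python (assuming equal library sizes, which a real analysis would have to normalize for): conditional on the total, one count is binomially distributed under the null hypothesis of equal Poisson rates.

```python
from math import comb

def poisson_ratio_test(k1, k2):
    """Exact two-sided test that two counts come from Poisson
    distributions with the same rate. Conditional on the total
    n = k1 + k2, k1 ~ Binomial(n, 0.5) under the null hypothesis
    (this assumes equal library sizes)."""
    n = k1 + k2
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    p_obs = pmf[k1]
    # two-sided p-value: sum over all outcomes at most as probable
    return min(1.0, sum(p for p in pmf if p <= p_obs + 1e-12))
```

For example, 10 vs 40 reads is clearly significant under shot noise alone, while 100 vs 110 is well within it.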


(B) In RNA-Seq, we are typically interested in the effect of a treatment on a cell. (Treatment can be anything, from a drug to knocking down a gene, transfecting in a mutation, or just looking at another tissue type.) If you have a control and a treatment sample, you may use one of the tests just outlined to establish that your favorite gene has different expression in the two samples. This, however, does not mean that it is reacting to your treatment. Maybe the gene reacted to some environmental fluctuation, such as minor differences in temperature or nutrient concentration.

This is why you need biological replicates. There are countless small differences that the experimenter cannot control, and these will cause changes in gene expression that could be mistaken for effects of the treatment. Hence, the standard recipe is this: Do your experiment several times without changing the treatment ("biological replicates"), compare them to find out how strongly genes vary even though you tried to keep everything the same, and then consider only those genes as affected by the treatment for which the difference between differently treated samples ("between-group variation" in ANOVA terminology) is significantly larger than the difference between samples treated identically ("within-group variation").

This is what DESeq, edgeR and BaySeq focus on. We try to get a good estimate of the biological variability, and to keep things simple, we do not try isoform deconvolution.
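The between-group vs. within-group comparison above is the classic one-way ANOVA decomposition. A toy sketch in plain Python, with made-up numbers, just to show the mechanics:

```python
def f_ratio(groups):
    """One-way ANOVA F statistic: between-group mean square divided
    by within-group mean square. A large ratio means the groups
    differ by more than the replicate-to-replicate variation."""
    all_x = [x for g in groups for x in g]
    grand = sum(all_x) / len(all_x)
    k, n = len(groups), len(all_x)
    between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2
                  for g in groups) / (k - 1)
    within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                 for g in groups) / (n - k)
    return between / within

# replicates agree within groups but differ between them:
print(f_ratio([[10, 11, 9], [30, 31, 29]]))    # large F: treatment effect
# the same values shuffled across groups: within-group noise dominates
print(f_ratio([[10, 30, 20], [21, 9, 31]]))    # F near zero
```

The real packages do something much more refined (they share dispersion information across genes and model counts, not normal data), but the logic of the comparison is this one.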


A good way to summarize this issue is this: Given two replicates, if you compare one against the other, do you want your tool to report significant differences? Cuffdiff addresses question (A) and hence will strive to not report differences between technical replicates but will report differences between biological ones. EdgeR, DESeq and BaySeq address question (B), i.e., they will strive to report no differences even when comparing two biological replicates with each other. (See this post, where marcora did the comparison for his data.)


This distinction was not that crucial in the beginning. This is because for low count rates, technical noise dominates and is so much stronger than biological noise that the answers to both questions are typically the same. However, once you have more than, say, around a hundred counts for a gene, technical noise becomes smaller than biological noise, and it will be essential to test for question (B), as you will find that the answer to (A) is nearly always yes for strongly expressed genes. (Given enough counts, you see the weak differences that are always there.)
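This crossover can be seen in a small simulation. Biological variation across replicates is commonly modelled as a Gamma-Poisson (negative binomial) mixture, giving variance = mean + dispersion * mean^2; the parameters below are made up for illustration:

```python
import math
import random

rng = random.Random(42)

def poisson(rate):
    """Knuth's Poisson sampler -- fine for the moderate rates used here."""
    L, k, p = math.exp(-rate), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def nb(mean, disp):
    """Negative binomial via Gamma-Poisson mixture: biological
    variability on top of shot noise. Variance = mean + disp * mean**2."""
    return poisson(rng.gammavariate(1.0 / disp, mean * disp))

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

mean, disp, n = 200, 0.1, 3000
tech = [poisson(mean) for _ in range(n)]   # technical replicates: shot noise only
bio = [nb(mean, disp) for _ in range(n)]   # biological replicates: overdispersed

# technical variance stays near the mean (~200), while biological
# variance is near 200 + 0.1 * 200**2 = 4200: biological noise dominates
```

At a mean of 2 instead of 200, the two variances would be nearly indistinguishable, which is why the distinction only started to matter once counts got large.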


Lior Pachter (lpachter) and I had a debate here on SeqAnswers a while ago, in which Lior strongly disagreed with my claim, just reiterated above, that the assessment of biological variation, and hence this distinction, is crucial. (I tried to link to the old thread with the debate but couldn't find it, sorry.)

I stick to my points, though: Performing experiments without replications will not and cannot generate results with a valid biological interpretation, and the fact that so many papers without replicates have been published recently does not make this any better. And any method which does not attempt to assess biological variability (and hence does not need replicates) should not be used, unless one has a good reason to be explicitly interested in question (A), not (B).


To come back to the original question about benchmarking methods for testing for differential expression: A test data set with a gold standard should hence use biological replicates, not technical ones. A comparison of technical replicates will miss one of the crucial difficulties, namely how to accurately estimate variance from few replicates. This is why I consider the paper by Bullard et al. (BMC Bioinf, 2010, 11:94), which did this, not helpful.

The main difficulty in DE analysis for RNA-Seq is to get the variance estimate right even when working with few replicates. If you had many replicates, you could use well-established techniques from traditional inferential statistics. Hence, we would need a large data set with two conditions and maybe around 20 or so biological replicates for each condition. This would allow one to establish a ground truth, and one could then challenge the tools to recover this ground truth when provided with only a subset, say, only two or three samples per condition.
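That evaluation loop can be sketched end-to-end in a simulation, with the caveat raised later in this thread: simulated data can only compare tests under the same model assumptions, not validate the model. Everything below (gene count, fold change, dispersion, and the naive z-test standing in for a real DE method) is an illustrative assumption:

```python
import random
from math import exp, sqrt

rng = random.Random(1)

def poisson_draw(rate):
    """Poisson sample: Knuth's method, with a normal approximation
    for large rates where exp(-rate) would underflow."""
    if rate > 500:
        return max(0, round(rng.gauss(rate, sqrt(rate))))
    L, k, p = exp(-rate), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def nb_sample(mean, disp):
    """Negative binomial count via a Gamma-Poisson mixture."""
    return poisson_draw(rng.gammavariate(1.0 / disp, mean * disp))

def simulate(n_genes=200, n_de=20, reps=20, fold=4.0, disp=0.1):
    """Two conditions with `reps` biological replicates each; the
    first `n_de` genes truly change by `fold`."""
    truth = set(range(n_de))
    base = [rng.uniform(50, 300) for _ in range(n_genes)]
    a = [[nb_sample(base[g], disp) for _ in range(reps)] for g in range(n_genes)]
    b = [[nb_sample(base[g] * (fold if g in truth else 1.0), disp)
          for _ in range(reps)] for g in range(n_genes)]
    return a, b, truth

def call_de(a, b, z_cut=4.0):
    """Naive Welch-type z-test on raw counts -- a crude stand-in for a
    real DE method, just to illustrate the evaluation loop."""
    hits = set()
    for g in range(len(a)):
        ma, mb = sum(a[g]) / len(a[g]), sum(b[g]) / len(b[g])
        va = sum((x - ma) ** 2 for x in a[g]) / (len(a[g]) - 1)
        vb = sum((x - mb) ** 2 for x in b[g]) / (len(b[g]) - 1)
        se = sqrt(va / len(a[g]) + vb / len(b[g])) or 1.0
        if abs(ma - mb) / se > z_cut:
            hits.add(g)
    return hits

a, b, truth = simulate()
full = call_de(a, b)                                   # all 20 replicates
sub = call_de([r[:3] for r in a], [r[:3] for r in b])  # only 3 per condition
# expect high sensitivity with 20 replicates; with 3, power drops and the
# per-gene variance estimates become much noisier
```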

Last edited by Simon Anders; 04-18-2011 at 01:18 AM.
Old 04-18-2011, 01:27 AM   #3
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52

Dear Simon,

Thank you for your answer. I really appreciate your contributions to this forum, which help bench biologists like me better understand the statistical grounding of the tools we use for RNASeq.

Specifically, I see your point about the A vs. B kinds of scenarios. Recently CuffDiff implemented some changes that make it able to handle replicates... but it is not clear to me whether their algorithm addresses the B scenario like DESeq/edgeR/baySeq do. This is why I was inquiring about a "truth" dataset to be able to test the output of these tools. Until somebody comes up with a test dataset including 20 biological replicates, do you know of any tool (similar to fluxsimulator) that can generate one "in silico"???

Thanx!!!
Old 04-18-2011, 02:41 AM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

I would say that, no, the new cuffdiff still addresses scenario A. I conclude this from the fact that there is no mention of an estimation of biological variance in the new cufflinks paper (Roberts et al., Genome Biology 2011, 12:R22). Rather, the paper explicitly states that their aim is to get an accurate figure for the actual abundance of a transcript in a sample (i.e., they work in the framework of what I called question A), and they argue that the biases that distort this estimate and should be corrected for may be better estimated if the algorithm is informed about which sample belongs to which treatment group. Their changes seem to concern only cufflinks, not cuffdiff.
Old 04-18-2011, 02:45 AM   #5
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52

Quote:
Originally Posted by Simon Anders View Post
I would say that, no, the new cuffdiff still addresses scenario A. I conclude this from the fact that there is no mention of an estimation of biological variance in the new cufflinks paper (Roberts et al., Genome Biology 2011, 12:R22). Rather, the paper explicitly states that their aim is to get an accurate figure for the actual abundance of a transcript in a sample (i.e., they work in the framework of what I called question A), and they argue that the biases that distort this estimate and should be corrected for may be better estimated if the algorithm is informed about which sample belongs to which treatment group. Their changes seem to concern only cufflinks, not cuffdiff.
Have you read this: http://cufflinks.cbcb.umd.edu/howitworks.html#hdif? I did, but I cannot quite understand it (I am not a biostatistician).

Going back to the original question... do you know of something like fluxsimulator that can generate an RNA-Seq dataset where the differentially expressed species are known in advance???
Old 04-18-2011, 02:48 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

Concerning the in-silico simulation: I have a simple script that produces a count table of simulated data, and if you use it to benchmark methods, DESeq comes out best. This is not at all surprising. When developing a method, I propose a statistical model that I think captures well the salient features of what is happening in real data, and if I write a simulation, I will, of course, use the same model. If another tool assumes another model, it will perform worse than my tool when tested with data from my simulation.

By this I mean: The difference between methods can be either about how to implement a test for a given model, or about the question for what model to implement a test. Simulation only helps to compare two different tests developed for the same model assumption. To test which model is more appropriate you need real data.

I see shortcomings in both our (DESeq's) and edgeR's assumptions about what real biological variance looks like. Whether these matter in practice can only be checked with (lots of) real data.

Last edited by Simon Anders; 04-18-2011 at 02:49 AM. Reason: clarified wording
Old 04-18-2011, 02:53 AM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

Quote:
Originally Posted by marcora View Post
Isoform deconvolution causes extra uncertainty because if a read maps to an exon shared by several isoforms, you don't know to which to assign it. This counts as technical noise, not biological noise. (You want to know how close you are to the actual concentration of the isoforms in the specific sample under consideration, not how close you are to the population average over all samples from the same condition.) The Jensen-Shannon divergence is a way to assess this uncertainty. This is all clearly "question A".
Old 04-18-2011, 03:01 AM   #8
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52

Quote:
Originally Posted by Simon Anders View Post
Isoform deconvolution causes extra uncertainty because if a read maps to an exon shared by several isoforms, you don't know to which to assign it. This counts as technical noise, not biological noise. (You want to know how close you are to the actual concentration of the isoforms in the specific sample under consideration, not how close you are to the population average over all samples from the same condition.) The Jensen-Shannon divergence is a way to assess this uncertainty. This is all clearly "question A".
Got it! Thanx a lot for your valuable help.
Old 04-18-2011, 03:06 AM   #9
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52

Quote:
Originally Posted by Simon Anders View Post
Concerning the in-silico simulation: I have a simple script that produces a count table of simulated data, and if you use it to benchmark methods, DESeq comes out best. This is not at all surprising. When developing a method, I propose a statistical model that I think captures well the salient features of what is happening in real data, and if I write a simulation, I will, of course, use the same model. If another tool assumes another model, it will perform worse than my tool when tested with data from my simulation.

By this I mean: The difference between methods can be either about how to implement a test for a given model, or about the question for what model to implement a test. Simulation only helps to compare two different tests developed for the same model assumption. To test which model is more appropriate you need real data.

I see shortcomings in both our (DESeq's) and edgeR's assumptions about what real biological variance looks like. Whether these matter in practice can only be checked with (lots of) real data.
I was thinking more along the lines of a tool that can generate a transcriptome of known composition (with, e.g., 100 defined genes differentially expressed at various levels), and then simulate the sequencing of this transcriptome to generate actual reads. These could then be fed into something like tophat > htseq-count > deseq or tophat > cuffdiff, and the output list of DEGs compared with the one used as input for the transcriptome generation algorithm.
Old 05-26-2011, 07:30 AM   #10
Puva
Junior Member
 
Location: USA

Join Date: Apr 2011
Posts: 5

I analysed my data with edgeR and got 490 differentially expressed genes (300 up-regulated and 190 down-regulated). I am a little bit confused. I am not sure what cut-off value is used to categorise up- and down-regulation; when I checked the fold changes, I could not find it. If I want to do a pathway analysis, I am not sure which value I have to use, i.e., fold change or p-value. Can anybody help me? I am not good at statistics.
Old 03-30-2012, 11:51 AM   #11
wesserg
Junior Member
 
Location: University of Warsaw - Poland/ GHSU - USA, Augusta GA

Join Date: Feb 2012
Posts: 4

Quote:
Originally Posted by marcora View Post
I am performing DE analysis of RNAseq datasets using DESeq, edgeR, and CuffDiff. While DESeq and edgeR (based on very similar models) agree for the most part, I have found that CuffDiff gives very different results.

In the absence of a synthetic "truth dataset" for RNAseq (a simulated dataset where I know a priori the set of differentially expressed genes), I am unable to decide which package is more sensitive/accurate. I am actually quite amazed that tools which are extensively used by the RNAseq community have not been tested in this manner.

Does anybody know where I can find such a dataset or a tool that can generate one???
Hello marcora.

It is a pretty old thread, but relevant for evaluating many RNA-seq statistical methods. I was wondering if you were successful in your search for a benchmark dataset to evaluate methods for differential expression detection?

Precisely: we need some artificial BAM files in which we decide in advance which hypothetical genes will turn out differentially expressed after processing the BAM files. Those files do not need to reflect any real biological behavior.
__________________
Sergiusz Wesolowski
Old 03-30-2012, 11:57 AM   #12
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141

DESeq provides a test data set "pasilla," described in the vignette. Not sure if that's what you are looking for.
Old 03-30-2012, 12:43 PM   #13
wesserg
Junior Member
 
Location: University of Warsaw - Poland/ GHSU - USA, Augusta GA

Join Date: Feb 2012
Posts: 4

@greigite

Thank you for your response, but unfortunately, if I understand correctly what the "pasilla" dataset is, it is not what I am looking for.

I was going through it just now (I didn't know the dataset existed before) and the problem is that we do not know which genes in this dataset are truly differentially expressed between conditions, and this is the main property we want the dataset to have.

Secondly, the dataset contains "count data", which means we will not be able to use it for any comparisons with RPKM/FPKM-based methods, particularly cuffdiff.
Old 03-30-2012, 01:27 PM   #14
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52

Quote:
Originally Posted by wesserg View Post
Hello marcora.

It is a pretty old thread but relevant for evaluating many RNA-seq statistical methods. I was wondering if you were successful in your searches for benchmark dataset to evaluate any methods for differential expression detection?

Precisely: we need some artificial, BAM files in which we decide which hypothetical genes will turn out differentially expressed, after processing the BAM files. Those files do not need to reflect any real biological behavior.
The best I could find is here: http://bioserver.hci.utah.edu/TestDatasets/
Old 03-30-2012, 11:06 PM   #15
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
I want to add:

-SAMseq,
-PoissonSeq, and
-npseq
(all from the entourage of robert tibshirani)
http://www-stat.stanford.edu/~tibs/SAM/
http://www.stanford.edu/~junli07/research.html

they all work with permutations (non-parametric) and not with a statistical model (e.g. Poisson, negative binomial) and are therefore not sensitive to co-regulated genes.
in my hands they worked best for an experimental design with 12 biological replicates.

and some of them are able to use matched-pair designs, more than two groups, right-censored outcomes (OS, DFS, PFS), and quantitative outcomes.

i don't understand why these packages are so unknown...
Old 04-01-2012, 01:26 PM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 993

Quote:
i don't understand why these packages are so unknown...
Very simple: Permutation-based tests cannot be used if you only have very few replicates, and this is usually the case. I am not aware of a single published dataset that compares samples from two different biological conditions and has more than four biological replicates per condition. You are unusually fortunate to have such a large data set.

Once you have many replicates, permutation tests are in fact the way to go.
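The replicate requirement is easy to make concrete: an exact permutation test enumerates every relabelling of the samples, so the smallest attainable p-value is fixed by combinatorics. A sketch in plain Python:

```python
from itertools import combinations

def permutation_pvalue(group_a, group_b):
    """Exact permutation test for a difference in group means:
    enumerate every way of relabelling the pooled samples and count
    how often the mean difference is at least as large as observed."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    as_extreme = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed - 1e-12:
            as_extreme += 1
        total += 1
    return as_extreme / total
```

With 2 vs 2 replicates there are only C(4,2) = 6 relabellings, so even a perfectly separated comparison like [1, 2] vs [30, 40] cannot score below p = 2/6, approximately 0.33. With 12 vs 12 replicates there are C(24,12), about 2.7 million, and very small p-values become attainable.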
Old 04-01-2012, 02:00 PM   #17
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319

Quote:
they all work with permutations (non-parametric) and not with a statistical model (i.e. poisson, neg. binomial) and are therefore not sensitive to co-regulated genes.
How is the "permutation vs. model-based" distinction related to co-regulated genes?
Old 04-02-2012, 11:14 AM   #18
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
i am no hard-core statistician,

but co-regulated genes affect (and disturb) the distributional assumptions used in model-based approaches. the null hypotheses generated by permutations are independent for every gene.
Old 04-04-2012, 03:19 AM   #19
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 108

Dear Dietmar13,

let me respectfully disagree. Permutation tests, if applied properly, do respect the correlations between the genes and can model them as part of the null hypothesis.

Also, approaches that test gene-by-gene ("marginally", in the language of statisticians), including distribution- or model-based ones, are not directly affected by correlations between genes.

A more serious problem is correlation between samples, e.g. because of batch effects or poor experimental design.

Best wishes
Wolfgang
__________________
Wolfgang Huber
EMBL
Old 04-04-2012, 07:24 AM   #20
wesserg
Junior Member
 
Location: University of Warsaw - Poland/ GHSU - USA, Augusta GA

Join Date: Feb 2012
Posts: 4
RNA-Seq simulator output

Thank you all for interesting suggestions and comments.

@marcora: Indeed this dataset seems to be what I am looking for, but do you have any idea what file format it is? I was trying several standard converters, but am not able to deal with it. Anyway, thank you very much for sharing the link ;]

Problem:
I was also trying to generate such a dataset myself using the RNA-Seq Simulator from the USeq bundle, but again, I have no idea where the output format is specified. Has anyone had previous experience with the RNA-Seq Simulator?