SEQanswers

Old 03-15-2010, 01:59 AM   #1
Roman Bruno
Junior Member
 
Location: France

Join Date: Mar 2010
Posts: 4
Multiple DGE libraries comparison (edgeR, baySeq, DESeq)

Hi, I'm a student in an MSc Biostatistics programme in France, and I'm stuck with DGE data.

I'd like to compare multiple DGE libraries without replicates. I have 4 libraries describing 4 different conditions (young males/females, old males/females), and I'd like to find the top differentially expressed tags between all males vs. all females. But I can't find a solution, because performing all possible pair-wise tests across several libraries is statistically invalid and would lead to an unacceptable accumulation of false positives.
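
The scale of that accumulation is easy to see: under independence, k tests at level alpha give at least one false positive with probability 1 - (1 - alpha)^k, and 4 libraries already yield 6 pairwise comparisons. An illustrative Python sketch (assuming independent tests):

```python
# Chance of at least one false positive among k independent tests,
# each run at significance level alpha (the family-wise error rate).
alpha = 0.05
for k in (1, 6, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> P(at least one false positive) = {fwer:.3f}")
```

With the 6 pairwise comparisons among 4 libraries, the chance of at least one spurious hit per tag is already about 26%, before any multiple-testing correction.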

I've searched through the literature without success so far, and the 3 packages I'm using don't seem to handle this kind of comparison.

I would appreciate any form of help (literature, ideas, R packages...)

Thank you
Old 03-15-2010, 11:34 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

Hi Roman

The simple solution is as follows: consider the samples from the two male subjects as replicates, and the two female ones likewise, and go ahead using DESeq or edgeR.

You may now say: But an old man is no proper replicate for a young man. However, remember that the point of replicates is not to have two samples which are as similar as possible.

Rather, you have many experimental conditions: some of them you vary on purpose (here: the sex of the subject), and some you cannot control, either because you have to work with the samples that you got (here: age) or because you do not know them at all (here: medical history, race, accidental differences in the preprocessing of samples).

You use replicates for two reasons: (i) You want all the different possibilities for the unwanted condition variation to occur in equal proportions, so that they average out. If you don't even know what these differences are, your best bet is to take many random samples in the hope that they average out well. (ii) You want to know how much the samples vary due only to the unknown and not-controlled-for condition differences, so that you will call a difference with respect to the condition of interest significant only if it is sufficiently larger than what you observe when you keep the experimental condition fixed and let only the other factors (that you cannot control) vary.

In your case, you cannot control for age, simply because you do not have enough samples, so you are justified to treat age in the same way as you treat differences that you don't even know about, such as differences in the subject's living conditions, differences in the sample preparation etc.: namely as an unknown covariate that you have to average over, i.e., you ignore the age information.

Of course, if you had age-matched replicates, you could add age as a factor to your model and promote it to an explained covariate. This would give you additional statistical power, i.e., you would be able to call smaller differences significant. Nevertheless, your result from treating the non-age-matched subjects as replicates is statistically valid -- only unfortunate, as you won't get as many hits as you would with better replicates.

If you want to know how much you miss this way, you can check how much effect age has on expression levels by repeating the analysis with age as main difference and sex as ignored covariate, i.e., consider the two young subjects as replicates, and the two old ones likewise.

Actually, if you now managed to find more samples, so that you had replicates simultaneously matched for age and sex, you would find a new problem: to take advantage of this, you have to model both factors simultaneously, and this is something that, in fact, none of the current methods does well. You could use the VST from our DESeq package and then limma, but this costs a lot of power. I am currently looking into this and hope to figure out soon how to make a proper generalized linear model for the negative binomial distribution, so stay tuned.

Cheers
Simon
Old 03-16-2010, 12:57 AM   #3
Roman Bruno
Junior Member
 
Location: France

Join Date: Mar 2010
Posts: 4

It's the first solution I suggested to my professor, but he didn't like the fact that some counts (for the expressed tags) were similar between young males and old females.

Anyway, thanks a lot for your quick response; I'm going to follow your advice.

Regards

Roman
Old 03-16-2010, 01:57 AM   #4
sci_guy
Member
 
Location: Sydney, Australia

Join Date: Jan 2008
Posts: 83

I am a molecular biologist/biochemist, but I can also do bioinformatics and biostatistics (lite). I imagine your professor is from a biological background and doesn't fully appreciate the statistics. Most biologists I know can't grasp the finer details of replicates and multiplicity correction. Often I hear that if the replicates are almost identical then they are a waste of money, and that if the alpha level is 0.05 then doing a multiplicity correction is throwing out too much data. While I sometimes do not use an FDR and instead use the correlation of multiple probes within a gene, I appreciate that millions of chi-squared tests, Fisher's exact tests, t-tests, or ANOVAs will give many false positives.

He/she should have consulted a statistician before they designed the experiment and paid for the sequencing. If they are receptive, I would school them in some Stats 101. It might make their future experimental design a little better.

My colleague has battled for months with his supervisor to run 4 replicates per condition instead of 2 for a microarray experiment. T-tests with 2 reps are yuk! It's a false economy to run 2 reps (if the treatment effect sizes are modest).

Anyway, my rant aside, Simon has some excellent suggestions to get you out of your tricky situation.
Old 03-30-2010, 01:49 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

I've just got a private mail about this thread asking "Why is limma not a good option for doing this?", and I thought I'd better reply in a public post, as I should qualify my earlier statement a bit. Limma might actually work quite well; see below.

I write in some detail in the hope of stimulating an exchange of experience with other people trying to do similar analyses.

As I've written above, one may want to examine more complex contrasts than just comparing one treatment or experimental condition with another, and neither edgeR nor DESeq provides for this at the moment. Typically, if one would like to study the interaction of two or more conditions (i.e., to check whether the effect of A and B together differs from the sum of the effect of A alone and the effect of B alone), one would fit a linear model with an appropriate design matrix. For microarrays, Smyth's 'limma' software has proved very useful for such purposes, as it allows information from many genes to be pooled in order to get good variance estimates even with few replicates.

For RNA-Seq, one cannot use ordinary linear models (including 'limma') straight away, because sequence count data is neither homoscedastic nor can the residuals be expected to be normally distributed. A generalized linear model of the Poisson family (as used by a number of authors) is a bad idea, as it underestimates the variances, possibly severely, and so gives much too small p values.
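
To put a number on that underestimation: for negative-binomial counts with mean mu and dispersion alpha, the variance is mu + alpha * mu^2, while a Poisson model assumes it equals mu. A quick illustrative check in Python (numpy parameterizes the negative binomial by a size n = 1/alpha and p = n/(n + mu); the parameter values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, alpha = 100.0, 0.2          # made-up mean and dispersion for one gene
n = 1.0 / alpha                 # numpy's "size" parameter
p = n / (n + mu)

counts = rng.negative_binomial(n, p, size=100_000)

# Poisson would predict Var = mean; the NB variance is mu + alpha * mu^2.
print("mean:", counts.mean())        # close to 100
print("variance:", counts.var())     # close to 100 + 0.2 * 100**2 = 2100
```

A Poisson fit would thus understate the variance of this gene by a factor of about 20, and the factor grows with expression strength.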

Maybe one could use a form of generalized linear models tailored for the negative binomial case (or a quasi-Poisson GLM with proper variance estimation), and I intend to study this a bit.

Another approach is to transform the data to make it homoscedastic by means of a variance-stabilizing transformation (VST). Taking the logarithm would be a very simple approach to this end. Better results can be achieved by estimating the variance-mean dependence and then obtaining the corresponding VST numerically. The function 'getVarianceStabilizedData' of our DESeq package does this. (Use at least DESeq version 0.7.12; there was a bug in this function until yesterday.)
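
The effect of a VST can be checked numerically. For negative-binomial counts the raw variance mu + alpha * mu^2 grows with the square of the mean, while after a log transform the variance levels off (roughly 1/mu + alpha, by the delta method). An illustrative Python sketch, with a plain log(x + 1) standing in for the numerically derived VST and made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.2                     # made-up dispersion
n = 1.0 / alpha

raw_vars, log_vars = [], []
for mu in (100.0, 1000.0, 10000.0):
    p = n / (n + mu)
    x = rng.negative_binomial(n, p, size=100_000)
    raw_vars.append(x.var())                  # grows roughly like alpha * mu^2
    log_vars.append(np.log(x + 1.0).var())    # roughly constant across means
    print(f"mu={mu:7.0f}  raw var={raw_vars[-1]:12.1f}  log var={log_vars[-1]:.3f}")
```

The raw variance spans several orders of magnitude over this mean range, while the variance after the log transform stays nearly flat -- which is what makes an ordinary linear model applicable afterwards.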

Once one has variance-stabilized data, one can fit an ordinary linear model. As the replicate number is typically too small to get a proper per-gene fit, one needs to pool variance information, and this is what limma's 'empirical Bayes' functionality does.

I've tested this approach on a couple of data sets, checking for differential expression in a simple comparison of one condition with another, first with DESeq's 'nbinomTest' and then with 'limma' acting on the data after the VST. The results are very mixed: sometimes limma finds about 2/3 as many hits as DESeq, but often it does not find any at all. This is probably because limma is designed for data with normally distributed error terms; the VST only makes the data homoscedastic, but the residuals are still far from normal.

For simulated data with constant dispersion, limma's hits are always a subset of DESeq's hits, while for real data, many limma hits are not shared with DESeq, because limma's eBayes procedure adjusts DESeq's variance estimate on a per-gene basis. I am not sure whether this is good or bad. If one does not want to trust it, one could set the variance to the common estimate from all the VST data, i.e., simply resort to a z test instead of using limma.

If anyone out there also wants to fit interaction contrasts, I'd be glad to exchange experiences to find out what works best.
Old 03-31-2010, 10:57 AM   #6
Boel
Member
 
Location: Stockholm, Sweden

Join Date: Oct 2009
Posts: 62
Fitting interaction contrasts

The design of my current project will need a model such as limma's: I will have data from several different cell types, some of them activated, some not, etc.

Would your reasoning change in any way when assessing normalized data, such as RPKM or FPKM? Could I get limma to work using FPKM values?

Thanks,
Boel
Old 04-01-2010, 12:34 AM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

It should work with both variance-stabilized count data and FPKM data. However, FPKM data needs to be normalized and variance-stabilized too before you can give it to limma, and I suspect this is harder there. The reason is that with the normalization by transcript length, you lose the information about the initial count value, i.e., you do not know whether these are many counts from a long transcript or few counts from a short one. Even if both cases give the same FPKM value, they will have different variances, but limma will have a hard time noticing this.

Hence, I would go for count data and the VST that we describe in our paper.

You can try both and compare. The crucial part is to remember that it is not about "the more hits, the better". You want to be sure that you control the type-I error, i.e., that you do not get more false positives than indicated by the nominal false discovery rate.

A good way to check whether a testing procedure maintains type-I error control is to use biological replicates and test for differential expression between the replicates. As there should be none, the p values should look uniform. Examine the p values for weakly and for strongly expressed genes separately to check for bias.
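
This diagnostic can be sketched as follows (illustrative Python; Gaussian noise and a known-variance z test are used so that the null p values are exactly uniform, whereas with real count data you would plug in the testing procedure under evaluation):

```python
import math
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(2)
norm = NormalDist()

# Mock replicate-vs-replicate comparison: both "conditions" are drawn
# from the same distribution, so every gene is truly null.
n_genes, reps = 5000, 3
a = rng.normal(size=(n_genes, reps))
b = rng.normal(size=(n_genes, reps))

# z statistic for the difference of group means (known unit variance).
z = (a.mean(axis=1) - b.mean(axis=1)) / math.sqrt(2.0 / reps)
pvals = np.array([2.0 * (1.0 - norm.cdf(abs(v))) for v in z])

# Under the null the p values are uniform: a histogram should be flat,
# and about 5% of them should fall below 0.05.
print((pvals < 0.05).mean())
```

A procedure that yields an excess of small p values in such a replicate-vs-replicate comparison is anti-conservative and cannot be trusted to control the FDR.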

I am not sure, though, that this can be done with limma out of the box if you have fewer than four replicates. But it should work with some tweaking.

Simon
Old 04-01-2010, 10:02 AM   #8
lpachter
Member
 
Location: Berkeley, CA

Join Date: Feb 2010
Posts: 40

I'd like to clarify one point here:

There is no such thing as "FPKM data" (or, for that matter, "RPKM data"). The only data in an RNA-Seq experiment are the read counts. FPKM is a unit for measuring expression (expected fragments per kilobase of transcript per million reads sequenced). It is just a constant multiple of the estimated fraction of transcripts in the sample.

Now in terms of differential expression analysis, Simon is correct that for a measurement to be useful (whatever units it is in) it is important to know its variance. That is why in Cufflinks, a program that reports transcript expression in FPKM units, the variance is reported as well.
Old 04-01-2010, 11:00 AM   #9
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212

One more thing: when cuffdiff reports the results of differential testing, it's being performed on the actual counts (after log transformation) for the reasons that Simon suggests.
Old 04-03-2010, 09:39 AM   #10
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191

Quote:
Originally Posted by Cole Trapnell
One more thing: when cuffdiff reports the results of differential testing, it's being performed on the actual counts (after log transformation) for the reasons that Simon suggests.
Cole,

Do you think it would be a good idea to use quantile normalization on the tag counts before running cuffdiff, or is this unnecessary given that it doesn't actually use RPKM values (which I didn't know, but am glad to hear)?
Old 04-05-2010, 01:28 AM   #11
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269

Quote:
Originally Posted by lpachter
There is no such thing as "FPKM data" (or for that matter "RPKM data"). The only data in an RNA-Seq experiment is the read counts.
I've heard from people working on alignment that "the only data is the read sequences", and from people working on base calling that "the only data is the raw images"... at least there is one consensus: that's certainly a lot of data.
Old 04-05-2010, 04:35 AM   #12
Boel
Member
 
Location: Stockholm, Sweden

Join Date: Oct 2009
Posts: 62

Quote:
Originally Posted by Cole Trapnell
One more thing: when cuffdiff reports the results of differential testing, it's being performed on the actual counts (after log transformation) for the reasons that Simon suggests.
Is there a way to output the actual counts?
Old 04-06-2010, 10:15 AM   #13
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 64

Hi Simon,

I am trying to install DESeq on Mac OS X Leopard. I get the error message "In getDependencies(pkgs, dependencies, available, lib) : package ‘DESeq’ is not available". Any clue why this is happening? I found that DESeq needs Biobase to be installed; I have done that, and yet I still get this error. Any ideas?

Thanks
Abhijit
Old 04-06-2010, 11:21 AM   #14
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

Hi Abhijit

have you followed the instructions on http://www-huber.embl.de/users/anders/DESeq/ ?

If so, please send me the output from the installation. The real error message should be hidden in there.

Cheers
Simon
Old 04-07-2010, 01:52 AM   #15
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109

Quote:
Originally Posted by lpachter
I'd like to clarify one point here:

...for a measurement to be useful (whatever units it is in) it is important to know its variance. ....
While we're at it...: there is also no such thing as "the variance of a measurement". There is the sample variance of a set of measurements; and there is the variance of a statistical ensemble (physicists' language) or distribution in a probabilistic model (mathematicians' language), whose "true" value we often do not know, but which we can aim to estimate from data.

The choice of ensemble matters: repeated measurements on the same sample using the same machine will have smaller variance than repeated measurements using different machines, or different biological replicates, e.g. different cell passages.
__________________
Wolfgang Huber
EMBL
Old 04-07-2010, 07:52 AM   #16
lpachter
Member
 
Location: Berkeley, CA

Join Date: Feb 2010
Posts: 40

Dear Wolfgang,

Thank you for pointing out my typo -- namely, that "measurement" should have been "estimate".
Old 04-08-2010, 10:25 AM   #17
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 64

Hi Simon,

Thank you for your help; I got DESeq to work. However, I am trying to figure out a good value to use for the FDR cutoff. This may be a very stupid question, but I'll ask anyway: is an FDR of 10% more stringent than an FDR of 1%? I am thinking in terms of p values, where smaller numbers mean higher stringency. Should I think of the FDR along the same lines, or the opposite?

Thanks
Abhijit
Old 04-08-2010, 11:00 AM   #18
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

The same lines.

A false discovery rate (FDR) of 10% means that (in expectation) 10% of your hits are false positives. To control the FDR at 10%, you adjust the raw p values for multiple testing with an FDR-controlling method such as the one by Benjamini and Hochberg or the one by Storey, and then take all genes with an adjusted p value below the desired FDR. DESeq does this for you, using the Benjamini-Hochberg method.

Controlling at FDR 10% is a common, but, of course, arbitrary choice. It is up to you to decide how stringent you want to be.

The concept of the FDR was introduced by Benjamini and Hochberg in 1995 (J Roy Stat Soc B 57:289). Since then, there has been a lot of research on how best to adjust p values for FDR control, and on how well this works in the case of correlated tests.
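
For the curious, the Benjamini-Hochberg step-up adjustment is only a few lines. An illustrative Python sketch (mirroring what R's p.adjust(..., method = "BH") computes; the p values are made up):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Raw BH values m * p_(i) / i for the sorted p values ...
    scaled = m * p[order] / np.arange(1, m + 1)
    # ... made monotone by a cumulative minimum from the right.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(monotone, 1.0)
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.063, 0.074, 0.205, 0.212, 0.216]
adj = bh_adjust(pvals)
print(adj)                  # the first five entries fall below 0.1
print((adj < 0.1).sum())    # 5 genes pass at FDR 10%
```

Selecting the genes with an adjusted p value below 0.1 then controls the FDR at 10%, which is exactly what filtering on DESeq's padj column does.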
Old 04-08-2010, 02:20 PM   #19
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 64

Hi Simon,

How can I set the FDR to 1% instead of 10%? Is there a way of doing this through the nbinomTest function? I am not an R programmer, so I do not know.

Thanks
Abhijit
Old 04-08-2010, 10:40 PM   #20
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

If you followed the example in the vignette, you will have come across the line

resSig <- res[ res$padj < .1, ]

This is where we select the genes that we want to consider significant. res$padj is the Benjamini-Hochberg-adjusted p value, and selecting all genes with padj below 0.1 controls the FDR at 10%. To get 1%, just put .01 here instead.