
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
using Cuffdiff with biological replicates  Jane M  RNA Sequencing  0  09012011 01:42 AM 
overdispersion and biological replicates  shilez  Bioinformatics  3  08292011 08:43 AM 
overdispersion and biological replicates  shilez  RNA Sequencing  0  08252011 07:10 PM 
cuffdiff with biological replicates  PFS  Bioinformatics  1  06142011 07:51 PM 
Inconsistency between biological replicates  Nicholas_  Bioinformatics  1  04062011 04:18 AM 

Thread Tools 
10012010, 02:20 PM  #1 
Junior Member
Location: NY Join Date: Oct 2010
Posts: 5

DEG for paired samples, biological replicates
Hi.
I am new to the seq world, although with experience in microarray aand teh omic world. I want to know which program is available to compare RNA seq data when you have biological replicates and paired samples. I have samples for 3 patients, 2 samples each, control and lesion. I have being playing with cufflinks but not sure if it does handle this kind of experimental design. 
10042010, 09:19 AM  #2 
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212

Cuffdiff now supports replicates, so it should handle this sort of setup

10052010, 02:02 AM  #3 
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994

Cole, if cufflinks can handle paired samples, I'd be really impressed, but I wonder if you simply were to fast with your reply and overlooked the fact that the question was about paired samples. If not, please correct me.
For readers unfamiliar with the issue, a quick reminder of textbook knowledge with an example for a paired t test: Imagine you have 5 sample, for which you measure some quantitative trait before and after a certain treatment. Let's say this trait varies according to a normal distribution and the data looks like thisTo come back to DEGs: Whenever samples are paired (i.e., the same sample is measured twice under different conditions) and unless the difference between the samples is typically much smaller than the effect of the treatment, we dramatically loose statistical power if our method is unable to make use of the pairing information. To my knowledge, none of the currently released tools can do this, though. We have recently expanded our DESeq package to be able to fit generalized linear models (GLMs), and these can be used to model the pairing. Unfortunately, our method to estimate the dispersion (which, for count data, takes the role of the standard deviation of the differences in the example above) does not work for paired designs. We have some ideas how to get around this and are testing them at the moment, but it does not yet work as well as we would like it. As far as I know, the edgeR people seem to pursue similar ideas. DESeq offers a function for a "variance stabilizing transformation" which translates the count data onto a continuous scale such that it becomes approximately homoscedastic. This allows then to use tools that worked well for microarrays, such as pairwise ttests or Smyth's 'limma' package. However, the transformation costs power and introduces bias in case the library sizes are too different. Still, it may a good way to get started. Simon 
10052010, 05:47 AM  #4 
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212

Cuffdiff doesn't explicitly support sample pairing, hypatia, but I suggest you try the newest version of Cuffdiff (0.9.1) if you haven't already, as it should get you started pretty quickly. You may see fewer differentially expressed genes due to a loss in power, but hopefully being able to find differentially spliced genes or those undergoing shifts in promoter preference will make up for it

10052010, 06:45 AM  #5 
Junior Member
Location: NY Join Date: Oct 2010
Posts: 5

My first objective is really differential expression, specially in the lower range of expression and I already know that in my disease model, the correlation of each patient fold change is not high, around 0.4, so the paired model will make a huge difference here.

10052010, 08:54 PM  #6 
Member
Location: Melbourne Join Date: May 2010
Posts: 16

GLMs in edgeR
Hi hypatia and others
We have recently implemented GLM methods in edgeR, so the package can now deal with paired designs as well as other more complicated designs as well. One of the PhD students in our division has been working on using CoxReid conditional inference to estimate the dispersion parameter for the negative binomial model. This approach does take into account the paired nature of the samples (or indeed works whatever the experimental design) and has been giving us very reasonable results in our testing. Following Simon's work with DESeq, the edgeR methods can now also add a gene abundancerelated trend on the dispersion estimates. These new methods, the GLMs with CR estimation of the dispersion (plus a whole lot of other improvements), are all implemented in the current development version of edgeR in the Bioconductor repository. They will be rolled out into the release version with the release of Bioconductor 2.7 on 18 October. We'll be adding to the documentation over the next couple of weeks to get some examples in there of using these methods on paired designs and other more complicated experimental designs. The new edgeR methods have been developed with exactly this sort of application in mind, so I certainly encourage you to give them a try. We'd be really interested in how they work for you. Best regards Davis 
10062010, 09:49 AM  #7 
Member
Location: Berkeley, cA Join Date: Feb 2010
Posts: 40

Hi Hypatia,
From my current reading of the literature, it seems to me that baySeq may be a good solution for you right now: http://www.biomedcentral.com/14712105/11/422/ Regarding Cufflinks, I would like to correct Simon who is not a developer of the program and therefore may not understand it completely (Simon, please correct me if I am wrong). It is not accurate to say that it does not support sample pairing. What Cufflinks does is estimate expression values according to a generative model of the sequencing process, one that currently takes into account sequencing bias of various kinds. The value of paired samples is that the experimental bias should be similar in the pairs, and this will be implicitly "learned" by Cufflinks, so that its not clear to me a priori that it will produce inferior results to methods that learn the distributions of counts. 
10062010, 10:12 AM  #8  
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994

Quote:
Quote:
I'm a bit puzzled what you might mean by "implicit learning", and by the fact that you talk about sequencing bias. (The value of paired designs, as I understand the term, is not to reduce bias but to reduce variance.) Anyway, I guess, this discussion has to wait until you have written up and published the method behind the new biological replicate functionality. Simon 

10062010, 04:06 PM  #9 
Member
Location: Berkeley, cA Join Date: Feb 2010
Posts: 40

Regarding baySeq, I am not an author on that software so I cannot speak for the details of it. I just mentioned it because it seemed like they do a lot of things right on a lot of aspects of differential expression analysis. They actually say in their paper that they do not handle paired samples, but as you pointed out no program currently does, and many of the other details matter as well.
The relationship between bias and variance is that bias causes variance. For an explanation of this see http://genomebiology.com/2010/11/5/R50 
01182011, 03:52 PM  #10  
Junior Member
Location: Boston Join Date: Jan 2011
Posts: 1

Hi Davis,
do you have any updates about the documentation for using edgeR in dataset with paired samples design? You also mentioned some "very reasonable results in our testing"... are these results publicly available now? All the best! Quote:


01302011, 04:29 PM  #11 
Member
Location: Melbourne Join Date: May 2010
Posts: 16

Hi f1boston
The edgeR functions for analysing differential expression with a paired samples design are documented in the package. We are planning on updating the User's Guide substantially to include better examples of using the GLM methods with paired and other experimental designs, but unfortunately this has taken a back seat while we have been putting a lot of work into the development of the new methods. We don't have any publicly available results as such  but all of the methods are available through Bioconductor so anyone could test them themselves. I certainly encourage you to do so! Finding a suitable yardstick for comparison with other/previous methods is difficult  hence why I said that the results look reasonable. They do look reasonable, but the actual 'truth' is not known in the datasets we have seen. The important point is that the GLM methods can properly analyse paired designs, whereas our older methods could not. If you have more specific questions that I may be able to help you with please feel free to get in touch. Best regards Davis 
10272011, 10:58 AM  #12  
Junior Member
Location: NYC Join Date: Apr 2011
Posts: 5

How many pairs is need to gain statistic power for paired study using GLM methods
Hi Davis,
I was wondering using edgeR GLM method for paired study, how many paired would be required for gaining certain statistic power? Cheers, Sheng Quote:


01082012, 06:30 PM  #13 
Member
Location: Melbourne Join Date: May 2010
Posts: 16

Hi Sheng
edgeR can indeed be used to analyse RNAseq data from paired designs, but your question is far, far too vague for me to be able to give you any sensible answer about how many samples you need. In general, the answer is "as many as you can afford". Cheers Davis 
01112012, 10:58 AM  #14 
Member
Location: Wisconsin Join Date: Jun 2011
Posts: 87

Simon,
I am sorry if this sounds stupid, but how does the experimental design of the poster differ from the pasillaGenes example in the DESeq documentation? I am probably not able to get a handle on what the poster might mean by paired samples, could you help me understand what we mean by a paired sample experimental design? Thanks, Praful Last edited by aggp11; 01112012 at 11:06 AM. 
01112012, 11:17 AM  #15 
Member
Location: Wisconsin Join Date: Jun 2011
Posts: 87

Simon,
Nevermind, I think I know understand the issue now. Thanks, Praful 
06032013, 05:41 AM  #16  
Member
Location: Switzerland Join Date: Apr 2013
Posts: 55

Quote:
I tried to use edgeR to analyse paired samples, and I also would like to use DESeq, but I just found "multifactor designs" using GLM in the DESeq manual. This cannot be used for paired samples, right? Could you give me any suggestion to analyse paired samples using DESeq? I noticed this thread was posted long time ago, so maybe there is new method....? Thank you! Best, Sadiexiaoyu 

06032013, 08:40 AM  #17  
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994

Quote:
Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows: Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). Voilà, a paired test. AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject". 

06032013, 01:34 PM  #18  
Member
Location: Switzerland Join Date: Apr 2013
Posts: 55

Quote:
Thank you for your answer. But I still did not get it clearly. 1) Do you mean using design like this: >design=data.frame(subject=c("sample1","sample2","sample3"),treatment=c("b,n","b,n","b,n")) > design subject treatment 1 sample1 b,n 2 sample2 b,n 3 sample3 b,n However, my count_data "sample1b“, ”sample1n","sample2b","sample2n","sample3b","sample3n",how can I combine my count_data with this design? Best, Sadiexiaoyu 

06032013, 01:35 PM  #19  
Member
Location: Switzerland Join Date: Apr 2013
Posts: 55

Quote:
I tried to use DESeq to do this paired analysis in the following (using DESeq manual "multifactor comparison" as reference), is this right? > x<read.delim("matrix for DESeq_n_b_new.txt",row.names=1,stringsAsFactors=FALSE) > head(x) X1b X2b X3b X1n X2n X3n gene1 1562 1380 1533 821 799 820 gene2 74 79 83 51 57 45 gene3 997 837 1230 700 513 705 gene4 83 59 81 49 71 67 gene5 517 456 512 301 322 307 gene6 0 0 2 0 0 0 > design=data.frame(row.names=colnames(x),treatment=c("b","b","b","n","n","n"),subject=c("sample1","sample2","sample3")) > design treatment subject X1b b sample1 X2b b sample2 X3b b sample3 X1n n sample1 X2n n sample2 X3n n sample3 > cdsFull=newCountDataSet(x,design) and using > fit1=fitNbinomGLMs(cdsFilt,count~subject+treatment) > fit0=fitNbinomGLMs(cdsFilt,count~subject) > pvalsGLM=nbinomGLMTest(fit1,fit0) > padjGLM=p.adjust(pvalsGLM,method="BH") 

06032013, 01:36 PM  #20  
Member
Location: Switzerland Join Date: Apr 2013
Posts: 55

Quote:
You said "edgeR simply drops the last term in the full formula to get the reduced one", is that mean, for example, if I set design<model.matrix(~subject+treatment) in edge R, then edgeR just get "treatment" with adjusting "subject" differences; if I set design<model.matrix(~treatment+subject) in edgeR, then edgeR will get "subject" with adjusting "treatment" differences. Am I right? Actually, I used design<model.matrix(~subject+treatment) in edgeR, and I get the following results: > topTags(lrt) Coefficient: treatment n genes logFC logCPM LR PValue FDR 15498 gene10 3.460666 8.556419 400.1173 5.192712e89 9.278857e85 20675 gene8 4.608737 7.967542 394.4721 8.796487e88 7.859221e84 15990 gene5 3.287753 5.762309 377.7188 3.904981e84 2.325937e80 731 gene71 3.515135 7.259228 376.8333 6.087180e84 2.719295e80 17873 gene889 3.560003 7.921179 363.5712 4.698794e81 1.679255e77 334 gene34 3.320911 8.088810 362.5978 7.654637e81 2.279679e77 22052 gene2272 3.718390 8.846441 356.9188 1.319823e79 3.369132e76 8133 gene89 3.004279 6.273042 336.3495 3.980106e75 8.890063e72 23453 gene275 3.576274 7.439596 330.8462 6.287517e74 1.248352e70 5681 gene56 3.205917 6.550010 330.3990 7.868486e74 1.406020e70 > colnames(design) [1] "(Intercept)" "sample2" "sample3" "treatmentn" I think the logFC is the foldchange between treatment n vs treatment b, using sample1 as the baseline for adjusting the differences among individuals. Is that right? Best, Sadiexiaoyu 

Tags 
biological replicates, differential expression 
Thread Tools  

