#1 |
Senior Member
Location: Charlottesville, VA Join Date: May 2011
Posts: 112
I'm using cuffdiff for a differential expression analysis and the cummeRbund R package for followup analysis/visualization. I have 5 biological replicates for control, 3 biological replicates for mutant condition. Based on the cuffdiff documentation I ran the following cuffdiff command:
Code:
cuffdiff cuffcmp.combined.gtf \
    c1.bam,c2.bam,c3.bam,c4.bam,c5.bam \
    m1.bam,m2.bam,m3.bam

My question is, is this the proper way to specify biological replicates? The documentation seems to suggest that each comma separated file is a technical replicate of the same sample, but it isn't clear. Later on, when I went to use cummeRbund to do some visualization (e.g. boxplots, heatmaps), I'm only getting results for two samples, like it combined the FPKM values across all the alignments for each comma separated list.
Code:
library(cummeRbund)
cuff <- readCufflinks()
# make boxplot
csBoxplot(genes(cuff))
# get the top 100 diff expr genes
gene.diff <- diffData(genes(cuff))
gene.diff.top <- gene.diff[order(gene.diff$q_value),][1:100,]
# gene ids of top 100 diff expr genes
myGeneIds <- gene.diff.top$gene_id
# get genes
myGenes <- getGenes(cuff, myGeneIds)
# make a heatmap
csHeatmap(myGenes, cluster="both")

Thanks in advance!
#2 |
Member
Location: Sweden Join Date: Nov 2009
Posts: 83
I have understood the cuffdiff documentation the same way: each comma-separated list of files represents replicates (not necessarily technical) of a sample group.
To me, cummeRbund is giving you what I would expect: a box plot per condition that you input to cuffdiff, and an indication of the variation within each of those groups. If you want to see the variation between the individual replicates, I think you will need to run cuffdiff again, treating each replicate as a separate sample. This makes sense when you look at the output files produced by cuffdiff, as these are what cummeRbund reads in.
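A rough sketch of the per-replicate run suggested above, assuming the BAM and GTF names from the first post. It only prints the command instead of running it; the `per_sample_cmd` helper and the `out_per_sample` output directory are made up for illustration. Files separated by spaces are treated as separate samples, files joined by commas as replicates of one condition.

```shell
#!/bin/sh
# Sketch only: prints a cuffdiff command that treats every BAM as its own
# sample (space-separated) instead of grouping replicates with commas.
# per_sample_cmd and out_per_sample are hypothetical names.
per_sample_cmd() {
    gtf=$1
    shift
    labels=""
    for bam in "$@"; do
        name=$(basename "$bam" .bam)
        # Append to the comma-separated label list for -L
        labels="${labels:+$labels,}$name"
    done
    echo "cuffdiff -o out_per_sample -L $labels $gtf $*"
}

per_sample_cmd cuffcmp.combined.gtf \
    c1.bam c2.bam c3.bam c4.bam c5.bam m1.bam m2.bam m3.bam
```

Writing to a separate output directory keeps the grouped and per-replicate results side by side, so either one can be loaded into cummeRbund later.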
#3 |
Senior Member
Location: Charlottesville, VA Join Date: May 2011
Posts: 112
@natstreet: Thanks. I believe I can get what I want by doing as you say - cuffdiff with each bam file separate. I had in mind something that I would get from arrayQualityMetrics for array data - distribution info for each sample.
#4 |
Member
Location: Cambridge, MA Join Date: Feb 2008
Posts: 82
You are correct about the output of cuffdiff and cummeRbund. cuffdiff takes the replicate information and uses it in its modeling of the sequencing data to more accurately represent the FPKM and confidence intervals for an entire condition (not sample). In the case of biological replicates (which, I might add, you are specifying correctly in the cuffdiff arguments), the resulting conf_hi and conf_lo intervals represent biological variability for FPKM values from a given condition. In the absence of biological replicates (in which case the boxplot and heatmap would show you individual samples), the conf_hi and conf_lo values would represent an estimate of the variability of the given gene/feature based on the fitting of the model, but would not include any measure of biological variability. Hope this helps...
FYI, I am interested in generating/aggregating sample-specific statistics to present to the user in future versions of cummeRbund to try and include more QC-type information. Cheers, Loyal |
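To see the per-condition values described above directly in the cuffdiff output, something like the following sketch may help. It assumes the tracking table is tab-separated with headers of the form `<label>_FPKM`, `<label>_conf_lo` and `<label>_conf_hi`. The `condition_interval` helper is hypothetical, and the toy table contains made-up numbers that only show the layout.

```shell
#!/bin/sh
# Sketch: list FPKM and its confidence interval for one condition from a
# cuffdiff-style tracking table. Columns are located by header name.
# Assumption: headers look like <label>_FPKM, <label>_conf_lo, <label>_conf_hi.
condition_interval() {
    label=$1
    file=$2
    awk -F '\t' -v lab="$label" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == lab "_FPKM")    f  = i
                if ($i == lab "_conf_lo") lo = i
                if ($i == lab "_conf_hi") hi = i
            }
            next
        }
        f && lo && hi { print $1 "\t" $f "\t" $lo "\t" $hi }
    ' "$file"
}

# Toy table with made-up numbers, only to show the column layout.
toy=$(mktemp)
printf 'tracking_id\tcontrol_FPKM\tcontrol_conf_lo\tcontrol_conf_hi\n' >  "$toy"
printf 'XLOC_000001\t10.0\t8.0\t12.0\n'                                >> "$toy"

condition_interval control "$toy"
```

With grouped replicates there is one such triple per condition rather than one per BAM file, which is why the boxplot in the first post shows only two samples.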
#5 | |
Member
Location: Baltimore, MD Join Date: Mar 2011
Posts: 19
Quote:
Thanks, Carlos |
#6 |
Senior Member
Location: Charlottesville, VA Join Date: May 2011
Posts: 112
#7 |
Member
Location: phoenix Join Date: Oct 2011
Posts: 59
Code:
cuffdiff ref.gtf \
    c1.bam,c2.bam,c3.bam,c4.bam \
    m1.bam,m2.bam,m3.bam,m4.bam

For example, if AMD wet and dry are two sub-phenotypes of AMD, and there are 4 aligned bams from different individuals available for each sub-phenotype, would the above code work? In other words, will the above code produce differential expression values between the two sub-phenotypes?
#8 |
Member
Location: usa Join Date: Jan 2012
Posts: 21
I had the same question. The first time, I treated the 3 biological replicates of each condition as a group, separated by ",", with -L KO,WT. The second time, I treated every single sample as its own condition, but labeled them -L KO,KO,KO,WT,WT,WT.
The results are very different. For the knocked-out gene, gene_exp.diff from the first analysis has 2 rows (2 different ids, XLOC_016359 and XLOC_016932), neither of them significant. In the second analysis, it has 30 rows (the same 2 ids repeated 15 times): cuffdiff did 15 comparisons, KO1 vs KO2 ... WT2 vs WT3. All the comparisons between KO and WT are significant (as they should be, since the gene is knocked out).
I wonder how cuffdiff works. In the first setting, -L KO,WT ko1,ko2,ko3 wt1,wt2,wt3, did it only take ko1 and ko2? Probably not, because when I compared the values (value_1 and value_2) to the second analysis, none of them matched. The second analysis seems right, and every comparison is between 2 samples. Should I take the values from the single samples, treat them as readouts and do some statistics on those, like median, rank test, p-value, q-value? I wonder about the difference between the 2 analyses: which one is more accurate?
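The row counts reported above are consistent with cuffdiff testing every pair of conditions it is given. A small sketch of that arithmetic, assuming n conditions give n(n-1)/2 pairwise tests (the `pairwise_comparisons` helper is made up for illustration):

```shell
#!/bin/sh
# Number of pairwise tests among n conditions: n * (n - 1) / 2
pairwise_comparisons() {
    n=$1
    echo $(( n * (n - 1) / 2 ))
}

pairwise_comparisons 6   # six BAMs given as separate samples -> 15
pairwise_comparisons 2   # two comma-grouped conditions (KO, WT) -> 1
```

Two ids times 15 comparisons gives the 30 rows seen in the second run, and two ids times 1 comparison gives the 2 rows seen in the first.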
#9 |
Member
Location: Stockholm Join Date: Jun 2012
Posts: 18
I got a weird result with the boxplot: no boxes with medians are displayed, only a 'dot plot'. I tried with and without the replicates=T option, and both produced exactly the same image as the attached one. Weird, isn't it? It could at least show me the dot plot of the replicates. = =!
I have three replicates at each of four time points of a cell line after drug treatment (3+3+3+3). I ran cuffdiff this way: T0 (3 replicates at time point 0) versus T1 (3 replicates), T0 versus T2 (3 replicates), and T0 versus T3 (3 replicates). I didn't use the time series option, which seems to be a better way to do it after I checked this forum. I used cummeRbund to analyze the three comparisons separately, and all three boxplots I got look like the attached image, without boxes and medians. There is another problem with the gene expression barplot of EGFR; see the attached file. Can anyone tell me what is wrong or what mistake I made? Thank you in advance!
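For reference, a single run over all four time points might be assembled as in the sketch below, which only prints the command. The file names (`t0_1.bam` and so on), `ref.gtf`, the `out_timeseries` directory and the `replicate_group` helper are assumptions for illustration. `-T` is cuffdiff's time-series switch and expects the groups in increasing time order. As far as I understand, it limits testing to consecutive time points, so dropping `-T` would be the way to get T0 against every later point in one run.

```shell
#!/bin/sh
# Build one comma-separated replicate group, e.g. t0_1.bam,t0_2.bam,t0_3.bam
# (hypothetical file naming: <prefix>_<replicate>.bam)
replicate_group() {
    prefix=$1
    count=$2
    group=""
    i=1
    while [ "$i" -le "$count" ]; do
        group="${group:+$group,}${prefix}_${i}.bam"
        i=$((i + 1))
    done
    echo "$group"
}

# Print (not run) one cuffdiff call covering all four time points.
echo "cuffdiff -T -o out_timeseries -L T0,T1,T2,T3 ref.gtf" \
    "$(replicate_group t0 3)" "$(replicate_group t1 3)" \
    "$(replicate_group t2 3)" "$(replicate_group t3 3)"
```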
#10 | |
Member
Location: Cambridge, MA Join Date: Feb 2008
Posts: 82
Quote:
-Loyal |
#11 | |
Member
Location: Stockholm Join Date: Jun 2012
Posts: 18
Quote:
I sent you a message, Please check. |
#12 |
Member
Location: Cambridge, MA Join Date: Jan 2010
Posts: 27
Hi,
I was wondering if anyone had more thoughts about dejavu2010's comment... It is still not clear to me if Cuffdiff is built to analyze differential expression for groups of biological replicates.
More specifically: We are interested in comparing a group of 10 RNA-Seq samples from healthy individuals with a group of 10 RNA-Seq samples from diseased ones (in the initial phase without accounting for covariates). We do expect the samples to have significant expression variability within the two groups. While the Cuffdiff manual seemed to imply that one should use the "," notation to separate technical samples, we used this notation for the several biological replicates available for each of the two conditions.
The result was shocking: there was no gene/transcript that had a non-adjusted p-value < 0.07! This result simply does not make sense, especially since we have prior information about the used samples in terms of expression (we assayed them with microarrays previously and saw a lot of differential expression).
Therefore, I believe that we did not use Cuffdiff properly and I am trying to figure out what to do differently to use Cuffdiff with biological replicates. Or maybe Cuffdiff is not really meant for this type of analysis and we should just go with gene-centric differential expression analyses, for example DESeq?
Please advise if you have a better understanding of what is going on.
Thank you, Alexandra
#13 |
Member
Location: Orange County Join Date: Oct 2010
Posts: 11
Hi all,
I had exactly the same problems you described. The expression plot I got looks weird: the histograms of the four samples are stacked one on top of the other (with confusing colors), while in the manual there were N (N = number of replicates) histograms for each gene. How is it possible to get a plot similar to the one in the manual? Thanks, Fed
#14 | |
Junior Member
Location: Madison - Wisconsin Join Date: Nov 2012
Posts: 5
Alexandra,
Did you find a good answer to your question? I want to compare 8 RNA-seq samples for treatment A with 8 RNA-seq samples for treatment B. I am performing these analyses using Cuffdiff and also edgeR. Surprisingly, using Cuffdiff, I do not find any gene with a nominal P-value < 0.01. However, using edgeR, I find more than 200 genes with FDR < 0.05. Does anyone have any idea what is going on? Any comments or suggestions are welcome.
Quote:
#15 |
Member
Location: Cambridge, MA Join Date: Jan 2010
Posts: 27
Hi fpenagarican,
Unfortunately, I did not receive any relevant answer to my question. Still, I can confirm your results - with Cufflinks (and the corresponding Cuffdiff) v.2.0.0, there were no differentially expressed genes/transcripts in our analyses, while the comparison performed with DESeq showed a fair number of significant genes. In addition, Cuffdiff required a long time to run, which did not help us troubleshoot the problem. Since we wanted to use covariates for our analyses, we ended up focusing our efforts on DESeq (I think edgeR can also accommodate covariates), although this meant we could only perform a gene-centric initial analysis. With the updated versions of Cufflinks, things might be different - this is something I will need to work on in the near future. Sorry I could not help more. If you find additional details, please post them to this thread. Alexandra |
#16 |
Member
Location: Berlin Join Date: Oct 2010
Posts: 71
Hi Alexandra,
Just to show that you are not alone with your experiences:
cufflinks 2.0.1 http://seqanswers.com/forums/showthread.php?t=21824
cufflinks 1.3 http://seqanswers.com/forums/showthread.php?t=21020
I'm currently running another analysis using cufflinks 2.0.2, but I do not expect the results to differ. Therefore, I am in the process of switching to edgeR and DEXSeq for DE and AS analysis, which already produces results close to the ones obtained via CLC Bio. I'm only using cufflinks for transcript assembly of certain regions now, as in my opinion the abundance estimation is not usable for larger sample groups.
Cheers
Tags |
bioconductor, cuffdiff, cufflinks, cummerbund, rna-seq |