SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
running Cuffdiff with biological replicates Jane M Bioinformatics 13 07-15-2013 07:55 AM
using Cuffdiff with biological replicates Jane M RNA Sequencing 0 09-01-2011 12:42 AM
overdispersion and biological replicates shilez Bioinformatics 3 08-29-2011 07:43 AM
overdispersion and biological replicates shilez RNA Sequencing 0 08-25-2011 06:10 PM
cuffdiff with biological replicates PFS Bioinformatics 1 06-14-2011 06:51 PM

Reply
 
Thread Tools
Old 12-29-2011, 01:15 PM   #1
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 112
Question Biological replicates with cuffdiff, cummeRbund

I'm using cuffdiff for a differential expression analysis and the cummeRbund R package for followup analysis/visualization. I have 5 biological replicates for control, 3 biological replicates for mutant condition. Based on the cuffdiff documentation I ran the following cuffdiff command:

Code:
cuffdiff cuffcmp.combined.gtf \
c1.bam,c2.bam,c3.bam,c4.bam,c5.bam \
m1.bam,m2.bam,m3.bam
(I forgot to use the -L option to give my groups labels.)

My question is, is this the proper way to specify biological replicates? The documentation seems to suggest that each comma separated file is a technical replicate of the same sample, but it isn't clear.

Later on, when I went to use cummeRbund to do some visualization (e.g. boxplots, heatmaps), I'm only getting results for two samples, like it combined the FPKM values across all the alignments for each comma separated list.

Code:
library(cummeRbund)
cuff <- readCufflinks()

#make boxplot
csBoxplot(genes(cuff))

#get the top 100 diff expr genes
gene.diff <- diffData(genes(cuff))
gene.diff.top <- gene.diff[order(gene.diff$q_value),][1:100,]

# gene ids of top 100 diff expr genes
myGeneIds <- gene.diff.top$gene_id

# get genes
myGenes <- getGenes(cuff, myGeneIds)

# make a heatmap
csHeatmap(myGenes, cluster="both")
I would expect to have 8 boxplots, one for each sample, and 8 rows in the heatmap, one for each sample. This makes me think the way I'm specifying replicates is asking cuffdiff to treat each item in the comma-separated list as a technical, not biological, replicate.





Thanks in advance!
turnersd is offline   Reply With Quote
Old 12-29-2011, 11:23 PM   #2
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default

I have understood the cuffdiff documentation the same way - each comma separated list of files represent replicates (not necessarily technical) of a sample group.

To me, cummerbund is giving you what I would expect it to - a box plot per condition that you input to cuffdiff and an indication of the variation within each of those groups. If you want to see the variation between the individual replicates I think you will need to run cuffdiff again treating each replicate as a separate sample. This meakes sense when you look at the output files produced by cuffdiff as these are what cummerbund is reading in.
natstreet is offline   Reply With Quote
Old 12-30-2011, 02:51 AM   #3
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 112
Default

@natstreet: Thanks. I believe I can get what I want by doing as you say - cuffdiff with each bam file separate. I had in mind something that I would get from arrayQualityMetrics for array data - distribution info for each sample.
turnersd is offline   Reply With Quote
Old 12-30-2011, 11:13 AM   #4
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

You are correct about the output of cuffdiff and cummeRbund. cuffdiff takes the replicate information and uses this in it's modeling of the sequencing data to more accurately represent the FPKM and confidence intervals for an entire condition (not sample). In the case of biological replicates (which I might add, you are adding correctly in the cuffdiff arguments) the resulting conf_hi and conf_lo intervals represent biological variability for FPKM values from a given condition. In the absense of biological replicates (in which case the boxplot and heatmap would show you individual samples) the conf_hi and conf_lo values would represent estimate of the variability of the given gene/feature based on the fitting of the model, but would not have any measure of biological variability. Hope this helps...

FYI, I am interested in generating/aggregating sample-specific statistics to present to the user in future versions of cummeRbund to try and include more QC-type information.

Cheers,
Loyal
lgoff is offline   Reply With Quote
Old 01-03-2012, 12:23 PM   #5
Carlos Borroto
Member
 
Location: Baltimore, MD

Join Date: Mar 2011
Posts: 19
Default

Quote:
Originally Posted by lgoff View Post
FYI, I am interested in generating/aggregating sample-specific statistics to present to the user in future versions of cummeRbund to try and include more QC-type information.
This would be great, at the moment I'm running cufflinks twice, once with biological replicates as a group, what I really want, and then with each sample as a group to be able to do QC.

Thanks,
Carlos
Carlos Borroto is offline   Reply With Quote
Old 01-03-2012, 12:37 PM   #6
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 112
Default

Quote:
Originally Posted by Carlos Borroto View Post
This would be great, at the moment I'm running cufflinks twice, once with biological replicates as a group, what I really want, and then with each sample as a group to be able to do QC.

Thanks,
Carlos
Yes, I'll second that. I'm doing the same thing now.
turnersd is offline   Reply With Quote
Old 03-05-2012, 09:56 AM   #7
vyellapa
Member
 
Location: phoenix

Join Date: Oct 2011
Posts: 59
Default Cuff-diff with different samples as replicates

Code:
cuffdiff ref.gtf \
c1.bam,c2.bam,c3.bam,c4.bam \
m1.bam,m2.bam,m3.bam,m4.bam
Some of the groups working with cuff-diff are using using samples in sub-phenotype as biological replicates. Is this acceptable?
For example, if AMD wet and dry are two sub-phenotypes of AMD, and there are 4 aligned bams from different individuals available for each sub-phenotype, would the above code work. In other words, will the above code produce differential expression values between the two sub-phenotypes?
vyellapa is offline   Reply With Quote
Old 05-22-2012, 08:49 AM   #8
dejavu2010
Member
 
Location: usa

Join Date: Jan 2012
Posts: 21
Default

I had the same question. First time, I treated 3 biological replicates as a group with "," to separate, with -L KO, WT. Second time, I treated every single sample with a condition, but labeled -L KO, KO, KO, WT, WT, WT.
The results are very different. For the KO gene in gene_exp.diff from the first analysis, it has 2 rows (2 different ids, XLOC_016359 and XLOC_016932), no significance. The second analysis, it has 30 rows (the same 2 ids repeated 15 times). It did 15 comparisons, KO1 vs KO2...WT2 vs WT3. all the comparisons between KO and WT are significant (supposed to be, right since it is KO). I wonder how cuffdiff works?
In the first setting, -L KO, WT ko1,ko2,ko3 wt1,wt2,wt3, did it only take the ko1 and ko2? Probably not, as when i compared values (value1 and 2) to the second analysis, no matches. In the second analysis, it seemed right and all comparsions are between 2 samples. Should I take the value from single sample, treat it as readout and do some statistics on those? like, median, rank test, p value, q- value?

I wonder the difference from 2 analysis, which one is more accurate?
dejavu2010 is offline   Reply With Quote
Old 06-04-2012, 01:26 AM   #9
zhuya5607
Member
 
Location: Stockholm

Join Date: Jun 2012
Posts: 18
Default strange Boxplot in CummeRbund

I got a weird thing with the Boxplot. No box with medians displayed, only 'dotsplot'. I tried with or without replicates= T option, both produced the exactly same image as the attached image. Weird, isn't it? at least show me the 'dotsplot' of the replicates. = =!
I have three replicates at four time points of a cell line after drug treatment,3+3+3+3. I ran cuffdiff in this way, T0(3 replicates at time point 0) versus T1(3 replicates), T0(3 replicates)versus T2(3 re), T0(3 replicates)versus T3(3 re). I did't use time series option, which seems a better way to do it after I checked on this forum.
I use CummeRbund analyze the three comparison seperately, all three boxplots I got are like the attached image without box and medians.

another problem with gene expression barplot of EGFR, see the attached file.

Can anyone tell me where is wrong or I did some mistakes? Thank you in advance!
Attached Images
File Type: png Rplot_H0-H24.png (13.4 KB, 70 views)
File Type: png Rplot-EGFR.png (7.8 KB, 73 views)
zhuya5607 is offline   Reply With Quote
Old 06-04-2012, 04:04 AM   #10
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

Quote:
Originally Posted by zhuya5607 View Post
I got a weird thing with the Boxplot. No box with medians displayed, only 'dotsplot'. I tried with or without replicates= T option, both produced the exactly same image as the attached image. Weird, isn't it? at least show me the 'dotsplot' of the replicates. = =!
I have three replicates at four time points of a cell line after drug treatment,3+3+3+3. I ran cuffdiff in this way, T0(3 replicates at time point 0) versus T1(3 replicates), T0(3 replicates)versus T2(3 re), T0(3 replicates)versus T3(3 re). I did't use time series option, which seems a better way to do it after I checked on this forum.
I use CummeRbund analyze the three comparison seperately, all three boxplots I got are like the attached image without box and medians.

another problem with gene expression barplot of EGFR, see the attached file.

Can anyone tell me where is wrong or I did some mistakes? Thank you in advance!
It looks like the boxes are there, but most of the signal is compressed around log fpkm of -4. And looking at the plot of EGFR, something is not correct with how the gene was quantified. I wonder if it is a malformed GTF file? Can you send me your gtf and the cuffData.db file created by cummeRbund via email, and I will try to take a look at it and see if I can figure out what is going on.

-Loyal
lgoff is offline   Reply With Quote
Old 06-04-2012, 05:44 AM   #11
zhuya5607
Member
 
Location: Stockholm

Join Date: Jun 2012
Posts: 18
Default

Quote:
Originally Posted by lgoff View Post
It looks like the boxes are there, but most of the signal is compressed around log fpkm of -4. And looking at the plot of EGFR, something is not correct with how the gene was quantified. I wonder if it is a malformed GTF file? Can you send me your gtf and the cuffData.db file created by cummeRbund via email, and I will try to take a look at it and see if I can figure out what is going on.

-Loyal
Thank you, Loyal!
I sent you a message, Please check.
zhuya5607 is offline   Reply With Quote
Old 06-19-2012, 01:28 PM   #12
adumitri
Member
 
Location: Cambridge, MA

Join Date: Jan 2010
Posts: 27
Default

Hi,

I was wondering if anyone had more thoughts about dejavu2010's comment... It is still not clear to me if Cuffdiff is built to analyze differential expression for groups of biological replicates.

More specifically:
We are interested in comparing a group of 10 RNA-Seq samples from healthy individuals with a group of 10 RNA-Seq samples from diseased ones (in the initial phase without accounting for covariates). We do expect the samples to have significant expression variability within the two groups. While the Cuffdiff manual seemed to imply that one should use the "," notation to separate technical samples, we used this notation for the several biological replicates available for each of the two conditions.
The result was shocking - there was no gene/transcript that had a non-adjusted p-value < 0.07! This result simply does not make sense, especially since we have prior information about the used samples in terms of expression (we assayed them with microarrays previously and saw a lot of differential expression). Therefore, I believe that we did not use Cuffdiff properly and I am trying to figure out what to do differently to use Cuffdiff with biological replicates. Or maybe Cuffdiff is not really meant for this type of analysis and we should just go with gene-centric differential expression analyses, for example DESeq?

Please advise if you have a better understanding of what is going on.

Thank you,
Alexandra
adumitri is offline   Reply With Quote
Old 07-20-2012, 10:17 AM   #13
ftorri
Member
 
Location: Orange County

Join Date: Oct 2010
Posts: 11
Default

Hi all,

I had exactly the same problems you described.
The expression plot I gor looks weird: the histograms of the four samples are stacked one on the top of the other (with confusing colors), while on the manual there were N (N=number of replicates) histograms for each gene.

How is possible to get a plot similar to the one on the manual?

Thanks,
Fed
ftorri is offline   Reply With Quote
Old 11-17-2012, 01:31 PM   #14
fpenagarican
Junior Member
 
Location: Madison - Wisconsin

Join Date: Nov 2012
Posts: 5
Default

Alexandra,

Did you find any good answer for your question?

I want to compare 8 RNA-seq samples for treatment A with 8 RNA-seq samples for treatment B. I am performing these analyses using Cuffidiff and also edgeR. Surprisingly, using Cuffdiff, I do not find any gene with a nominal P-value < 0.01. However, using edgeR, I find more than 200 genes with FDR < 0.05.

Does someone have any idea of what is going on?

Any comments or suggestions are welcome.

Quote:
Originally Posted by adumitri View Post
Hi,

I was wondering if anyone had more thoughts about dejavu2010's comment... It is still not clear to me if Cuffdiff is built to analyze differential expression for groups of biological replicates.

More specifically:
We are interested in comparing a group of 10 RNA-Seq samples from healthy individuals with a group of 10 RNA-Seq samples from diseased ones (in the initial phase without accounting for covariates). We do expect the samples to have significant expression variability within the two groups. While the Cuffdiff manual seemed to imply that one should use the "," notation to separate technical samples, we used this notation for the several biological replicates available for each of the two conditions.
The result was shocking - there was no gene/transcript that had a non-adjusted p-value < 0.07! This result simply does not make sense, especially since we have prior information about the used samples in terms of expression (we assayed them with microarrays previously and saw a lot of differential expression). Therefore, I believe that we did not use Cuffdiff properly and I am trying to figure out what to do differently to use Cuffdiff with biological replicates. Or maybe Cuffdiff is not really meant for this type of analysis and we should just go with gene-centric differential expression analyses, for example DESeq?

Please advise if you have a better understanding of what is going on.

Thank you,
Alexandra
fpenagarican is offline   Reply With Quote
Old 11-19-2012, 05:32 AM   #15
adumitri
Member
 
Location: Cambridge, MA

Join Date: Jan 2010
Posts: 27
Default

Hi fpenagarican,

Unfortunately, I did not receive any relevant answer to my question. Still, I can confirm your results - with Cufflinks (and the corresponding Cuffdiff) v.2.0.0, there were no differentially expressed genes/transcripts in our analyses, while the comparison performed with DESeq showed a fair number of significant genes. In addition, Cuffdiff required a long time to run, which did not help us troubleshoot the problem.

Since we wanted to use covariates for our analyses, we ended up focusing our efforts on DESeq (I think edgeR can also accommodate covariates), although this meant we could only perform a gene-centric initial analysis. With the updated versions of Cufflinks, things might be different - this is something I will need to work on in the near future.

Sorry I could not help more. If you find additional details, please post them to this thread.

Alexandra
adumitri is offline   Reply With Quote
Old 11-19-2012, 05:59 AM   #16
rboettcher
Member
 
Location: Berlin

Join Date: Oct 2010
Posts: 71
Default

Hi Alexandra,

Just to show that you are not alone with your experiences:

cufflinks 2.0.1
http://seqanswers.com/forums/showthread.php?t=21824

cufflinks 1.3
http://seqanswers.com/forums/showthread.php?t=21020

I'm currently running another analysis using cufflinks 2.0.2, but I do not expect the results to differ.
Therefore, I am in the process of switching to edgeR and DEXSeq for DE and AS analysis which already produces results close to the ones obtained via CLC Bio.

I'm only using cufflinks for transcript assembly of certain regions now, as in my opinion the abundance estimation is not usable for larger sample groups.

Cheers
rboettcher is offline   Reply With Quote
Reply

Tags
bioconductor, cuffdiff, cufflinks, cummerbund, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:23 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO