Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Biological replicates with cuffdiff, cummeRbund

    I'm using cuffdiff for a differential expression analysis and the cummeRbund R package for followup analysis/visualization. I have 5 biological replicates for control, 3 biological replicates for mutant condition. Based on the cuffdiff documentation I ran the following cuffdiff command:

    Code:
    cuffdiff cuffcmp.combined.gtf \
    c1.bam,c2.bam,c3.bam,c4.bam,c5.bam \
    m1.bam,m2.bam,m3.bam
    (I forgot to use the -L option to give my groups labels.)

    My question is, is this the proper way to specify biological replicates? The documentation seems to suggest that each comma separated file is a technical replicate of the same sample, but it isn't clear.

    Later on, when I went to use cummeRbund to do some visualization (e.g. boxplots, heatmaps), I'm only getting results for two samples, like it combined the FPKM values across all the alignments for each comma separated list.

    Code:
    library(cummeRbund)
    cuff <- readCufflinks()
    
    #make boxplot
    csBoxplot(genes(cuff))
    
    #get the top 100 diff expr genes
    gene.diff <- diffData(genes(cuff))
    gene.diff.top <- gene.diff[order(gene.diff$q_value),][1:100,]
    
    # gene ids of top 100 diff expr genes
    myGeneIds <- gene.diff.top$gene_id
    
    # get genes
    myGenes <- getGenes(cuff, myGeneIds)
    
    # make a heatmap
    csHeatmap(myGenes, cluster="both")
    I would expect to have 8 boxplots, one for each sample, and 8 rows in the heatmap, one for each sample. This makes me think the way I'm specifying replicates is asking cuffdiff to treat each item in the comma-separated list as a technical, not biological, replicate.





    Thanks in advance!

  • #2
    I have understood the cuffdiff documentation the same way - each comma separated list of files represent replicates (not necessarily technical) of a sample group.

    To me, cummerbund is giving you what I would expect it to - a box plot per condition that you input to cuffdiff and an indication of the variation within each of those groups. If you want to see the variation between the individual replicates I think you will need to run cuffdiff again treating each replicate as a separate sample. This meakes sense when you look at the output files produced by cuffdiff as these are what cummerbund is reading in.

    Comment


    • #3
      @natstreet: Thanks. I believe I can get what I want by doing as you say - cuffdiff with each bam file separate. I had in mind something that I would get from arrayQualityMetrics for array data - distribution info for each sample.

      Comment


      • #4
        You are correct about the output of cuffdiff and cummeRbund. cuffdiff takes the replicate information and uses this in it's modeling of the sequencing data to more accurately represent the FPKM and confidence intervals for an entire condition (not sample). In the case of biological replicates (which I might add, you are adding correctly in the cuffdiff arguments) the resulting conf_hi and conf_lo intervals represent biological variability for FPKM values from a given condition. In the absense of biological replicates (in which case the boxplot and heatmap would show you individual samples) the conf_hi and conf_lo values would represent estimate of the variability of the given gene/feature based on the fitting of the model, but would not have any measure of biological variability. Hope this helps...

        FYI, I am interested in generating/aggregating sample-specific statistics to present to the user in future versions of cummeRbund to try and include more QC-type information.

        Cheers,
        Loyal

        Comment


        • #5
          Originally posted by lgoff View Post
          FYI, I am interested in generating/aggregating sample-specific statistics to present to the user in future versions of cummeRbund to try and include more QC-type information.
          This would be great, at the moment I'm running cufflinks twice, once with biological replicates as a group, what I really want, and then with each sample as a group to be able to do QC.

          Thanks,
          Carlos

          Comment


          • #6
            Originally posted by Carlos Borroto View Post
            This would be great, at the moment I'm running cufflinks twice, once with biological replicates as a group, what I really want, and then with each sample as a group to be able to do QC.

            Thanks,
            Carlos
            Yes, I'll second that. I'm doing the same thing now.

            Comment


            • #7
              Cuff-diff with different samples as replicates

              Code:
              cuffdiff ref.gtf \
              c1.bam,c2.bam,c3.bam,c4.bam \
              m1.bam,m2.bam,m3.bam,m4.bam
              Some of the groups working with cuff-diff are using using samples in sub-phenotype as biological replicates. Is this acceptable?
              For example, if AMD wet and dry are two sub-phenotypes of AMD, and there are 4 aligned bams from different individuals available for each sub-phenotype, would the above code work. In other words, will the above code produce differential expression values between the two sub-phenotypes?

              Comment


              • #8
                I had the same question. First time, I treated 3 biological replicates as a group with "," to separate, with -L KO, WT. Second time, I treated every single sample with a condition, but labeled -L KO, KO, KO, WT, WT, WT.
                The results are very different. For the KO gene in gene_exp.diff from the first analysis, it has 2 rows (2 different ids, XLOC_016359 and XLOC_016932), no significance. The second analysis, it has 30 rows (the same 2 ids repeated 15 times). It did 15 comparisons, KO1 vs KO2...WT2 vs WT3. all the comparisons between KO and WT are significant (supposed to be, right since it is KO). I wonder how cuffdiff works?
                In the first setting, -L KO, WT ko1,ko2,ko3 wt1,wt2,wt3, did it only take the ko1 and ko2? Probably not, as when i compared values (value1 and 2) to the second analysis, no matches. In the second analysis, it seemed right and all comparsions are between 2 samples. Should I take the value from single sample, treat it as readout and do some statistics on those? like, median, rank test, p value, q- value?

                I wonder the difference from 2 analysis, which one is more accurate?

                Comment


                • #9
                  strange Boxplot in CummeRbund

                  I got a weird thing with the Boxplot. No box with medians displayed, only 'dotsplot'. I tried with or without replicates= T option, both produced the exactly same image as the attached image. Weird, isn't it? at least show me the 'dotsplot' of the replicates. = =!
                  I have three replicates at four time points of a cell line after drug treatment,3+3+3+3. I ran cuffdiff in this way, T0(3 replicates at time point 0) versus T1(3 replicates), T0(3 replicates)versus T2(3 re), T0(3 replicates)versus T3(3 re). I did't use time series option, which seems a better way to do it after I checked on this forum.
                  I use CummeRbund analyze the three comparison seperately, all three boxplots I got are like the attached image without box and medians.

                  another problem with gene expression barplot of EGFR, see the attached file.

                  Can anyone tell me where is wrong or I did some mistakes? Thank you in advance!
                  Attached Files

                  Comment


                  • #10
                    Originally posted by zhuya5607 View Post
                    I got a weird thing with the Boxplot. No box with medians displayed, only 'dotsplot'. I tried with or without replicates= T option, both produced the exactly same image as the attached image. Weird, isn't it? at least show me the 'dotsplot' of the replicates. = =!
                    I have three replicates at four time points of a cell line after drug treatment,3+3+3+3. I ran cuffdiff in this way, T0(3 replicates at time point 0) versus T1(3 replicates), T0(3 replicates)versus T2(3 re), T0(3 replicates)versus T3(3 re). I did't use time series option, which seems a better way to do it after I checked on this forum.
                    I use CummeRbund analyze the three comparison seperately, all three boxplots I got are like the attached image without box and medians.

                    another problem with gene expression barplot of EGFR, see the attached file.

                    Can anyone tell me where is wrong or I did some mistakes? Thank you in advance!
                    It looks like the boxes are there, but most of the signal is compressed around log fpkm of -4. And looking at the plot of EGFR, something is not correct with how the gene was quantified. I wonder if it is a malformed GTF file? Can you send me your gtf and the cuffData.db file created by cummeRbund via email, and I will try to take a look at it and see if I can figure out what is going on.

                    -Loyal

                    Comment


                    • #11
                      Originally posted by lgoff View Post
                      It looks like the boxes are there, but most of the signal is compressed around log fpkm of -4. And looking at the plot of EGFR, something is not correct with how the gene was quantified. I wonder if it is a malformed GTF file? Can you send me your gtf and the cuffData.db file created by cummeRbund via email, and I will try to take a look at it and see if I can figure out what is going on.

                      -Loyal
                      Thank you, Loyal!
                      I sent you a message, Please check.

                      Comment


                      • #12
                        Hi,

                        I was wondering if anyone had more thoughts about dejavu2010's comment... It is still not clear to me if Cuffdiff is built to analyze differential expression for groups of biological replicates.

                        More specifically:
                        We are interested in comparing a group of 10 RNA-Seq samples from healthy individuals with a group of 10 RNA-Seq samples from diseased ones (in the initial phase without accounting for covariates). We do expect the samples to have significant expression variability within the two groups. While the Cuffdiff manual seemed to imply that one should use the "," notation to separate technical samples, we used this notation for the several biological replicates available for each of the two conditions.
                        The result was shocking - there was no gene/transcript that had a non-adjusted p-value < 0.07! This result simply does not make sense, especially since we have prior information about the used samples in terms of expression (we assayed them with microarrays previously and saw a lot of differential expression). Therefore, I believe that we did not use Cuffdiff properly and I am trying to figure out what to do differently to use Cuffdiff with biological replicates. Or maybe Cuffdiff is not really meant for this type of analysis and we should just go with gene-centric differential expression analyses, for example DESeq?

                        Please advise if you have a better understanding of what is going on.

                        Thank you,
                        Alexandra

                        Comment


                        • #13
                          Hi all,

                          I had exactly the same problems you described.
                          The expression plot I gor looks weird: the histograms of the four samples are stacked one on the top of the other (with confusing colors), while on the manual there were N (N=number of replicates) histograms for each gene.

                          How is possible to get a plot similar to the one on the manual?

                          Thanks,
                          Fed

                          Comment


                          • #14
                            Alexandra,

                            Did you find any good answer for your question?

                            I want to compare 8 RNA-seq samples for treatment A with 8 RNA-seq samples for treatment B. I am performing these analyses using Cuffidiff and also edgeR. Surprisingly, using Cuffdiff, I do not find any gene with a nominal P-value < 0.01. However, using edgeR, I find more than 200 genes with FDR < 0.05.

                            Does someone have any idea of what is going on?

                            Any comments or suggestions are welcome.

                            Originally posted by adumitri View Post
                            Hi,

                            I was wondering if anyone had more thoughts about dejavu2010's comment... It is still not clear to me if Cuffdiff is built to analyze differential expression for groups of biological replicates.

                            More specifically:
                            We are interested in comparing a group of 10 RNA-Seq samples from healthy individuals with a group of 10 RNA-Seq samples from diseased ones (in the initial phase without accounting for covariates). We do expect the samples to have significant expression variability within the two groups. While the Cuffdiff manual seemed to imply that one should use the "," notation to separate technical samples, we used this notation for the several biological replicates available for each of the two conditions.
                            The result was shocking - there was no gene/transcript that had a non-adjusted p-value < 0.07! This result simply does not make sense, especially since we have prior information about the used samples in terms of expression (we assayed them with microarrays previously and saw a lot of differential expression). Therefore, I believe that we did not use Cuffdiff properly and I am trying to figure out what to do differently to use Cuffdiff with biological replicates. Or maybe Cuffdiff is not really meant for this type of analysis and we should just go with gene-centric differential expression analyses, for example DESeq?

                            Please advise if you have a better understanding of what is going on.

                            Thank you,
                            Alexandra

                            Comment


                            • #15
                              Hi fpenagarican,

                              Unfortunately, I did not receive any relevant answer to my question. Still, I can confirm your results - with Cufflinks (and the corresponding Cuffdiff) v.2.0.0, there were no differentially expressed genes/transcripts in our analyses, while the comparison performed with DESeq showed a fair number of significant genes. In addition, Cuffdiff required a long time to run, which did not help us troubleshoot the problem.

                              Since we wanted to use covariates for our analyses, we ended up focusing our efforts on DESeq (I think edgeR can also accommodate covariates), although this meant we could only perform a gene-centric initial analysis. With the updated versions of Cufflinks, things might be different - this is something I will need to work on in the near future.

                              Sorry I could not help more. If you find additional details, please post them to this thread.

                              Alexandra

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X