  • DESeq not giving any differentially expressed genes in RNA seq

    Hello All (and hopefully Simon Anders),

    I've been using DESeq for testing for differential expression between samples. It's always worked great, but for some reason, I don't think it's working correctly for this one data set I have right now. Here's what I'm doing.

    I have three biological samples (CD90, UtESC-, and UtESC+) with three replicates each. I align these to the mouse genome using TopHat, allowing only unique hits. I then use HTSeq with the Ensembl GTF file to build my gene count matrix, and run this matrix through DESeq to obtain the size factors, variances, and p-values for each pairwise comparison: CD90 vs UtESC-, CD90 vs UtESC+, and UtESC- vs UtESC+.

    Here are the size factors (they vary more than in most of my data sets, but I don't think that explains what is happening here):

    > sizeFactors(cds)
    CD901 CD902 CD903 UtESC.1 UtESC.2 UtESC.3 UtESC.1.1 UtESC.2.1
    0.9780371 1.3965205 1.3044467 1.7915295 0.3420241 0.6416566 0.5929506 2.4919984
    UtESC.3.1
    1.0039902

    Here is the plot of variance (the dispersion plot): [image not shown]

    Seems okay so far. However, when I calculate the p-values and fold changes, I get very strange results. The fold changes seem wrong (they do not match up with the RPKM values I calculated), and in the last comparison (UtESC+ vs UtESC-) almost all adjusted p-values are 1, with none below 0.05, while only about 200 of the unadjusted p-values are less than 0.01. I've asked around, searched SEQanswers, and looked over my code many times, but I can't find what I'm doing wrong (if anything). Does anybody have any troubleshooting suggestions? For reference, here is my code:

    library("DESeq")
    # read the HTSeq count matrix (genes x samples)
    count_table <- read.table("UtESC_gene_array.txt", header=TRUE, row.names=1)
    # sample annotation (not used further below; the conds factor drives the test)
    design <- data.frame(row.names = colnames(count_table),
                         condition = c("CD901", "CD902", "CD903", "UtESC-1", "UtESC-2",
                                       "UtESC-3", "UtESC+1", "UtESC+2", "UtESC+3"),
                         libType   = rep("single-end", 9))
    # one condition label per replicate: CD90 = CD, UtESC- = UM, UtESC+ = UP
    conds <- factor( c("CD", "CD", "CD", "UM", "UM", "UM", "UP", "UP", "UP") )
    cds <- newCountDataSet( count_table, conds )
    cds <- estimateSizeFactors( cds )
    cds <- estimateDispersions( cds, fitType="local" )
    # combine the three pairwise tests (CD vs UM, CD vs UP, UM vs UP), keeping selected columns
    res <- cbind( nbinomTest( cds, "CD", "UM" )[, c(1,6,8)],
                  nbinomTest( cds, "CD", "UP" )[, c(6,8)],
                  nbinomTest( cds, "UM", "UP" )[, c(6,8)] )
    write.table(res, file="UtESC_take2_pvals.txt", append=FALSE, quote=FALSE, sep="\t")

    If you would like a copy of the data for yourself to try, you can grab it here: http://pellegrini.mcdb.ucla.edu/Artur/UtESC_gene_array.txt

    Thank you all in advance for your help!

    Best,
    Artur
    Last edited by Artur Jaroszewicz; 05-03-2012, 11:00 AM.

  • #2
    What do your MA plots look like? Those would reveal if the normalization of library sizes went wonky.
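
    As a minimal sketch, an MA plot for one comparison can be drawn directly from the nbinomTest() output in the first post (the baseMean and log2FoldChange columns of the standard DESeq result table are assumed):

    Code:
    # MA plot for the CD vs UM comparison: mean normalized count vs. log2 fold change
    res_cd_um <- nbinomTest(cds, "CD", "UM")
    plot(log2(res_cd_um$baseMean + 1), res_cd_um$log2FoldChange, pch=".",
         xlab="log2 mean of normalized counts", ylab="log2 fold change")
    abline(h=0, col="blue")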

    Sometimes when I think the output of a tool looks weird I'll try another tool. Give edgeR a shot and see what the results look like.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */



    • #3
      It seems to look pretty good: [MA plot image not shown]

      I'll give edgeR a shot. Unfortunately, I have my entire pipeline set up with DESeq, so I'd rather keep it consistent across analyses.



      • #4
        Dear Artur,

        For sure, "unfortunately" is not the right sentiment here.
        Anyway, can you produce MA-plots like the one you did above for every pair of samples (i.e. for 9 choose 2 = 36 pairs)? The R function
        Code:
         pairs(log2(exprs(cds)), pch=".")
        already does most of that. If you are comfortable with R, you can add a function as the panel argument to do the 45-degree rotation and add the line at y=0.

        To further explore data quality (and perhaps identify samples that went wrong, and are destroying your power to detect differential expression), try

        Code:
        library("arrayQualityMetrics")
        vst = varianceStabilizingTransformation(cds)
        arrayQualityMetrics(vst, intgroup="condition")
        Please also try method = "per-condition" in the call to estimateDispersions; this could be less conservative if one of the conditions is (for some reason) a lot more noisy than the others.
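
        As a minimal sketch, using the cds object from the code in the first post:

        Code:
        # re-estimate dispersions separately for each condition
        cds <- estimateDispersions(cds, method="per-condition", fitType="local")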

        PS: The URL http://pelegrini.mcdb.ucla.edu/Artur...gene_array.txt unfortunately did not work when I tried it ("Browsername can't find the server at pelegrini.mcdb.ucla.edu"). Could it be within your intranet?
        Wolfgang Huber
        EMBL



        • #5
          I thought I'd try those snippets myself and I got some errors. Here's my session info:

          Code:
          > sessionInfo()
          R version 2.15.0 (2012-03-30)
          Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
          I do not have the function "varianceStabilizingTransformation" and "exprs(cds)" returns an error:

          > exprs(cds)
          Error in function (classes, fdef, mtable) :
          unable to find an inherited method for function "exprs", for signature "CountDataSet"
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */



          • #6
            You could try SAMseq and limma's voom(), which are both capable of handling multi-group designs.

            SAMseq (from the samr R package):

            Code:
            library(samr)
            x <- your_count_matrix                    # genes x samples count matrix
            y <- c(rep(1, 3), rep(2, 3), rep(3, 3))   # group labels for the 9 samples
            samfit <- SAMseq(x, y, resp.type = "Multiclass")

            limma R package:

            Code:
            library(limma)
            library(edgeR)                          # for calcNormFactors()
            Counts <- your_count_matrix
            nf <- calcNormFactors(Counts)           # normalization factors (edgeR)

            # targets$Target holds the group label of each sample
            f <- factor(targets$Target, levels = c("RNA1", "RNA2", "RNA3"))
            design <- model.matrix(~0 + f)
            colnames(design) <- c("RNA1", "RNA2", "RNA3")
            y <- voom(Counts, design, plot = TRUE, lib.size = colSums(Counts) * nf)

            # pairwise comparisons between the three groups
            fit <- lmFit(y, design)
            contrast.matrix <- makeContrasts(RNA2-RNA1, RNA3-RNA2, RNA3-RNA1,
                                             levels = design)
            fit2 <- contrasts.fit(fit, contrast.matrix)
            fit2 <- eBayes(fit2)

            # e.g. RNA2 versus RNA1
            topTable(fit2, coef = 1, adjust = "BH")



            • #7
              Judging from your dispersion plot, you have many genes with really high variance. Remember that a dispersion of 0.1 means that the gene typically varies by sqrt(0.1) ≈ 30% between replicates, and then, to get significance, the change between conditions should be much larger.

              Take a few of these genes with strong log2FoldChange and look at the normalized counts. Do you really see something, or is the variation within groups as strong as between groups?
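
              A minimal sketch of that check, assuming the cds object and condition labels from the first post:

              Code:
              # genes with the largest absolute fold change in the UM vs UP comparison
              res_um_up <- nbinomTest(cds, "UM", "UP")
              top <- head(res_um_up$id[order(-abs(res_um_up$log2FoldChange))], 10)
              # their size-factor-normalized counts, replicate by replicate
              counts(cds, normalized=TRUE)[top, ]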

              (You know, it is entirely possible, of course, that DESeq did not work well on your data, but it is as likely that it is simply right, and the signal-to-noise ratio of your experiment is too bad to allow inference, i.e., that you simply don't have anything.)



              • #8
                I am not an expert here, so take my word with a grain of salt, but I am currently looking at some cell lines that were sorted based on their expression of a single gene. There is a lot of variance between the cell lines, but there are also real differentially expressed genes between the two groups.

                From the DESeq vignette:
                As we estimated the dispersion from only two samples, we should expect the estimates to scatter with quite some sampling variance around their true values. Hence, DESeq should not use the per-gene estimates directly in the test, because using too low dispersion values leads to false positives. Many of the values below the red line are likely to be underestimates of the true dispersions, and hence, it is prudent to instead rather use the fitted value. On the other hand, not all of the values above the red line are overestimations, and hence, the conservative choice is to keep them instead of replacing them with their fitted values. If you do not like this default behaviour, you can change it with the option sharingMode of estimateDispersions. Note that DESeq originally (as described in [1]) only used the fitted values (sharingMode="fit-only"). The current default (sharingMode="maximum") is more conservative.
                I was getting no differentially expressed genes, but when I changed to sharingMode="fit-only" I got some. We have 8 biological replicates for each condition, so I think this is a reasonable thing to do. My guess is that the whole analysis was being thrown off by the high variance of a lot of other genes. The risk is that the less stringent variance calculation is giving us false positives.
                --------------
                Ethan



                • #9
                  Originally posted by ETHANol View Post
                  I was getting no differentially expressed genes, but when I changed to sharingMode="fit-only" I got some. We have 8 biological replicates for each condition, so I think this is a reasonable thing to do.
                  No, quite the opposite! If you have 8 replicates per condition, you have enough power to get a reliable variance estimate for each gene without having to rely on the other genes. So, switch to sharingMode="gene-est-only", and then high-variance genes will not reduce your power for any of the other genes.
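
                  A minimal sketch of the call being suggested, assuming the cds object from the first post:

                  Code:
                  # use the per-gene dispersion estimates directly (many replicates per condition)
                  cds <- estimateDispersions(cds, sharingMode="gene-est-only")
                  # the less conservative alternative discussed above:
                  # cds <- estimateDispersions(cds, sharingMode="fit-only")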



                  • #10
                    Thanks, Simon, for the correction; that's what I was trying to say, but my cluelessness resulted in what I wrote.
                    --------------
                    Ethan



                    • #11
                      Ok, that's good. Now you have 14 degrees of freedom for a pooled dispersion estimate (16 samples minus 2 conditions); Artur has only 6 DoF (9 samples minus 3 conditions). This is not quite enough to use sharingMode="gene-est-only", I would say. I'm afraid his problem really is that his signal-to-noise ratio is too poor. Artur, you may need to tell us more about your experiment.



                      • #12
                        Originally posted by sdriscoll View Post
                        I thought I'd try those snippets myself and I got some errors. I do not have the function "varianceStabilizingTransformation", and "exprs(cds)" returns an error:
                        Dear sdriscoll,

                        Thank you for checking and reporting these errors. My apologies. That comes from sending untested code snippets late at night. Here's an example that works:

                        Code:
                        library("arrayQualityMetrics")
                        library("DESeq")
                        library("pasilla")
                        
                        data("pasillaGenes")
                        
                        # MA-style pairs plot: for each pair of samples, plot the average
                        # of the two log2 counts (x) against their difference (y)
                        pairs(log2(counts(pasillaGenes)+1),
                          panel=function(x, y, ...) {
                            points((x+y)/2, y-x, pch=".")
                            abline(h=0, col="blue")
                        }, xlim=c(0, 16), ylim=c(-5,5))
                        
                        # blind dispersion estimation and variance-stabilizing transformation,
                        # followed by the arrayQualityMetrics quality report
                        cds = estimateDispersions(estimateSizeFactors(pasillaGenes), method="blind")
                        vst = DESeq:::varianceStabilizingTransformation(cds)
                        arrayQualityMetrics(vst, intgroup="condition",
                            reporttitle="Quality metrics report for pasillaGenes", force=TRUE)
                        
                        
                        
                        > sessionInfo()
                        R version 2.15.0 (2012-03-30)
                        Platform: x86_64-apple-darwin10.8.0/x86_64 (64-bit)
                        
                        locale:
                        [1] C
                        
                        attached base packages:
                        [1] stats     graphics  grDevices utils     datasets  methods   base     
                        
                        other attached packages:
                        [1] pasilla_0.2.11             DEXSeq_1.2.0              
                        [3] DESeq_1.9.5                locfit_1.5-8              
                        [5] Biobase_2.16.0             BiocGenerics_0.2.0        
                        [7] arrayQualityMetrics_3.12.0 fortunes_1.5-0            
                        
                        loaded via a namespace (and not attached):
                         [1] AnnotationDbi_1.18.0  BeadDataPackR_1.8.0   BiocInstaller_1.4.4  
                         [4] Biostrings_2.24.1     Cairo_1.5-1           DBI_0.2-5            
                         [7] Hmisc_3.9-3           IRanges_1.14.2        KernSmooth_2.23-7    
                        [10] RColorBrewer_1.0-5    RCurl_1.91-1          RSQLite_0.11.1       
                        [13] SVGAnnotation_0.9-0   XML_3.9-4             affy_1.34.0          
                        [16] affyPLM_1.32.0        affyio_1.24.0         annotate_1.34.0      
                        [19] beadarray_2.6.0       biomaRt_2.12.0        cluster_1.14.2       
                        [22] colorspace_1.1-1      genefilter_1.38.0     geneplotter_1.34.0   
                        [25] grid_2.15.0           hwriter_1.3           lattice_0.20-6       
                        [28] latticeExtra_0.6-19   limma_3.12.0          plyr_1.7.1           
                        [31] preprocessCore_1.18.0 reshape2_1.2.1        setRNG_2011.11-2     
                        [34] splines_2.15.0        statmod_1.4.14        stats4_2.15.0        
                        [37] stringr_0.6           survival_2.36-14      tools_2.15.0         
                        [40] vsn_3.24.0            xtable_1.7-0          zlibbioc_1.2.0
                        You will need DESeq>=1.9.4 for this (i.e. a version more recent than the last release of Bioconductor from April 2012). If you are using R 2.15, and are not normally subscribing to the devel branch of Bioconductor, then the best way to install DESeq is via

                        Code:
                        source("http://bioconductor.org/biocLite.R")
                        useDevel(TRUE)      # switch to the Bioconductor devel branch
                        biocLite("DESeq")
                        useDevel(FALSE)     # switch back to the release branch afterwards
                        The DESeq::: prefix in the call to varianceStabilizingTransformation above is only needed for DESeq <= 1.9.5; in more recent versions the function is properly exported.

                        I hope this helps, best wishes
                        Wolfgang
                        Wolfgang Huber
                        EMBL



                        • #13
                          Thanks for the explanation. Late-night code snippets... that's funny, and it happens to everyone who writes code.
                          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                          Salk Institute for Biological Studies, La Jolla, CA, USA */



                          • #14
                            Delivery of code snippets after midnight on SEQanswers. You deserve an award for that, or at least to have your next grant approved.
                            --------------
                            Ethan



                            • #15
                              Hi everyone,

                              Thank you so much for your input so far. Simon, I am beginning to think that the replicates may indeed be too noisy to infer anything. Still, it's strange.

                              I have a question for you about the way you determine fold change. I see that you estimate the total abundance of a given gene by dividing the raw reads by the size factors, then find the unweighted average of the replicates to obtain a mean level for each sample group. This is the only way to do the differential testing, from what I understand. However, it leads to some results that I'm not sure I fully grasp or agree with. You calculate fold change, etc., from these numbers; isn't it more accurate to take a weighted mean instead? In my analysis, it has caused some problems. I calculate RPKM by taking the total counts per gene for each sample and dividing by the total number of reads in the sample (and the gene length). This gives replicates with more reads more weight, which I think should be the case. Just FYI, my size factors here are:
                              CD901 0.9780371
                              CD902 1.3965205
                              CD903 1.3044467
                              UtESC.1 1.7915295
                              UtESC.2 0.3420241
                              UtESC.3 0.6416566
                              UtESC.1.1 0.5929506
                              UtESC.2.1 2.4919984
                              UtESC.3.1 1.0039902
                              So in the last sample, replicates 1 and 2 have more than a four-fold difference in size factors, yet the same weight in the differential test. Because of this discrepancy, I get strange artifacts for certain differentially expressed genes, which come out higher in sample 1 than in sample 2 with one method of estimation and lower with the other. Do you have any explanations or suggestions for what I should do differently?
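
                              A minimal sketch of the two estimators being contrasted here, with hypothetical counts for one gene and the UtESC+ size factors quoted above:

                              Code:
                              k  <- c(120, 45, 80)                       # hypothetical raw counts in the three UtESC+ replicates
                              sf <- c(0.5929506, 2.4919984, 1.0039902)   # their size factors
                              unweighted <- mean(k / sf)                 # unweighted average of normalized counts, as described above
                              weighted   <- sum(k) / sum(sf)             # pooled estimate that gives larger libraries more weight
                              c(unweighted = unweighted, weighted = weighted)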

                              Wolfgang, I realized that I had the wrong link. I edited it, and now it should work. And you're right, I should feel very fortunate for having DESeq in my pipeline =]

                              Here's some background information sent from my colleague:

                              We study the mouse reproductive system, and have identified a signature for adult mouse endometrial epithelial cells. We have collected FACS sorted fractions of the epithelial stem cell fraction (UtESC+), the epithelial non-stem cell fraction (UtESC-), and the stromal fraction (CD90).

                              I re-ran the test using
                              > cds <- estimateDispersions( cds, method="per-condition" )
                              and looked at the p-values again. There are only 13 genes of ~30,000 with adjusted p-values of less than 0.01, and a more reasonable 309 genes with non-adjusted p-values of less than 0.01. These are much lower numbers than I usually get, but it may be the only thing I can do with this data. Do you agree?
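
                              For reference, a minimal sketch of how such counts can be tallied from the nbinomTest() output (using the UtESC- vs UtESC+ comparison and the condition labels from the first post):

                              Code:
                              res_um_up <- nbinomTest(cds, "UM", "UP")
                              sum(res_um_up$padj < 0.01, na.rm=TRUE)   # genes with adjusted p-value below 0.01
                              sum(res_um_up$pval < 0.01, na.rm=TRUE)   # genes with unadjusted p-value below 0.01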

                              Thank you for your continued help,
                              Artur
