Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • sebastion
    Junior Member
    • Nov 2011
    • 6

    DESeq V DESeq2

    Hi

    I get alot more DE genes when I analyse my RNA Seq dataset using the current version of DESeq versus DESeq2 -just using the default settings for both.

    Its a simple design of 10 disease V 10 control samples but I get an extra ~1500 genes identified as significantly DE at an FDR of 5% using DESeq2.

    I know DESeq2 has more powerful statistic etc but is this great of a difference likely or could I be doing something wrong...(more likely!) .

    I'd appreciated any advice!

    thanks!
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    This is actually interesting. Would you mind sending me your count table? (Blank out gene names and obscure sample names if your data is confidential.)

    Comment

    • NicoBxl
      not just another member
      • Aug 2010
      • 264

      #3
      Yes that seems interesting to analyze. 1500 extra DE genes seems a lot.. I suppose you used the same count matrix as input ?

      Comment

      • sebastion
        Junior Member
        • Nov 2011
        • 6

        #4
        Thanks Simon -I've sent you the count table via email if that's OK.

        Yep its the same count matrix.

        Comment

        • NateP
          Junior Member
          • Sep 2011
          • 9

          #5
          I am actually seeing something quite similar.
          Using the same counts table, DESeq2 is giving vastly more differentially expressed genes than DESeq (at FDR 0.05).

          (3 biological rep vs 3 biological rep comparisons)

          Set1: 5469 DEG with DESeq, 13307 DEG with DESeq2
          Set2: 15123 DEG with DESeq, 22127 DEG with DESeq2
          Set3: 3818 DEG with DESeq, 10221 DEG with DESeq2

          That just seems a bit odd to me.
          (Using R 3.0.1)

          Comment

          • glados
            Member
            • Mar 2012
            • 59

            #6
            Same thing for me, getting A LOT of more significantly different genes (tried with 2 different projects). Though I assumed that was intended.

            Comment

            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              Well, yes, it is intended. DESeq had a rather ugly hack to ensure type-I error control, which cost a lot of power, and fixing this was the core aim of the development of DESeq2.

              It's only that for the datasets we used to test it on the improvement was not as dramatic - hence my surprise. However, we used designed experiments with cell cultures to test, and we expect that the improvement will be more pronounced for experiments involving patient data with different genotype, which show stronger within-group variability. So, I guess this is all as it should be, and our improvement was just really needed.

              Comment

              • phred
                Member
                • May 2012
                • 11

                #8
                hi simon

                as a converse to the above

                i used deseq to analyse data from human t cells from controls (n=6) vs disease (n=5). the analysis that showed differential expression was based on a dispersion estimate using the method "per-condition" and sharing mode "fit-only"

                now having analysed the same samples using the standard deseq2 analysis, there is no differential expression between the 2 groups (at a fdr<0.1).

                is it reasonable to conclude that there is in fact no difference between groups as indicated by the deseq2 analysis and that the only reason deseq gave a significant result was becuase the tweaks above rendered the analysis less robust?

                Comment

                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  Why did you use the "fit-only" mode? Ysing DESeq with "fit-only" is not robust, which is why we advise against using it except in special cases. The more robust analysis with DESeq2 is most likely closer to the truth.

                  Comment

                  • choishingwan
                    Member
                    • Feb 2012
                    • 21

                    #10
                    Hi there, in my case, I got some genes that were significant (padj of 0.000000000495 in DESeq and also 2.06326E-15 in edgeR) but got NA in DESeq2. I am not sure what is the particular behaviour of NA or when will DESeq2 return NA. I used the standard pipeline of DESeq2 as written in the manual and the counts for this gene are as follow:
                    Case:
                    7 30 60 24 35
                    Control:
                    127 188 101 236 116

                    And I did get some output from DESeq2, not all fields are NA:
                    baseMean log2FoldChange lfcSE pvalue padj
                    89.18039938 1.87590805 0.299465406 NA NA

                    Comment

                    • Michael Love
                      Senior Member
                      • Jul 2013
                      • 333

                      #11
                      hi choishingwan,

                      Could you provide the version number for DESeq2? Also could you provide the output of mcols( dds )[geneNumber, ]

                      It could either be due to a count outlier (although it doesn't appear as if any of these are outliers) or due to independent filtering if you are using the latest development branch.

                      I plan to add a metadata column which would explain why a p-value is set to NA.

                      Comment

                      • choishingwan
                        Member
                        • Feb 2012
                        • 21

                        #12
                        Hi Michael,

                        Thank you for your reply, my sessionInfo are as follow:

                        Code:
                        R version 3.0.1 (2013-05-16)
                        Platform: x86_64-w64-mingw32/x64 (64-bit)
                        
                        locale:
                        [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    
                        
                        attached base packages:
                        [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
                        
                        other attached packages:
                        [1] DESeq2_1.0.18           RcppArmadillo_0.3.900.0 Rcpp_0.10.4             lattice_0.20-15         Biobase_2.20.1          GenomicRanges_1.12.4    IRanges_1.18.2          BiocGenerics_0.6.0     
                        
                        loaded via a namespace (and not attached):
                         [1] annotate_1.38.0      AnnotationDbi_1.22.6 DBI_0.2-7            genefilter_1.42.0    grid_3.0.1           locfit_1.5-9.1       RColorBrewer_1.0-5   RSQLite_0.11.4       splines_3.0.1        stats4_3.0.1         survival_2.37-4     
                        [12] XML_3.98-1.1         xtable_1.7-1

                        When using mcols(dds)[geneNumber,], I've got the following:

                        Code:
                        DataFrame with 1 row and 26 columns
                           baseMean   baseVar   allZero dispGeneEst dispGeneEstConv    dispFit dispersion  dispIter  dispConv dispOutlier   dispMAP Intercept condition_Control_vs_Case SE_Intercept SE_condition_Control_vs_Case WaldStatistic_Intercept
                          <numeric> <numeric> <logical>   <numeric>       <logical>  <numeric>  <numeric> <numeric> <logical>   <logical> <numeric> <numeric>                 <numeric>    <numeric>                    <numeric>               <numeric>
                        1   89.1804  4592.645     FALSE   0.1996928            TRUE 0.02251914  0.1365872         9      TRUE       FALSE 0.1365872  5.166256                  1.875908    0.2398174                    0.2994654                21.54245
                          WaldStatistic_condition_Control_vs_Case WaldPvalue_Intercept WaldPvalue_condition_Control_vs_Case WaldAdjPvalue_Intercept WaldAdjPvalue_condition_Control_vs_Case  betaConv  betaIter  deviance  maxCooks cooksOutlier
                                                        <numeric>            <numeric>                            <numeric>               <numeric>                               <numeric> <logical> <numeric> <numeric> <numeric>    <logical>
                        1                                 6.26419                   NA                                   NA                      NA                                      NA      TRUE         9  95.78522  2.211495         TRUE


                        It seems messy when I paste the output here, is there anyway for me to make it more organized?
                        Last edited by choishingwan; 07-24-2013, 11:48 PM. Reason: Just make the output more organize

                        Comment

                        • Simon Anders
                          Senior Member
                          • Feb 2010
                          • 995

                          #13
                          It seems messy when I paste the output here, is there anyway for me to make it more organized?
                          Use the CODE markup:
                          Code:
                           Put "[CODE_]" (without the "_") before and "[/CODE_]" 
                          (again, without the "_") after the block to make it appear in monospace and
                          with line breaks at the right places.
                          .

                          Comment

                          • Michael Love
                            Senior Member
                            • Jul 2013
                            • 333

                            #14
                            The Cook's cutoff used here is the .75 quantile of the F(p, m-p) distribution. So for 2 parameters (intercept and condition) and 10 samples, we use a cutoff of:

                            > qf(.75, 2, 10 - 2)
                            [1] 1.656854

                            The sample with count 60 is getting 2.211495, a high estimate of Cook's distance (meaning the log fold change would change a lot if this sample were removed), although from looking at the counts I agree that this is not desirable behavior to give this gene a p-value of NA. I will investigate how we might better set this cutoff.

                            You can set a higher cutoff manually or turn off the filtering by Cook's with:

                            dds <- estimateSizeFactors(dds)
                            dds <- estimateDispersions(dds)
                            # or use a higher cutoff:
                            p <- 2
                            m <- 10
                            dds <- nbinomWaldTest(dds, cooksCutoff=qf(.95,p,m-p))

                            # turn off filtering:
                            dds <- nbinomWaldTest(dds, cooksCutoff=FALSE)
                            Last edited by Michael Love; 07-25-2013, 02:22 AM.

                            Comment

                            • choishingwan
                              Member
                              • Feb 2012
                              • 21

                              #15
                              I see, maybe I should try turning off the Cook's filtering and see if these NA still persist.

                              Thank you

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 08:59 AM
                              0 responses
                              13 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              22 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...