
  • #16
    By turning off the Cook's cutoff I was able to obtain a p-value for all the genes that had a p-value from DESeq and edgeR but not from DESeq2. However, I also noticed something interesting: some of the genes have a different p-value before and after turning off the Cook's cutoff (the change was small, though).

    Using mcols shows that all the values are the same. I guess the difference is not that big, so it should be fine.
    Code:
    On	Off
    2.0392E-06	1.91466E-06
    6.24279E-06	5.89022E-06
    3.3727E-06	3.17301E-06
    8.30502E-06	7.83552E-06
    2.62679E-06	2.46954E-06
    3.31702E-06	3.11984E-06
    5.42979E-06	5.11889E-06
    2.6574E-06	2.49848E-06
    3.11848E-06	2.93252E-06
    1.50445E-05	1.42226E-05
    5.14826E-06	4.85033E-06
    2.46503E-06	2.31563E-06
    3.11574E-06	2.92985E-06
    3.64453E-06	3.43143E-06
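
    For reference, the comparison can be sketched like this (the name `dds` is a placeholder for a DESeqDataSet already run through DESeq()):

    ```r
    # Default results: genes with a count outlier (large Cook's distance)
    # get NA p-values; with cooksCutoff = FALSE they are tested as well.
    res_on  <- results(dds)                       # Cook's cutoff applied
    res_off <- results(dds, cooksCutoff = FALSE)  # Cook's filtering disabled

    # Genes whose p-value is only defined once the cutoff is off:
    recovered <- rownames(res_off)[is.na(res_on$pvalue) & !is.na(res_off$pvalue)]

    # Side-by-side p-values for genes tested in both runs:
    head(cbind(on = res_on$pvalue, off = res_off$pvalue))
    ```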

    Comment


    • #17
      I would guess that these are adjusted p-values?

      By excluding the genes with an outlier count (which is independent from the test statistic under the null hypothesis), we get a small boost in the number of genes with adjusted p-value < alpha.
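
    A toy base-R illustration of why the set of genes matters for the adjustment (the p-values here are made up): Benjamini-Hochberg adjustment depends on the whole set of p-values submitted, so dropping outlier genes shifts padj for the remaining genes even though their raw p-values are untouched.

    ```r
    # Six hypothetical raw p-values
    p <- c(1e-6, 2e-6, 5e-6, 0.2, 0.6, 0.9)

    # BH adjustment over the full set of 6 genes
    p.adjust(p, method = "BH")

    # Same raw p-values, but with the last gene excluded (as if it were
    # a Cook's outlier): the adjusted values for the rest change slightly.
    p.adjust(p[1:5], method = "BH")
    ```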

      Comment


      • #18
        Yes, that's true, those are the padj values

        Comment


        • #19
          Why are DEGs obtained with DESeq no longer DEGs in the results from DESeq2?

          We have also found that DESeq2 is giving many more differentially expressed genes than DESeq (at adjusted p-value 0.05).
          [on our 2 sets of 3 biological replicates at 2 different timepoints]

          Given the explanation by Simon Anders, if I understand correctly, this increase in DEGs in DESeq2 reflects that the original DESeq was too conservative about false positives, so that actual DEGs were being missed.

          However I am curious as to why some of the DEGs we obtained with DESeq are no longer classified as DEGs in the results from DESeq2.

          For example
          if you compare the new DEGs [DESeq2] with the original DEGs [DESeq] (we used a cut-off of log2FC < -1 or > +1), there is always quite a large
          proportion of genes that differ:
          e.g. in our T80 dataset, 55 of 103 under-expressed DEGs are different (58 are the same).
          However, in our T80 dataset NONE of the 29 over-expressed DEGs is the same.

          Has anyone else had a similar experience with their data?

          Thanks!
          Tanya

          Comment


          • #20
            Hi Tanya,

            From what I understand, the log2FC provided by DESeq2 might be a bit different from that produced by the DESeq programme, because DESeq2 performs shrinkage according to the gene counts. If you look at the MA plot on page 8 of the tutorial, it states that:

            "... shows the shrinkage of log2 fold changes resulting from the incorporation of zero-centered normal prior. The shrinkage is greater for the log2 fold change estimates from genes with low counts and high dispersion, as can be seen by the narrowing of spread of leftmost points in the right plot."

            This suggests that the log2 fold change is adjusted according to the gene counts.

            Maybe you can check whether all those genes that were significant in DESeq but not in DESeq2 have low counts? If so, then that might explain the difference.

            Comment


            • #21
              chocicingwan is right: DESeq1 was too strict with its dispersion estimates, but the standard way of estimating log fold changes (as also used in DESeq1) partly counters this.

              It is better to use DESeq2's new functionality to test directly against a threshold:

              res <- results( dds, lfcThreshold=1 )
              subset( res, padj < 0.1 )

              will give you all genes for which the data provide significant evidence (at 10% FDR) that their actual log2 fold change is at least +/-1, while the "traditional" approach of using

              res <- results( dds )
              subset( res, padj < 0.1 & abs( log2FoldChange ) > 1 )

              will give you genes for which there is significant evidence that their change is non-zero and for which the apparent log2 fold change is stronger than +/-1.


              Without fold-change shrinkage, weak genes in particular show exaggerated fold changes, which makes it too easy for them to get over a post-hoc threshold. This is why testing for a non-zero fold change and then thresholding on the estimated LFC was never a very good idea: you get too many of the weaker genes.

              DESeq2 attacks this issue in two ways: (i) shrinkage causes the fold changes of the weak genes to no longer be exaggerated; (ii) the new thresholded test gives you a list whose false discovery rate (FDR) now really fits your test: a gene is considered a false positive if it seems to pass the LFC threshold but does so only due to noise.

              As you see, it was too easy to pass the threshold before. Hence, if you want to get a result list similar to the earlier one, you should use a smaller threshold when you use the newer, more stringent analysis.

              Comment


              • #22
                Hello all,

                I'm trying to compare DESeq2 with DSS. Basically, I'm getting a lot more differentially expressed genes with DESeq2 than with DSS (on the order of 2,000 to 2,500 vs. 500 to 600). We have 2 replicates for each condition. I've counted all genes with FDR (in the case of DSS) or adjusted p-value < 0.01 as significant.

                What could be the reason for such a dramatic difference? Correct me if I'm wrong, but the adjusted p-value in DESeq2 is basically an FDR, right? Adjusted p-values in the DESeq2 output go very low (~10^-250), while FDR values in DSS only go as low as 10^-10. Considering this, it's probably not that surprising that the number of DE genes is so different. But what is the true reason for this, and which result is closer to the "truth"?

                I'd be grateful for any pointers.

                Comment


                • #23
                  It is very hard to answer your question without more information. What is the biology of your experiment? What are you comparing? What do the diagnostic plots (MA plot and dispersion plot) look like?

                  Comment


                  • #24
                    Hello,

                    I have a question regarding the replaceOutliers function. More precisely, it looks like after using replaceOutliers, there are cases where a non-outlier becomes an outlier.

                    I have 9 samples: 3 replicates for 3 conditions (6 days, 12 days and 28 days, as factors).

                    Here is what I do and what I get:
                    > des2.dataset <- DESeqDataSetFromMatrix(countData = countDF, colData = pheno[colnames(countDF),], design = ~ days)

                    countDF is just a data frame of raw counts. pheno is a data frame with information for each sample, including column "days".

                    > des2.mf <- DESeq(des2.dataset)

                    > des2.replace <- DESeq(replaceOutliers(des2.mf,minReplicates = 3))

                    Now I have 160 genes that by the default detection method were not detected as outliers, but after outlier replacement they became outliers.

                    Here is an example of one such gene:

                    > assays(des2.mf)[["cooks"]]["12484",]
                    RME180 RME181 RME182 RME183 RME184 RME185 RME186
                    0.1072452 6.6523800 2.3861998 0.1974414 0.1226845 3.0418715 0.2286905
                    RME187 RMN013
                    3.1039154 11.1255294

                    > assays(des2.replace)[["cooks"]]["12484",]
                    RME180 RME181 RME182 RME183 RME184 RME185
                    0.07305022 7.57479726 2.35945901 0.12521243 0.00511494 3.05473959
                    RME186 RME187 RMN013
                    0.53951600 3.10378440 13.54871243


                    The Cook's distance has changed, and now sample RMN013 has become an outlier... but the counts didn't change:

                    > assays(des2.mf)[["counts"]]["12484",]
                    RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
                    18078 28989 1894 28078 4692 27 2373 0 1954
                    > assays(des2.replace)[["counts"]]["12484",]
                    RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
                    18078 28989 1894 28078 4692 27 2373 0 1954

                    So my question would be exactly that: why did the Cook's distance change?
                    Another point is merely a concern over interpretation (given that this is intended behaviour): if I used replaceOutliers, it's hard to justify still having outliers in the data... Furthermore, 5 out of these 160 cases were actually significant before outlier replacement, and I would definitely like to have these 5 in my hit list.

                    Any thoughts are greatly appreciated.

                    Comment


                    • #25
                      Originally posted by Simon Anders View Post
                      It is very hard to answer your question without more information. What is the biology of your experiment? What are you comparing? What do the diagnostic plots (MA plot and dispersion plot) look like?
                      Hello Simon! Thank you for the prompt reply. Here's more information. This is a strand-specific paired-end RNA-seq experiment. I'm only counting read R2, for reasons I won't go into (we are working on some alterations of experimental RNA-seq methodology). The cells are murine DCs stimulated with LPS+IFNg, and the conditions being compared are 2h and 4h post-treatment. There are 2 replicates for each condition.

                      Each replicate/condition is 8.5 Mbases deep, of which 70-75% of reads align uniquely. I've used HTSeq to quantify expression and obtained 5-6 million overall "useful" counts (counts from uniquely mapped reads that aligned to the feature in a strand-specific way).

                      So, here are some diagnostic plots.

                      MA plot of the shrunken data: [image attachment]

                      MA plot of unshrunken data at two different scales (obtained from the original like this: ddsNP <- nbinomWaldTest(ddsMatrix, betaPrior=FALSE)): [image attachment]

                      Here are the dispersion plots. Dispersion of "40 equally spaced genes": [image attachment]

                      Dispersion estimates: [image attachment]
                      Overall, DESeq2 finds about 2,600 differentially expressed genes, while DSS is limited to 500-600. I've considered genes differentially expressed if padj < 0.01 (FDR in the case of DSS).

                      Comment


                      • #26
                        hi Tatsiana,

                        you might consider posting in a new thread for any follow up questions, so it's easier to track Q and A.

                        In the vignette we say "we suggest 7 or more replicates in order to consider replacement for large Cook’s distances"

                        I would not use the replaceOutliers procedure with only 3 replicates. Also, we have moved the replacement procedure inside DESeq. If you examine your data and think the samples called outliers are not, you can just turn off Cook's filtering with results(dds, cooksCutoff=FALSE).

                        Comment


                        • #27
                          apredeus, if I am not mistaken, DSS uses a local FDR procedure on the test statistics to generate the FDR column. So if you have a wide distribution of log fold changes (and a wide distribution of Wald statistics), the locfdr software will match a wide Gaussian distribution and consider many of these genes part of the null distribution. To paraphrase, this procedure expects that a central bulk of the genes are not DE, and uses these genes to estimate the FDR. But in your experiment it looks like you might have many genes changing.
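
                          As a rough sketch of that idea (assuming the locfdr package is what DSS relies on, and using simulated statistics rather than real data):

                          ```r
                          # Sketch only: how a local-FDR fit behaves when many genes change.
                          # locfdr fits an empirical null to the central bulk of the
                          # statistics; a wide bulk means a wide estimated null.
                          library(locfdr)  # Efron's local FDR package

                          set.seed(1)
                          # Simulated Wald-type statistics: a broad central bulk plus a
                          # shifted group, mimicking an experiment with many changing genes.
                          z <- c(rnorm(9000, sd = 1.5), rnorm(1000, mean = 4))

                          fit <- locfdr(z)
                          # fit$fdr holds the local fdr per statistic; with a wide fitted
                          # central distribution, even fairly large z values get a local
                          # fdr near 1 and are not called significant.
                          ```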

                          Note that if you want to home in on larger fold changes across conditions, you can use the lfcThreshold argument of results. See ?results for more information.

                          Comment


                          • #28
                            To add to Mike's answer: Your FDR of 0.01 is very stringent. With 0.1 you will get even more genes. But you should not be surprised about that. These seem to be cells cultured in a very well-controlled setting: the asymptote of the dispersion plot is around 10^-2, i.e., the coefficient of variation between replicates is less than 10%.

                            Hence, you are powered to see even rather weak effects. If your treatment is now harsh enough to disturb the cells strongly, most genes will feel this, though maybe only indirectly.

                            What you are interested in is the genes that react _strongly_. Hence, forget about p-values; look at the genes with a strong shrunken fold change. Or use DESeq2's new "thresholded" test to get only those genes whose response to the treatment exceeds a threshold that reasonably indicates what you would consider biologically relevant.

                            Comment


                            • #29
                              Simon, Michael - thank you for your comments!

                              Originally posted by Simon Anders View Post
                              To add to Mike's answer: Your FDR of 0.01 is very stringent. With 0.1 you will get even more genes. But you should not be surprised about that. These seem to be cells cultured in a very well-controlled setting: the asymptote of the dispersion plot is around 10^-2, i.e., the coefficient of variation between replicates is less than 10%.
                              Yes, I understand, but you know, when you say that between 2h and 4h post-treatment there are about 5 thousand differentially expressed genes, people are going to freak out. However, even when I filter the genes, leaving only those with abs(log2FC) > 1, I still get more than a thousand, so it's certainly real.

                              However, DSS was only giving me ~500 genes, of which 350 had log2FC above 1 or below -1. So I got very suspicious, because in published papers DSS and DESeq2 gave very comparable results.

                              What is the right way to compare the results pre-FDR? Correlation coefficients of the ranked lists of the top 10k genes, for example? Or distributions of unadjusted p-values?
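
                              Concretely, something like this is what I have in mind (the names res_deseq2 / res_dss and their column names are placeholders for the two result tables, matched by gene ID):

                              ```r
                              # Rank-correlate the two methods on their raw p-values
                              # over the genes present in both result tables:
                              common <- intersect(rownames(res_deseq2), rownames(res_dss))
                              cor(rank(res_deseq2[common, "pvalue"]),
                                  rank(res_dss[common, "pval"]),
                                  method = "spearman")

                              # And compare the distributions of unadjusted p-values:
                              hist(res_deseq2$pvalue, breaks = 50)
                              hist(res_dss$pval, breaks = 50)
                              ```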

                              Hence, you are powered to see even rather weak effects. If your treatment is now harsh enough to disturb the cells strongly, most genes will feel this, though maybe only indirectly.

                              What you are interested in is the genes that react _strongly_. Hence, forget about p-values; look at the genes with a strong shrunken fold change. Or use DESeq2's new "thresholded" test to get only those genes whose response to the treatment exceeds a threshold that reasonably indicates what you would consider biologically relevant.
                              Regarding this, I had a few questions. First of all, the options listed in the manual (lfcThreshold, altHypothesis) seem to be absent in the version of DESeq2 I actually have installed; here is the argument list of the results function:

                              Code:
                              function (object, name, contrast, cooksCutoff, independentFiltering = TRUE,
                                  alpha = 0.1, filter, theta = seq(0, 0.95, by = 0.05),
                                  pAdjustMethod = "BH")
                              Second, I wanted to ask: when you print the log2 fold change in the final table (results), is it shrunken or not?

                              Thank you again for the help, I appreciate it.

                              Comment


                              • #30
                                Be careful when searching for documentation online: you may find the documentation for the release or development version while you have an older version on your computer.

                                You can read the correct documentation (the one paired with the version of Bioconductor you have installed) from within R with e.g. ?DESeq and vignette("DESeq2") or browseVignettes("DESeq2")

                                You can find out the version you have with sessionInfo(). We print the version number of our software on the first and last page of the vignette.

                                The LFCs in the results table are shrunken. (In the next release, v1.6 in Oct 2014, there is an option to print the unshrunken LFC alongside.)

                                Comment
