Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Large difference in number of genes DESeq and EdgeR detect

    I recently posted a reply in another thread as I thought my question was similar, apparently not, so apologies for any confusion.

    I have a couple of questions regarding the magnitude of the differences of the number of genes that DESeq2 and EdgeR find.

    I have 12 samples, 4 sets of triplicates. 2 sets of triplicates are from around a wound site on a cut plant stem, one treated with a bacterial pathogen and one not. The other two are from a point distant to the wound site, again one treated with a pathogen, one not. All samples were taken two days after inoculation. This is from a small exploratory experiment which will help in the design of other experiments mapping out the initial course of the infection.

    When I compare any two sites using DESeq2, I get a number of genes. When I compare using EdgeR, the number of genes identified as D.E. is two to three orders of magnitude greater.

    On the off chance, that I've done something horrendously wrong the code that I'm using is as follows for DESeq:

    Code:
    design = data.frame(
      row.names = colnames( counts ),
      condition=c("Psa","Psa","Psa","PsaD","PsaD","Control","Control","Control","ControlD","ControlD"),
      libtype = c("paired-end","paired-end","paired-end","paired-end","paired-end","paired-end","paired-end","paired-end","paired-end","paired-end")
    )
    colData <- data.frame(row.names=colnames( counts), condition=as.factor(design$condition))
    dds <- DESeqDataSetFromMatrix(countData = counts,colData = colData, design =~  condition)
    dds <- DESeq(dds)
    results <- results(dds, lfcThreshold=2, altHypothesis="greaterAbs")
    res <- na.omit(results)
    res=res[res$padj < 0.05,]
    This gives me 48 D.E genes, mostly down regulated. As advised, I've used the banded hypothesis testing and unless, I'm reading it wrong, this should give me genes with a log2FC greater than +/-2.

    For EdgeR,
    Code:
    group <- factor(c("Psa","Psa","Psa","PsaD","PsaD","Sdw","Sdw","Sdw","SdwD","SdwD"))
    
    y <- DGEList(counts=counts,group=group)
    y <- calcNormFactors(y)
    y <- estimateCommonDisp(y)
    y <- estimateTagwiseDisp(y)
    
    design <- model.matrix(~0+group, data=y$samples)
    colnames(design) <- levels(y$samples$group)
    
    fit <- glmFit(y, design)
    lrt <- glmLRT(fit, contrast=c(-1,0,1,0))
    tags<-as.data.frame(topTags(lrt,39039,sort.by="logFC"))
    tags<- tags[tags$FDR < 0.05,]
    tagsUp<- tags[tags$logFC > 2,]
    tagsDown<- tags[tags$logFC < 2,]
    This gives me 97 D.E genes up-regulated and 3134 genes D.E. down-regulated.

    In the previous thread I commented in, it was suggested that I plot various statistics from the two groups and compare to see if there were large differences or just small numerical variation. The adjusted p values for the two groups were essentially the same, i.e. very, very low. There was some difference in the estimated fold changes, with the logFC being calculated as almost twice (~2 vs ~4/5) as high in EdgeR as it was in DESeq. I don't know if this constitutes a large difference or not, though it seems quite high to me, given that it's a logFC.

    It was also suggested that a large discrepancy could be due to the Cooks cutoff metric that DESeq uses. So I got the unfiltered DESeq results using:
    Code:
    rresults <- results(dds,contrast=c("condition","Control","Psa"), lfcThreshold=2, altHypothesis="greaterAbs",independentFiltering=FALSE,cooksCutoff=FALSE)
    This didn't really change the number of genes DESeq2 found (an extra 10 genes). I was assuming that that would rule out the Cooks cutoff being the cause of the discrepancy, though if I collect the list of genes that EdgeR identifies in a variable called edgerDown and plot the corresponding cooks cutoff data from DESeq2,
    Code:
    plot(factor(edgerDown),cooks[edgerDown,1:5],ylim=c(0,10))
    There does look like a large number of genes with a least one point looking like an outlier. I don't know whether those outliers are enough for DESeq to set them to NA without knowing how to find the cooks cutoff (so I can figure out where the 99th percentile is), which I can't figure from the DESeq documentation.

    So.

    Any thoughts as to where/how the huge discrepancy might be coming from? Or as to how I could go about figuring out how to find out where it's from? I'm not expecting them to be exactly the same obviously, but I would have thought they'd be in the same general ball park. Coincidently, if I use Limma/Voom, the order of magnitude difference increases again :-/

    Also, what constitutes a large difference in the calculated fold changes of the two methods?

    Cheers
    Ben.

  • #2
    hi Ben,

    DESeq2 with results() and no arguments is contrasting the last over the first level of condition. As we say in the vignette, if you do not specify these levels, they are chosen alphabetically.

    The best approach when you have anything other than a simple two group comparison is to use the contrast argument of results() to specify exactly what you want to compare, and the software does the rest for you, e.g.:

    results(dds, contrast=c("condition","Psa","Control"))

    So it appears for DESeq2 you now are contrasting PsaD over Control. (It will also tell you the contrast that was performed at the top of the results object when printed to the console, and in mcols(res).)

    edgeR also uses R's factor and model.matrix, so here you should have alphabetically chosen levels unless you specify otherwise. So for edgeR it looks like you are contrasting Sdw over Psa. Is Sdw the same as Control?

    So it seems like you are not making the same comparison.

    Comment


    • #3
      Apologies, that was a copy/paste error on my part.
      The DESeq comparisons that I've made are:
      Code:
      results <- results(dds,contrast=c("condition","Control","Psa"), lfcThreshold=2, altHypothesis="greaterAbs")
      and
      results <- results(dds,contrast=c("condition","Control","Psa"), lfcThreshold=2, altHypothesis="greaterAbs",independentFiltering=FALSE,cooksCutoff=FALSE)
      The first of which gives 48 genes, the second gives 57.

      The EdgeR comparison that I've made is
      Code:
      group <- factor(c("Psa","Psa","Psa","PsaD","PsaD","PsaD","Sdw","Sdw","Sdw","SdwD","SdwD","SdwD"))
      ...
      lrt <- glmLRT(fit, contrast=c(-1,0,1,0))
      Which, again, unless I'm reading the EdgeR manual incorrectly, should be comparing Psa and Sdw, being the first and third components of the design matrix.

      Sdw is for Sterile distilled water - yes, it is the control. I probably should keep the naming conventions the same across the methods to avoid confusing - I'll change that.

      Comment


      • #4
        Ok, so from that code it looks like you are testing the same contrast. However, the tests are not exactly the same. In DESeq2 the p-value is from a test that the absolute value of the LFC is greater than 2, while in edgeR the p-value is from a test that the absolute value of the LFC is greater than 0. For comparison can you try

        Code:
        res = results(dds, contrast=c("condition","Control","Psa"), addMLE=TRUE)
        sum(res$padj < .05 & abs(res$lfcMLE) > 2, na.rm=TRUE)

        Comment


        • #5
          I'm assuming that plotMA() will plot using the lfcMLE if it's present over the log2FoldChange?

          So in edgeR, to get those 3200 DE genes, I'm using the test, which, as you point out is testing that the p-value that the absolute LFC is >0. I'm then filtering on the logFC:
          Code:
          tags<- tags[tags$logFC > 2,]
          tags<- tags[tags$FDR < 0.1,]
          Which I thought was analogous to what I was orginally doing with DESeq2, i.e. testing lfc>0 and then filtering on logfc>2.

          Code:
          results <- results(dds,contrast=c("condition","Control","Psa"), lfcThreshold=2, altHypothesis="greaterAbs")
          sum(res$padj < 0.05,na.rm=TRUE)
          (i.e. what I have been doing) gives me 48 genes. This being the testing that the lfc > 2

          Code:
          res = results(dds, contrast=c("condition","Control","Psa"), addMLE=TRUE)
          sum(res$padj < .05 & abs(res$lfcMLE) > 2, na.rm=TRUE)
          gives me 506. Does this test that the lfcMLE is >0 or does it test the log2FoldChange > 0 and add in the lfcMLE so that I can filter on that?

          Which sounds pretty much as I would expect. Though. Doing it this way, using the lfcMLE, doesn't that introduce back into the analysis all of the high variation associated with low counts, which elsewhere is assiduously done away with? And since EdgeR also takes that into account, wouldn't I now be comparing two (even more) different analyses, one with high variation at low counts and one with low variation at low counts? Or am I missing something?

          Ben.

          Comment


          • #6
            What do lfcThreshold/altHypothesis do? The answer is right at hand. check the man page for ?results:

            lfcThreshold - a non-negative value, which specifies the test which should be applied to the log2 fold changes. The standard is a test that the log2 fold changes are not equal to zero. However, log2 fold changes greater or less than lfcThreshold can also be tested....

            This is also described in a section on threshold tests in our paper:



            and in the DESeq2 vignette in the section "Tests of log2 fold change above or below a threshold".

            Using lfcThreshold means that the p-values are from a much more stringent test of |LFC| > lfcThreshold. It's not simply a filter on large LFCs with a test of LFC = 0.

            Now that you are using more comparable steps for edgeR and DESeq2 (a test of LFC = 0 followed by a filter on MLE LFC) the numbers are coming closer together. It looks above like you are using 5% FDR for DESeq2 and 10% FDR for edgeR now.

            The tests (so the p-values) from both software pkgs take into account the high variance at low counts, and therefore filtering on p-values would take care of this in both cases.

            I'm not recommending that you to filter on lfcMLE, I'm just trying to explain your initial question about why there are many genes from an edgeR test of LFC = 0 and then filtering on |LFC| > 2, while there are far fewer when you test |LFC| > 2. It is because the test is much more stringent.

            Comment


            • #7
              Ahh... that light dawns.

              Thanks for the link. I think I've got it all sorted in my head now.

              Also, I've just noticed that when I was filtering on the lfc in edger I was filter on lfc>2 and lfc < 2 rather than lfc>2 and lfc < -2. Which was silly and changes things dramatically. Which makes the EdgeR results well within the same ballpark as the DESeq2 test on lfc>0. (Still going to use the banded results).

              And entertainingly, all 48 of the genes that I get with DESeq2, banded testing, are also identified by EdgeR, so that's good.

              Many thanks.
              Ben.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin


                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                Yesterday, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              39 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              41 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              35 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              55 views
              0 likes
              Last Post seqadmin  
              Working...
              X