Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Neuromancer
    Member
    • Aug 2011
    • 28

    #91
    Can an updated ENSEMBL database release have an effect on DESeq2 results?

    Well, obviously it can, but the changes in the output I see are rather dramatic.

    Here's in brief my situtation:

    Data from an RNAseq experiment (mouse) was
    mapped to GRCm38p2 in Feb. 2014, using STAR,
    counted with HTSeq (latest version),
    DEG estimations done with DESeq2 (latest version at that time)
    and Ensemble release e75 GTF file.
    Results were interesting and made sense, biologically.

    Now we remapped everything to GRCm38p3
    using Ensemble release e78 GTF and latest versions of those programms (which have not changed much).
    Results are radically different, with some important changes gone, (which we validated by qPCR!)
    (example: a gene with log2FC=-0,63, padj=0.0688 in the first mapping, and log2FC=-0.21, padj=0.22 in the second) - however, the counts for each sample and on average (baseMean) are very very similar, so I guess it's not the mapping that makes the difference.

    Notably the number of annotated genes found in e78 is much higher compared to e75 so I expect this has an influence on the adjusted p-value - but how can it affect the log2FC!? Could it be that the number of genes tested affects the gene models / dispersions that DESeq2 estimates?

    I would be happy about any comments.

    Thanks.

    Comment

    • dpryan
      Devon Ryan
      • Jul 2011
      • 3478

      #92
      Presumably e78 added a bunch of overlapping genes that are resulting in heavily altered counts for those genes.

      Comment

      • Neuromancer
        Member
        • Aug 2011
        • 28

        #93
        Originally posted by dpryan View Post
        Presumably e78 added a bunch of overlapping genes that are resulting in heavily altered counts for those genes.
        Well, at least the raw counts are very(!) similar, not to say: almost exactly the same for most genes.
        Also, with more/better genome information, I think, the DEG estimations should be getting better, not worse, right? We confirmed a couple of genes by qPCR and in many cases they are showing surprisingly accurate matches of the RNAseq results, but then again, we do find some genes where this nice match is completely lost using e78...

        Comment

        • dpryan
          Devon Ryan
          • Jul 2011
          • 3478

          #94
          I wonder if the additional genes in the analysis are skewing the prior in either the fits or the p-value adjustment (just adding a background uniform p-value distribution in that case). That'd be the other likely possibility. What happens if you simply exclude the added genes from the analysis?

          Comment

          • Michael Love
            Senior Member
            • Jul 2013
            • 333

            #95
            re: changing the gene model from e75 to e78: Although it might seem at first that having the same read count for a given gene should mean that the results are identical, consider that all the parameter estimation steps involve looking at all the genes: including the library size normalization, the dispersion estimation, the posterior log fold changes, and the p-value adjustment (as Devon points out). If you want less dependence between genes, you could try using the setting betaPrior=FALSE with the DESeq() function, and then the inference is on maximum likelihood estimates of log fold change. This is a choice for the investigator, if they prefer slightly noisier LFC estimates, but an analysis such that individual genes will have less effect on each other's fold changes. But there will still be dependence between genes on dispersion estimation and p-value adjustment, so you shouldn't expect the p-values to be identical after changing gene models. The inference is nearly identical in terms of hypothesis testing, although genes on the margin of a significance boundary can have p-values move slightly one way or the other.

            Comment

            • Neuromancer
              Member
              • Aug 2011
              • 28

              #96
              Dear Michael and Devon,

              thank you for taking the time to think about this and write an answer.
              These are importnat consideradtions and we will take them into account now.

              The other thing that has changed between the two analyses is of course the DESeq2 version (v1.4.1 vs 1.6.3), which might also influence results slightly, I guess.

              Removing the new IDs is not really an option because it not only new ones, but also substitutes and other changes that are updated with new ensembl releases.

              We'll try the option beta-prior=false to see what effect this has on our data set.

              Thanks again,

              Roman

              Comment

              • Gonza
                Member
                • Mar 2013
                • 78

                #97
                Hi all,

                I am new to DEseq2. Maybe this is silly question, but I do not understand why meanSdPlot fails.... any ideas to solve this problem?


                library("vsn")
                par(mfrow=c(1,3))
                notAllZero <- (rowSums(counts(dds))>0)
                meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1))
                meanSdPlot(assay(rld[notAllZero,]))
                meanSdPlot(assay(vsd[notAllZero,]))

                Error: could not find function "meanSdPlot"


                Thanks

                Comment

                • IsBeth
                  Member
                  • Nov 2013
                  • 28

                  #98
                  Hello!

                  I'm trying DESeq2 with synthetic RNA-seq count data without replicates (40% of the synthetic genes are differentially expressed (DE), with a up - and downregulation ratio of 50%). Although DESeq2 detects these ratios correctly, it doesn't detect the DE genes. The result table is empty, and my MAplot doesn't show differential expression either. Is there any solution to this? Do I have to change any parameter?

                  Comment

                  • Michael Love
                    Senior Member
                    • Jul 2013
                    • 333

                    #99
                    When you ran DESeq(), you should have seen the following message printed to console:

                    Code:
                    estimating dispersion by treating samples as replicates.
                    read the ?DESeq section on 'Experiments without replicates'
                    Did you read this section?

                    Comment

                    • IsBeth
                      Member
                      • Nov 2013
                      • 28

                      Yes, I did. And I'm sorry, since I've seen that this problem is treated in other posts (such as "DESeq2 without biol replicates"). I tried the following code, but it doesn't work by now (I need to obtain a result table with statistics like the adjusted p-values and lfc):

                      design_vector <- c("C", "T")

                      coldat=DataFrame(grp=factor(design_vector), each=1)
                      dds <- DESeqDataSetFromMatrix(raw_filter, colData=coldat, design =~grp)
                      dds <- DESeq(dds)

                      rld <- rlogTransformation( dds )
                      NOTE: fitType='parametric', but the dispersion trend was not well captured by the
                      function: y = a/x + b, and a local regression fit was automatically substituted.
                      specify fitType='local' or 'mean' to avoid this message next time.

                      deseq2.res <- results(dds)


                      What do I have to change? Cause I'm sure I'm doing it completely wrong

                      Comment

                      • Michael Love
                        Senior Member
                        • Jul 2013
                        • 333

                        I guess the point I want to emphasize is the quote from the original DESeq paper, regarding the without-replicates analysis:

                        "Some overestimation of the variance may be expected, which will make that approach conservative."
                        Furthermore, "while one may not want to draw strong conclusions from such an analysis, it
                        may still be useful for exploration and hypothesis generation."

                        The analysis without replicates is very conservative, and only recommended for exploration (for example, you can look at which log fold changes are largest in the MA plot).

                        So to answer your question from before:

                        "it doesn't detect the DE genes. The result table is empty, and my MAplot doesn't show differential expression either"

                        The without-replicates analysis is not expected to produce small p-values or adjusted p-values. We have overestimated the variance by treating the samples from different conditions as replicates for the purpose of estimating variance.

                        Comment

                        • IsBeth
                          Member
                          • Nov 2013
                          • 28

                          The without-replicates analysis is not expected to produce small p-values or adjusted p-values. We have overestimated the variance by treating the samples from different conditions as replicates for the purpose of estimating variance
                          Ok, so when estimating the variance this way, it is not sufficient to detect DE genes? I know that to work with no replicates is not appropiate, but in these cases I need to provide some information of the data for the end user . Isn't there some way to obtain a DE genes list (even if the results are to be taken with care)?
                          (I think that it was possible with DESeq)

                          Comment

                          • Michael Love
                            Senior Member
                            • Jul 2013
                            • 333

                            DESeq() in DESeq2 also estimates p-values, and adjusted p-values (for those genes which pass independent filtering step).

                            I would guess that you are filtering with a p-value or adjusted p-value threshold which is less than all the p-values or adjusted p-values in the object returned by results(), and generating a 0-row dataframe.

                            Comment

                            • IsBeth
                              Member
                              • Nov 2013
                              • 28

                              Hi Michael,

                              You were absolutely right I had filtered the object returned from the results() function with a minimum log2foldChange of 1.5 and an adjusted p-value of 0.01. It was definitively too much. The lfc of the result data is less than 1 in most cases, and the adj p-value is superior than one...

                              With no filtering, I get the results table with the statistics I need, but no DE genes are detected and so none are plotted (with red points). Is there some way tu increase the statistical power (maybe in the results() function)?Or to change some parameters in order to mark genes as differentially expressed (even if their values are not magnific)?

                              Thank you for your time =)

                              Comment

                              • apredeus
                                Senior Member
                                • Jul 2012
                                • 151

                                Here's a bit random and fascinating observation I've made when using DESeq2.

                                I've had three groups of experiments - with conditions labelled as "B", "12", and "24". I've ran pairwise comparison with donor-sensitive design on all three groups, getting differential expression for them in the same way:

                                B vs 12 (cond1 = B, cond2 = 12)
                                B vs 24 (cond1 = B, cond2 = 24)
                                12 vs 24 (cond1 = 12, cond2 = 24)

                                so, in first two cases, log2FC was reported for cond1/cond2 (i.e. positive numbers were for genes down-regulated at 12 or 24). In the latter case, log2FC was reported for cond2/cond1.

                                The differences were subtle enough that it took me a long long while to catch this error. I figured it's probably due to some sorting - either internal or in the condition file. Maybe it would make sense to write explicitly in the logs which one is cond1 and which one is cond2?

                                Cheers

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Pathogen Surveillance with Advanced Genomic Tools
                                  by seqadmin




                                  The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                                  03-24-2025, 11:48 AM
                                • seqadmin
                                  New Genomics Tools and Methods Shared at AGBT 2025
                                  by seqadmin


                                  This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                                  The Headliner
                                  The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                                  03-03-2025, 01:39 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Today, 12:59 PM
                                0 responses
                                6 views
                                0 reactions
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 10:17 AM
                                0 responses
                                7 views
                                0 reactions
                                Last Post seqadmin  
                                Started by seqadmin, 03-20-2025, 05:03 AM
                                0 responses
                                49 views
                                0 reactions
                                Last Post seqadmin  
                                Started by seqadmin, 03-19-2025, 07:27 AM
                                0 responses
                                60 views
                                0 reactions
                                Last Post seqadmin  
                                Working...