Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DEXSeq with large number of samples

    Hello!

    I am new to DEXSeq, and hoping to use it for some relatively large sample sizes (in the 100s). Not surprisingly, the estimateDispersions() function is exceedingly slow for such large sample sizes, even when I use the full 32 cores available to me on a quite powerful computer. So I have a few questions about using DEXSeq with large sample sizes that I was hoping someone (perhaps Alejandro or Simon?) could answer:

    1. If I only am interested in the differential exon expression for a very small subset of genes, is it reasonable to skip the dispersion sharing step (where the dispersion is estimated for all the different exon counting bins, a regression is performed, and the estimated dispersion for a given mean expression is calculated for each bin)? That is to say -- if you have 200 samples, can you trust that the per-exon calculation of dispersion is accurate? If not, what is the number of samples at which you can trust the per-exon calculation? Dispensing with the genome-wide dispersion estimates when possible would certainly help.

    2. More generally, I vaguely understand that the Cox-Reid estimate of dispersion is done because of the small sample size case. If you have a lot of samples, is there a faster approach to estimate the dispersion per exon that I could implement instead? (And what would be enough samples to do this?)

    3. How is estimateDispersions() using the condition information? If I had a few different conditions I wanted to compare against each other, could I use the dispersions calculated with the first set of conditions to do the ANOVA with the second set of conditions? E.g., some N=100 would be from condition A1 and N=100 from condition A2, and a random different N=50 would be from condition B1 and N=150 from condition B2.

    Thank you so much for any and all insight!

  • #2
    re 1.: Yes, with so many samples, you can (and should) skip the sharing step. Maybe use all genes to calculate the size factors, and then subset to only the genes you want before estimating dispersions.
    Instead of calling 'fitDispersionFunction', simply copy the per-gene estimates into the column for the final dispersion value:

    Code:
    fData(ecs)$dispersion <- fData(pasillaExons)$dispBeforeSharing
    re 2.: The per-gene dispersions are calculated by fitting a GLM and maximizing its likelihood (conditioned on the fitted values) as function of the dispersion value, with the Cox-Reid adjustment to reduce the bias due to the conditioning. Even if you used something quicker than fitting the GLM (maybe a method-of-moments approach), a GLM fit has to be done later once more anyway, so if were prohibitively slow, you won't escape the issue.

    re 3.: As a fit is done, the conditions matter. You should pass a formula to estimate dispersions which distinguishes all your sub-groups (assuming that you still have enough degrees of freedom left this way), and then you can keep using these estimates.

    Comment


    • #3
      An additional last thing, what version of DEXSeq are you using? just make sure that you are using a recent version of DEXSeq

      ps. it is always nice/easier for us if to include the sessionInfo() and a reproducible workflow

      Comment


      • #4
        Thank you both so much for such quick and helpful responses! DEXSeq is really a great package, and having helpful developers is, as I'm sure you know, so amazing.

        I am using DEXSeq version 1.6.0 apparently (although I just downloaded it from Bioconductor a couple weeks ago). Full session info and workflow at the bottom of post.

        Couple follow-up questions -- sorry they became so long:

        1. I find that when I copy dispBeforeSharing into the dispersion column, while the testForDEU estimates p-values perfectly fine, estimatelog2FoldChanges returns NA for the log2fold values. If I do run fitDispersionFunction, then estimatelog2FoldChanges is able to calculate values. So I am guessing that the fitted expression values used to calculate log2fold changes are calculated in some way using information from fitDispersionFunction? Thoughts as to why this is happening, or ways to get around it? (since my goal is to test only a couple genes per condition, and so fitDispersionFunction would not be very accurate)

        2. Regarding the conditions being used when estimating dispersions -- my plan is to test 50+ different conditions, each on only a couple of genes (e.g. scan across various areas of the genome for cis-splice eQTLs), so I am not sure it makes sense to put all of them into the formula. But it seems that the dispersion for each exon does not differ significantly between each of my many conditions, since these are just genotypes and not, for example, different tissues -- for example, when I ran estimateDispersions using two different conditions (genotypes) for one chromosome, the values for the exons correlated at 0.92 (Pearson's; 0.98 Spearman's). So I was wondering if it would be reasonable to estimate dispersions using one set of conditions, and then reload a sampleTable with the new set of conditions, then set fData(ecs)$dispersion to be the values estimated from the previous conditions, then run testForDEU. (This works fine within the machinery of the package.) In my mind, this would only be problematic if the dispersion for a given exon differed significantly between condition A1, condition A2, condition B1, and condition B2, which I have no reason to believe is the case. What do you think?

        3. Related to question #2: what if a condition were ordinal or continuous rather than nominal (e.g., in this case, genotype -- but also age, amount of European ancestry, etc)? (I am hoping I will be able to modify the source code so that the negative binomial glm takes a numerical input instead of a factor input.) In this case, the dispersion parameter could not be estimated within conditions, so again, it would be nice to be able to move over the estimations from a separate run of estimateDispersions.

        4. Lastly, do you think you could make up an approximate number of samples at which point we users should skip fitDispersionFunction? N = 50? 100? 200? And conversely, a number of genes at which point fitDispersionFunction becomes accurate/useful (i.e. it clearly wouldn't be if you just had a few genes, but clearly is at 1,000 genes)?

        Thank you again so much!

        Workflow (running on one chromosome at a time):
        > library(DEXSeq)
        > sampleTable <- read.table("samplefile.txt", header = T, row.names = 1)
        > ecs <- read.HTSeqCounts(as.character(sampleTable$countFile), sampleTable, "file.gff")
        > ecs <- estimateSizeFactors(ecs)
        > ecs <- estimateDispersions(ecs, nCores = 32)
        > fData(ecs)$dispersion <- fData(ecs)$dispBeforeSharing
        > ecs <- testForDEU(ecs, nCores = 32)
        > ecs <- estimatelog2FoldChanges(ecs, nCores = 32)

        Output of sessionInfo():
        R version 3.0.1 (2013-05-16)
        Platform: x86_64-pc-linux-gnu (64-bit)
        locale:
        [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
        [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
        [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
        [7] LC_PAPER=C LC_NAME=C
        [9] LC_ADDRESS=C LC_TELEPHONE=C
        [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
        attached base packages:
        [1] parallel stats graphics grDevices utils datasets methods
        [8] base
        other attached packages:
        [1] DEXSeq_1.6.0 Biobase_2.20.1 BiocGenerics_0.6.0
        loaded via a namespace (and not attached):
        [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-6
        [4] GenomicRanges_1.12.5 hwriter_1.3 IRanges_1.18.4
        [7] RCurl_1.95-4.1 Rsamtools_1.12.4 statmod_1.4.18
        [10] stats4_3.0.1 stringr_0.6.2 XML_3.98-1.1
        [13] zlibbioc_1.6.0

        Comment


        • #5
          Hi,

          I would suggest to update your R version (therefore Bioconductor and DEXSeq versions). This should improve the speed of your analysis significantly!

          1. I find that when I copy dispBeforeSharing into the dispersion column, while the testForDEU estimates p-values perfectly fine, estimatelog2FoldChanges returns NA for the log2fold values. If I do run fitDispersionFunction, then estimatelog2FoldChanges is able to calculate values. So I am guessing that the fitted expression values used to calculate log2fold changes are calculated in some way using information from fitDispersionFunction? Thoughts as to why this is happening, or ways to get around it? (since my goal is to test only a couple genes per condition, and so fitDispersionFunction would not be very accurate)
          You are correct, this is due to the fact that a variance stabilizing transformation is applied to the fold changes before converting them to log2 fold changes. The parameters from this transformation are computed in the fitDispersionFunction function. You will find this parameters after running fitDispersionFunction in:

          ecs@dispFitCoefs

          2. Regarding the conditions being used when estimating dispersions -- my plan is to test 50+ different conditions, each on only a couple of genes (e.g. scan across various areas of the genome for cis-splice eQTLs), so I am not sure it makes sense to put all of them into the formula. But it seems that the dispersion for each exon does not differ significantly between each of my many conditions, since these are just genotypes and not, for example, different tissues -- for example, when I ran estimateDispersions using two different conditions (genotypes) for one chromosome, the values for the exons correlated at 0.92 (Pearson's; 0.98 Spearman's). So I was wondering if it would be reasonable to estimate dispersions using one set of conditions, and then reload a sampleTable with the new set of conditions, then set fData(ecs)$dispersion to be the values estimated from the previous conditions, then run testForDEU. (This works fine within the machinery of the package.) In my mind, this would only be problematic if the dispersion for a given exon differed significantly between condition A1, condition A2, condition B1, and condition B2, which I have no reason to believe is the case. What do you think?
          So, if I understand correctly, you want to use DEXSeq to find splicing-QTLs, for example regress on SNPs? Don't have experience with QTLs, but what you mention sounds reasonable. My feeling is that you would have to compute the dispersion estimates and test independently for every SNP independently in order to get accurate dispersion values, since I guess you will be changing the individual combinations for every SNP tested: This will computationally expensive (the glm fitters are slow!), but maybe a big cluster would be able to handle it.

          If you have more than one sample per individual maybe it would be faster to compute for each individual something like we computed in this paper doi:10.1073/pnas.1307202110 (the relative exon usage coefficients), and maybe try another kind of regression or permutation test on this values. This would require a bit more of own coding, but you would have a template in the supplementary materials.

          3. Related to question #2: what if a condition were ordinal or continuous rather than nominal (e.g., in this case, genotype -- but also age, amount of European ancestry, etc)? (I am hoping I will be able to modify the source code so that the negative binomial glm takes a numerical input instead of a factor input.) In this case, the dispersion parameter could not be estimated within conditions, so again, it would be nice to be able to move over the estimations from a separate run of estimateDispersions.
          This also works in the GLM context, but you would have to update your DEXSeq for one of the most recent devel versions (at least one of this year).

          4. Lastly, do you think you could make up an approximate number of samples at which point we users should skip fitDispersionFunction? N = 50? 100? 200? And conversely, a number of genes at which point fitDispersionFunction becomes accurate/useful (i.e. it clearly wouldn't be if you just had a few genes, but clearly is at 1,000 genes)?
          I don't think I have an exact answer for these questions, but with more than 50 samples in one group you can use the dispersion values without sharing information across them. Regarding the how many genes, I guess as long as you have enough data points to fit the function, with 1000 genes would be fine.

          Hope it is useful!
          Alejandro

          Comment


          • #6
            Very useful, thank you so much! I will work on getting R updated on our cluster and hopefully will be able to implement your suggestions, especially re: using a continuous condition. Thanks again!

            Comment


            • #7
              Everything is working well with the updated R and updated DEXSeq per your recommendations.

              I wanted to let you know about some strange results, which I suspect may be related to the large sample size, but wanted to know if you had any input.

              In an otherwise negative study of N=300 samples between 3 conditions, most of the "positive" results, often with extreme p-values, came from lowly expressed genes, e.g.:

              FAM163B, Exon 1: Raw p-value 4.25e-38, Adjusted p-value 4.57e-33
              MTNR1B, Exon 2: Raw p-value 5.47e-58, Adjusted p-value 1.18e-52

              FAM163B has 2 exons, and the total counts across both was 262.
              MTNR1B has 2 exons, total counts across all was 778.

              The plotDEXSeq() charts are attached. As you can see, it doesn't really look like the pattern is one of differential exon usage.

              Click image for larger version

Name:	FAM163B.png
Views:	1
Size:	44.3 KB
ID:	304465

              Click image for larger version

Name:	MTNR1B.png
Views:	1
Size:	44.5 KB
ID:	304466

              A couple other results also returned that were driven entirely by 1 (out of, say, 50) samples. When this one extreme sample was excluded, the association between differential exon expression and condition disappeared.

              I am wondering if you have seen these types of spurious results in genes with low expression (for example, in smaller sample sizes when values less than minCount = 10 were used?), or with one extreme outlier in larger data sets. Perhaps the former can be mitigated just by increasing the minCount value? How did you come up with the value of 10 as default?

              Comment


              • #8
                Yes, I can see how this can happen. Negative-binomial GLMs can be sensitive to this kind of outliers. For this reason, DESeq2, our successor to DESeq, now implements a mechanism to detect and flag such outliers using Cook's distance (which measures how much the fit changes if a single data point is removed). We are currently working on a new version of DEXSeq, which will sit on top of DESeq2 and leverage all its improvement over DESeq. For your application, this should be quite useful.

                BTW, here is our preprint on the new DESeq2 method:

                Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2
                Michael I Love, Wolfgang Huber, Simon Anders
                bioRxiv, doi: 10.1101/002832
                Last edited by Simon Anders; 02-24-2014, 02:54 AM.

                Comment


                • #9
                  In the meantime you can try to increase the minimum counts theshold, 10 counts for 300 samples is definitely not much, that would mean that you need 1 count in 10 samples. Maybe you could even increase it to, say 3000, so you could expect something like ~10 counts per sample.

                  Alejandro

                  Comment


                  • #10
                    Great, thanks again for your input. I will keep an eye out for the new version of DEXSeq for sure.

                    Comment


                    • #11
                      I'm using the last version of DEXSeq (1.10.0) and struggle with my analysis!

                      Data :
                      • 60 individuals
                      • ~ 600K exons
                      • several conditions


                      Summary in DEXSeqDataSet variable :
                      HTML Code:
                      > dxd
                      class: DEXSeqDataSet 
                      dim: 588114 120 
                      exptData(0):
                      assays(1): counts
                      rownames(588114): ENSG00000000003:E001 ENSG00000000003:E002 ...
                        ENSG00000273492:E006 ENSG00000273493:E001
                      rowData metadata column names(5): featureID groupID exonBaseMean
                        exonBaseVar transcripts
                      colnames: NULL
                      colData names(5): sample AFcomp Flowcell countFiles exon
                      HTML Code:
                      formulaFullModel = ~ sample + exon + Flowcell:exon + AFcomp:exon
                      formulaReducedModel = ~ sample + exon + Flowcell:exon
                      I'm running it on a cluster with 12 workers and I never reached the end of the analysis (currently running for more than 20h). It is "stuck" on this step :

                      HTML Code:
                      dxd = estimateDispersions(
                      +     dxd,
                      +     formula = formulaFullModel,
                      +     BPPARAM = MulticoreParam(workers = 12) )
                      Am I doing something wrong ?
                      Should I only work on a subset of exons ?

                      Thanks !
                      Last edited by Yohann; 05-05-2014, 11:18 AM.

                      Comment


                      • #12
                        Hi Yohann,

                        Thanks for reporting this!

                        I just committed DEXSeq version 1.10.1 to the release branch, the hang should be fixed there. You could either get it from the svn server right not or wait for one or two days so it is available via biocLite.

                        Alejandro

                        Comment


                        • #13
                          Alejandro,

                          Is this information still current?

                          Cheers,

                          J.T.

                          Comment


                          • #14
                            Hi,

                            I am using DEXSeq_1.12.2 and I have a similar problem to the original poster in that I have a large RNA-seq experiment with over 100 samples. The estimateDispersion() step is really slow and I would like to skip the sharing step.

                            I tried this step:
                            fData(ecs)$dispersion <- fData(pasillaExons)$dispBeforeSharing

                            But it doesn't seem to work at all. Error is:
                            Error in (function (classes, fdef, mtable) :
                            unable to find an inherited method for function 'fData' for signature '"DEXSeqDataSet"'

                            I used rowData to look at the DexSeq object but $dispBeforeSharing is not one of the fields. Did something change in the new version of Dexseq and is there another way I can copy dispersion values?

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            30 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            32 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            28 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            53 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X