Seqanswers Leaderboard Ad

**Simon Anders** · 01-24-2014, 12:28 AM

re 1.: Yes, with so many samples, you can (and should) skip the sharing step. Maybe use all genes to calculate the size factors, and then subset to only the genes you want before estimating dispersions.
Instead of calling 'fitDispersionFunction', simply copy the per-gene estimates into the column for the final dispersion value:

Code:

fData(ecs)$dispersion <- fData(pasillaExons)$dispBeforeSharing

re 2.: The per-gene dispersions are calculated by fitting a GLM and maximizing its likelihood (conditioned on the fitted values) as function of the dispersion value, with the Cox-Reid adjustment to reduce the bias due to the conditioning. Even if you used something quicker than fitting the GLM (maybe a method-of-moments approach), a GLM fit has to be done later once more anyway, so if were prohibitively slow, you won't escape the issue.

re 3.: As a fit is done, the conditions matter. You should pass a formula to estimate dispersions which distinguishes all your sub-groups (assuming that you still have enough degrees of freedom left this way), and then you can keep using these estimates.

**areyes** · 01-24-2014, 01:28 AM

An additional last thing, what version of DEXSeq are you using? just make sure that you are using a recent version of DEXSeq

ps. it is always nice/easier for us if to include the sessionInfo() and a reproducible workflow

**jcaswell** · 01-24-2014, 03:00 PM

Thank you both so much for such quick and helpful responses! DEXSeq is really a great package, and having helpful developers is, as I'm sure you know, so amazing.

I am using DEXSeq version 1.6.0 apparently (although I just downloaded it from Bioconductor a couple weeks ago). Full session info and workflow at the bottom of post.

Couple follow-up questions -- sorry they became so long:

1. I find that when I copy dispBeforeSharing into the dispersion column, while the testForDEU estimates p-values perfectly fine, estimatelog2FoldChanges returns NA for the log2fold values. If I do run fitDispersionFunction, then estimatelog2FoldChanges is able to calculate values. So I am guessing that the fitted expression values used to calculate log2fold changes are calculated in some way using information from fitDispersionFunction? Thoughts as to why this is happening, or ways to get around it? (since my goal is to test only a couple genes per condition, and so fitDispersionFunction would not be very accurate)

2. Regarding the conditions being used when estimating dispersions -- my plan is to test 50+ different conditions, each on only a couple of genes (e.g. scan across various areas of the genome for cis-splice eQTLs), so I am not sure it makes sense to put all of them into the formula. But it seems that the dispersion for each exon does not differ significantly between each of my many conditions, since these are just genotypes and not, for example, different tissues -- for example, when I ran estimateDispersions using two different conditions (genotypes) for one chromosome, the values for the exons correlated at 0.92 (Pearson's; 0.98 Spearman's). So I was wondering if it would be reasonable to estimate dispersions using one set of conditions, and then reload a sampleTable with the new set of conditions, then set fData(ecs)$dispersion to be the values estimated from the previous conditions, then run testForDEU. (This works fine within the machinery of the package.) In my mind, this would only be problematic if the dispersion for a given exon differed significantly between condition A1, condition A2, condition B1, and condition B2, which I have no reason to believe is the case. What do you think?

3. Related to question #2: what if a condition were ordinal or continuous rather than nominal (e.g., in this case, genotype -- but also age, amount of European ancestry, etc)? (I am hoping I will be able to modify the source code so that the negative binomial glm takes a numerical input instead of a factor input.) In this case, the dispersion parameter could not be estimated within conditions, so again, it would be nice to be able to move over the estimations from a separate run of estimateDispersions.

4. Lastly, do you think you could make up an approximate number of samples at which point we users should skip fitDispersionFunction? N = 50? 100? 200? And conversely, a number of genes at which point fitDispersionFunction becomes accurate/useful (i.e. it clearly wouldn't be if you just had a few genes, but clearly is at 1,000 genes)?

Thank you again so much!

Workflow (running on one chromosome at a time):
> library(DEXSeq)
> sampleTable <- read.table("samplefile.txt", header = T, row.names = 1)
> ecs <- read.HTSeqCounts(as.character(sampleTable$countFile), sampleTable, "file.gff")
> ecs <- estimateSizeFactors(ecs)
> ecs <- estimateDispersions(ecs, nCores = 32)
> fData(ecs)$dispersion <- fData(ecs)$dispBeforeSharing
> ecs <- testForDEU(ecs, nCores = 32)
> ecs <- estimatelog2FoldChanges(ecs, nCores = 32)

Output of sessionInfo():
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] DEXSeq_1.6.0 Biobase_2.20.1 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-6
[4] GenomicRanges_1.12.5 hwriter_1.3 IRanges_1.18.4
[7] RCurl_1.95-4.1 Rsamtools_1.12.4 statmod_1.4.18
[10] stats4_3.0.1 stringr_0.6.2 XML_3.98-1.1
[13] zlibbioc_1.6.0

**areyes** · 01-28-2014, 07:32 AM

Hi,

I would suggest to update your R version (therefore Bioconductor and DEXSeq versions). This should improve the speed of your analysis significantly!

1. I find that when I copy dispBeforeSharing into the dispersion column, while the testForDEU estimates p-values perfectly fine, estimatelog2FoldChanges returns NA for the log2fold values. If I do run fitDispersionFunction, then estimatelog2FoldChanges is able to calculate values. So I am guessing that the fitted expression values used to calculate log2fold changes are calculated in some way using information from fitDispersionFunction? Thoughts as to why this is happening, or ways to get around it? (since my goal is to test only a couple genes per condition, and so fitDispersionFunction would not be very accurate)

You are correct, this is due to the fact that a variance stabilizing transformation is applied to the fold changes before converting them to log2 fold changes. The parameters from this transformation are computed in the fitDispersionFunction function. You will find this parameters after running fitDispersionFunction in:

ecs@dispFitCoefs

2. Regarding the conditions being used when estimating dispersions -- my plan is to test 50+ different conditions, each on only a couple of genes (e.g. scan across various areas of the genome for cis-splice eQTLs), so I am not sure it makes sense to put all of them into the formula. But it seems that the dispersion for each exon does not differ significantly between each of my many conditions, since these are just genotypes and not, for example, different tissues -- for example, when I ran estimateDispersions using two different conditions (genotypes) for one chromosome, the values for the exons correlated at 0.92 (Pearson's; 0.98 Spearman's). So I was wondering if it would be reasonable to estimate dispersions using one set of conditions, and then reload a sampleTable with the new set of conditions, then set fData(ecs)$dispersion to be the values estimated from the previous conditions, then run testForDEU. (This works fine within the machinery of the package.) In my mind, this would only be problematic if the dispersion for a given exon differed significantly between condition A1, condition A2, condition B1, and condition B2, which I have no reason to believe is the case. What do you think?

So, if I understand correctly, you want to use DEXSeq to find splicing-QTLs, for example regress on SNPs? Don't have experience with QTLs, but what you mention sounds reasonable. My feeling is that you would have to compute the dispersion estimates and test independently for every SNP independently in order to get accurate dispersion values, since I guess you will be changing the individual combinations for every SNP tested: This will computationally expensive (the glm fitters are slow!), but maybe a big cluster would be able to handle it.

If you have more than one sample per individual maybe it would be faster to compute for each individual something like we computed in this paper doi:10.1073/pnas.1307202110 (the relative exon usage coefficients), and maybe try another kind of regression or permutation test on this values. This would require a bit more of own coding, but you would have a template in the supplementary materials.

3. Related to question #2: what if a condition were ordinal or continuous rather than nominal (e.g., in this case, genotype -- but also age, amount of European ancestry, etc)? (I am hoping I will be able to modify the source code so that the negative binomial glm takes a numerical input instead of a factor input.) In this case, the dispersion parameter could not be estimated within conditions, so again, it would be nice to be able to move over the estimations from a separate run of estimateDispersions.

This also works in the GLM context, but you would have to update your DEXSeq for one of the most recent devel versions (at least one of this year).

4. Lastly, do you think you could make up an approximate number of samples at which point we users should skip fitDispersionFunction? N = 50? 100? 200? And conversely, a number of genes at which point fitDispersionFunction becomes accurate/useful (i.e. it clearly wouldn't be if you just had a few genes, but clearly is at 1,000 genes)?

I don't think I have an exact answer for these questions, but with more than 50 samples in one group you can use the dispersion values without sharing information across them. Regarding the how many genes, I guess as long as you have enough data points to fit the function, with 1000 genes would be fine.

Hope it is useful!
Alejandro

**jcaswell** · 01-28-2014, 10:33 AM

Very useful, thank you so much! I will work on getting R updated on our cluster and hopefully will be able to implement your suggestions, especially re: using a continuous condition. Thanks again!

**jcaswell** · 02-21-2014, 01:30 PM

Everything is working well with the updated R and updated DEXSeq per your recommendations.

I wanted to let you know about some strange results, which I suspect may be related to the large sample size, but wanted to know if you had any input.

In an otherwise negative study of N=300 samples between 3 conditions, most of the "positive" results, often with extreme p-values, came from lowly expressed genes, e.g.:

FAM163B, Exon 1: Raw p-value 4.25e-38, Adjusted p-value 4.57e-33
MTNR1B, Exon 2: Raw p-value 5.47e-58, Adjusted p-value 1.18e-52

FAM163B has 2 exons, and the total counts across both was 262.
MTNR1B has 2 exons, total counts across all was 778.

The plotDEXSeq() charts are attached. As you can see, it doesn't really look like the pattern is one of differential exon usage.

A couple other results also returned that were driven entirely by 1 (out of, say, 50) samples. When this one extreme sample was excluded, the association between differential exon expression and condition disappeared.

I am wondering if you have seen these types of spurious results in genes with low expression (for example, in smaller sample sizes when values less than minCount = 10 were used?), or with one extreme outlier in larger data sets. Perhaps the former can be mitigated just by increasing the minCount value? How did you come up with the value of 10 as default?

**Simon Anders** · 02-24-2014, 02:50 AM

Yes, I can see how this can happen. Negative-binomial GLMs can be sensitive to this kind of outliers. For this reason, DESeq2, our successor to DESeq, now implements a mechanism to detect and flag such outliers using Cook's distance (which measures how much the fit changes if a single data point is removed). We are currently working on a new version of DEXSeq, which will sit on top of DESeq2 and leverage all its improvement over DESeq. For your application, this should be quite useful.

BTW, here is our preprint on the new DESeq2 method:

Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2
Michael I Love, Wolfgang Huber, Simon Anders
bioRxiv, doi: 10.1101/002832

**areyes** · 02-24-2014, 04:06 AM

In the meantime you can try to increase the minimum counts theshold, 10 counts for 300 samples is definitely not much, that would mean that you need 1 count in 10 samples. Maybe you could even increase it to, say 3000, so you could expect something like ~10 counts per sample.

Alejandro

**jcaswell** · 02-24-2014, 10:50 AM

Great, thanks again for your input. I will keep an eye out for the new version of DEXSeq for sure.

**Yohann** · 05-05-2014, 09:56 AM

I'm using the last version of DEXSeq (1.10.0) and struggle with my analysis!

Data :

60 individuals
~ 600K exons
several conditions

Summary in DEXSeqDataSet variable :

HTML Code:

> dxd
class: DEXSeqDataSet 
dim: 588114 120 
exptData(0):
assays(1): counts
rownames(588114): ENSG00000000003:E001 ENSG00000000003:E002 ...
  ENSG00000273492:E006 ENSG00000273493:E001
rowData metadata column names(5): featureID groupID exonBaseMean
  exonBaseVar transcripts
colnames: NULL
colData names(5): sample AFcomp Flowcell countFiles exon

HTML Code:

formulaFullModel = ~ sample + exon + Flowcell:exon + AFcomp:exon
formulaReducedModel = ~ sample + exon + Flowcell:exon

I'm running it on a cluster with 12 workers and I never reached the end of the analysis (currently running for more than 20h). It is "stuck" on this step :

HTML Code:

dxd = estimateDispersions(
+     dxd,
+     formula = formulaFullModel,
+     BPPARAM = MulticoreParam(workers = 12) )

Am I doing something wrong ?
Should I only work on a subset of exons ?

Thanks !

**areyes** · 05-05-2014, 03:49 PM

Hi Yohann,

Thanks for reporting this!

I just committed DEXSeq version 1.10.1 to the release branch, the hang should be fixed there. You could either get it from the svn server right not or wait for one or two days so it is available via biocLite.

Alejandro

**jtatknox** · 07-22-2014, 12:46 PM

Alejandro,

Is this information still current?

Cheers,

J.T.

**kittyl** · 04-27-2015, 03:16 AM

Hi,

I am using DEXSeq_1.12.2 and I have a similar problem to the original poster in that I have a large RNA-seq experiment with over 100 samples. The estimateDispersion() step is really slow and I would like to skip the sharing step.

I tried this step:
fData(ecs)$dispersion <- fData(pasillaExons)$dispBeforeSharing

But it doesn't seem to work at all. Error is:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'fData' for signature '"DEXSeqDataSet"'

I used rowData to look at the DexSeq object but $dispBeforeSharing is not one of the fields. Did something change in the new version of Dexseq and is there another way I can copy dispersion values?

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

DEXSeq with large number of samples

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News