SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
number of cores in DEXSeq using parallel library alittleboy Bioinformatics 8 03-28-2014 08:05 PM
Large number of MACS negative peaks feralBiologist Bioinformatics 0 12-12-2013 04:30 PM
multiplexing large number of samples lilletine Sample Prep / Library Generation 1 04-17-2013 01:33 AM
DEXSeq: Problem running more than 7 pairs of samples har Bioinformatics 8 04-12-2013 01:31 PM
how to specify samples for comparison in dexseq camelbbs Bioinformatics 3 02-01-2013 10:06 AM

Reply
 
Thread Tools
Old 01-23-2014, 12:21 PM   #1
jcaswell
Junior Member
 
Location: California

Join Date: Jan 2014
Posts: 6
Default DEXSeq with large number of samples

Hello!

I am new to DEXSeq, and hoping to use it for some relatively large sample sizes (in the 100s). Not surprisingly, the estimateDispersions() function is exceedingly slow for such large sample sizes, even when I use the full 32 cores available to me on a quite powerful computer. So I have a few questions about using DEXSeq with large sample sizes that I was hoping someone (perhaps Alejandro or Simon?) could answer:

1. If I only am interested in the differential exon expression for a very small subset of genes, is it reasonable to skip the dispersion sharing step (where the dispersion is estimated for all the different exon counting bins, a regression is performed, and the estimated dispersion for a given mean expression is calculated for each bin)? That is to say -- if you have 200 samples, can you trust that the per-exon calculation of dispersion is accurate? If not, what is the number of samples at which you can trust the per-exon calculation? Dispensing with the genome-wide dispersion estimates when possible would certainly help.

2. More generally, I vaguely understand that the Cox-Reid estimate of dispersion is done because of the small sample size case. If you have a lot of samples, is there a faster approach to estimate the dispersion per exon that I could implement instead? (And what would be enough samples to do this?)

3. How is estimateDispersions() using the condition information? If I had a few different conditions I wanted to compare against each other, could I use the dispersions calculated with the first set of conditions to do the ANOVA with the second set of conditions? E.g., some N=100 would be from condition A1 and N=100 from condition A2, and a random different N=50 would be from condition B1 and N=150 from condition B2.

Thank you so much for any and all insight!
jcaswell is offline   Reply With Quote
Old 01-23-2014, 11:28 PM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

re 1.: Yes, with so many samples, you can (and should) skip the sharing step. Maybe use all genes to calculate the size factors, and then subset to only the genes you want before estimating dispersions.
Instead of calling 'fitDispersionFunction', simply copy the per-gene estimates into the column for the final dispersion value:

Code:
fData(ecs)$dispersion <- fData(pasillaExons)$dispBeforeSharing
re 2.: The per-gene dispersions are calculated by fitting a GLM and maximizing its likelihood (conditioned on the fitted values) as function of the dispersion value, with the Cox-Reid adjustment to reduce the bias due to the conditioning. Even if you used something quicker than fitting the GLM (maybe a method-of-moments approach), a GLM fit has to be done later once more anyway, so if were prohibitively slow, you won't escape the issue.

re 3.: As a fit is done, the conditions matter. You should pass a formula to estimate dispersions which distinguishes all your sub-groups (assuming that you still have enough degrees of freedom left this way), and then you can keep using these estimates.
Simon Anders is offline   Reply With Quote
Old 01-24-2014, 12:28 AM   #3
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

An additional last thing, what version of DEXSeq are you using? just make sure that you are using a recent version of DEXSeq

ps. it is always nice/easier for us if to include the sessionInfo() and a reproducible workflow
areyes is offline   Reply With Quote
Old 01-24-2014, 02:00 PM   #4
jcaswell
Junior Member
 
Location: California

Join Date: Jan 2014
Posts: 6
Default

Thank you both so much for such quick and helpful responses! DEXSeq is really a great package, and having helpful developers is, as I'm sure you know, so amazing.

I am using DEXSeq version 1.6.0 apparently (although I just downloaded it from Bioconductor a couple weeks ago). Full session info and workflow at the bottom of post.

Couple follow-up questions -- sorry they became so long:

1. I find that when I copy dispBeforeSharing into the dispersion column, while the testForDEU estimates p-values perfectly fine, estimatelog2FoldChanges returns NA for the log2fold values. If I do run fitDispersionFunction, then estimatelog2FoldChanges is able to calculate values. So I am guessing that the fitted expression values used to calculate log2fold changes are calculated in some way using information from fitDispersionFunction? Thoughts as to why this is happening, or ways to get around it? (since my goal is to test only a couple genes per condition, and so fitDispersionFunction would not be very accurate)

2. Regarding the conditions being used when estimating dispersions -- my plan is to test 50+ different conditions, each on only a couple of genes (e.g. scan across various areas of the genome for cis-splice eQTLs), so I am not sure it makes sense to put all of them into the formula. But it seems that the dispersion for each exon does not differ significantly between each of my many conditions, since these are just genotypes and not, for example, different tissues -- for example, when I ran estimateDispersions using two different conditions (genotypes) for one chromosome, the values for the exons correlated at 0.92 (Pearson's; 0.98 Spearman's). So I was wondering if it would be reasonable to estimate dispersions using one set of conditions, and then reload a sampleTable with the new set of conditions, then set fData(ecs)$dispersion to be the values estimated from the previous conditions, then run testForDEU. (This works fine within the machinery of the package.) In my mind, this would only be problematic if the dispersion for a given exon differed significantly between condition A1, condition A2, condition B1, and condition B2, which I have no reason to believe is the case. What do you think?

3. Related to question #2: what if a condition were ordinal or continuous rather than nominal (e.g., in this case, genotype -- but also age, amount of European ancestry, etc)? (I am hoping I will be able to modify the source code so that the negative binomial glm takes a numerical input instead of a factor input.) In this case, the dispersion parameter could not be estimated within conditions, so again, it would be nice to be able to move over the estimations from a separate run of estimateDispersions.

4. Lastly, do you think you could make up an approximate number of samples at which point we users should skip fitDispersionFunction? N = 50? 100? 200? And conversely, a number of genes at which point fitDispersionFunction becomes accurate/useful (i.e. it clearly wouldn't be if you just had a few genes, but clearly is at 1,000 genes)?

Thank you again so much!

Workflow (running on one chromosome at a time):
> library(DEXSeq)
> sampleTable <- read.table("samplefile.txt", header = T, row.names = 1)
> ecs <- read.HTSeqCounts(as.character(sampleTable$countFile), sampleTable, "file.gff")
> ecs <- estimateSizeFactors(ecs)
> ecs <- estimateDispersions(ecs, nCores = 32)
> fData(ecs)$dispersion <- fData(ecs)$dispBeforeSharing
> ecs <- testForDEU(ecs, nCores = 32)
> ecs <- estimatelog2FoldChanges(ecs, nCores = 32)

Output of sessionInfo():
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] DEXSeq_1.6.0 Biobase_2.20.1 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-6
[4] GenomicRanges_1.12.5 hwriter_1.3 IRanges_1.18.4
[7] RCurl_1.95-4.1 Rsamtools_1.12.4 statmod_1.4.18
[10] stats4_3.0.1 stringr_0.6.2 XML_3.98-1.1
[13] zlibbioc_1.6.0
jcaswell is offline   Reply With Quote
Old 01-28-2014, 06:32 AM   #5
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi,

I would suggest to update your R version (therefore Bioconductor and DEXSeq versions). This should improve the speed of your analysis significantly!

Quote:
1. I find that when I copy dispBeforeSharing into the dispersion column, while the testForDEU estimates p-values perfectly fine, estimatelog2FoldChanges returns NA for the log2fold values. If I do run fitDispersionFunction, then estimatelog2FoldChanges is able to calculate values. So I am guessing that the fitted expression values used to calculate log2fold changes are calculated in some way using information from fitDispersionFunction? Thoughts as to why this is happening, or ways to get around it? (since my goal is to test only a couple genes per condition, and so fitDispersionFunction would not be very accurate)
You are correct, this is due to the fact that a variance stabilizing transformation is applied to the fold changes before converting them to log2 fold changes. The parameters from this transformation are computed in the fitDispersionFunction function. You will find this parameters after running fitDispersionFunction in:

ecs@dispFitCoefs

Quote:
2. Regarding the conditions being used when estimating dispersions -- my plan is to test 50+ different conditions, each on only a couple of genes (e.g. scan across various areas of the genome for cis-splice eQTLs), so I am not sure it makes sense to put all of them into the formula. But it seems that the dispersion for each exon does not differ significantly between each of my many conditions, since these are just genotypes and not, for example, different tissues -- for example, when I ran estimateDispersions using two different conditions (genotypes) for one chromosome, the values for the exons correlated at 0.92 (Pearson's; 0.98 Spearman's). So I was wondering if it would be reasonable to estimate dispersions using one set of conditions, and then reload a sampleTable with the new set of conditions, then set fData(ecs)$dispersion to be the values estimated from the previous conditions, then run testForDEU. (This works fine within the machinery of the package.) In my mind, this would only be problematic if the dispersion for a given exon differed significantly between condition A1, condition A2, condition B1, and condition B2, which I have no reason to believe is the case. What do you think?
So, if I understand correctly, you want to use DEXSeq to find splicing-QTLs, for example regress on SNPs? Don't have experience with QTLs, but what you mention sounds reasonable. My feeling is that you would have to compute the dispersion estimates and test independently for every SNP independently in order to get accurate dispersion values, since I guess you will be changing the individual combinations for every SNP tested: This will computationally expensive (the glm fitters are slow!), but maybe a big cluster would be able to handle it.

If you have more than one sample per individual maybe it would be faster to compute for each individual something like we computed in this paper doi:10.1073/pnas.1307202110 (the relative exon usage coefficients), and maybe try another kind of regression or permutation test on this values. This would require a bit more of own coding, but you would have a template in the supplementary materials.

Quote:
3. Related to question #2: what if a condition were ordinal or continuous rather than nominal (e.g., in this case, genotype -- but also age, amount of European ancestry, etc)? (I am hoping I will be able to modify the source code so that the negative binomial glm takes a numerical input instead of a factor input.) In this case, the dispersion parameter could not be estimated within conditions, so again, it would be nice to be able to move over the estimations from a separate run of estimateDispersions.
This also works in the GLM context, but you would have to update your DEXSeq for one of the most recent devel versions (at least one of this year).

Quote:
4. Lastly, do you think you could make up an approximate number of samples at which point we users should skip fitDispersionFunction? N = 50? 100? 200? And conversely, a number of genes at which point fitDispersionFunction becomes accurate/useful (i.e. it clearly wouldn't be if you just had a few genes, but clearly is at 1,000 genes)?
I don't think I have an exact answer for these questions, but with more than 50 samples in one group you can use the dispersion values without sharing information across them. Regarding the how many genes, I guess as long as you have enough data points to fit the function, with 1000 genes would be fine.

Hope it is useful!
Alejandro
areyes is offline   Reply With Quote
Old 01-28-2014, 09:33 AM   #6
jcaswell
Junior Member
 
Location: California

Join Date: Jan 2014
Posts: 6
Default

Very useful, thank you so much! I will work on getting R updated on our cluster and hopefully will be able to implement your suggestions, especially re: using a continuous condition. Thanks again!
jcaswell is offline   Reply With Quote
Old 02-21-2014, 12:30 PM   #7
jcaswell
Junior Member
 
Location: California

Join Date: Jan 2014
Posts: 6
Default

Everything is working well with the updated R and updated DEXSeq per your recommendations.

I wanted to let you know about some strange results, which I suspect may be related to the large sample size, but wanted to know if you had any input.

In an otherwise negative study of N=300 samples between 3 conditions, most of the "positive" results, often with extreme p-values, came from lowly expressed genes, e.g.:

FAM163B, Exon 1: Raw p-value 4.25e-38, Adjusted p-value 4.57e-33
MTNR1B, Exon 2: Raw p-value 5.47e-58, Adjusted p-value 1.18e-52

FAM163B has 2 exons, and the total counts across both was 262.
MTNR1B has 2 exons, total counts across all was 778.

The plotDEXSeq() charts are attached. As you can see, it doesn't really look like the pattern is one of differential exon usage.

FAM163B.png

MTNR1B.png

A couple other results also returned that were driven entirely by 1 (out of, say, 50) samples. When this one extreme sample was excluded, the association between differential exon expression and condition disappeared.

I am wondering if you have seen these types of spurious results in genes with low expression (for example, in smaller sample sizes when values less than minCount = 10 were used?), or with one extreme outlier in larger data sets. Perhaps the former can be mitigated just by increasing the minCount value? How did you come up with the value of 10 as default?
jcaswell is offline   Reply With Quote
Old 02-24-2014, 01:50 AM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Yes, I can see how this can happen. Negative-binomial GLMs can be sensitive to this kind of outliers. For this reason, DESeq2, our successor to DESeq, now implements a mechanism to detect and flag such outliers using Cook's distance (which measures how much the fit changes if a single data point is removed). We are currently working on a new version of DEXSeq, which will sit on top of DESeq2 and leverage all its improvement over DESeq. For your application, this should be quite useful.

BTW, here is our preprint on the new DESeq2 method:

Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2
Michael I Love, Wolfgang Huber, Simon Anders
bioRxiv, doi: 10.1101/002832

Last edited by Simon Anders; 02-24-2014 at 01:54 AM.
Simon Anders is offline   Reply With Quote
Old 02-24-2014, 03:06 AM   #9
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

In the meantime you can try to increase the minimum counts theshold, 10 counts for 300 samples is definitely not much, that would mean that you need 1 count in 10 samples. Maybe you could even increase it to, say 3000, so you could expect something like ~10 counts per sample.

Alejandro
areyes is offline   Reply With Quote
Old 02-24-2014, 09:50 AM   #10
jcaswell
Junior Member
 
Location: California

Join Date: Jan 2014
Posts: 6
Default

Great, thanks again for your input. I will keep an eye out for the new version of DEXSeq for sure.
jcaswell is offline   Reply With Quote
Old 05-05-2014, 09:56 AM   #11
Yohann
Junior Member
 
Location: Montréal

Join Date: Aug 2013
Posts: 7
Exclamation

I'm using the last version of DEXSeq (1.10.0) and struggle with my analysis!

Data :
  • 60 individuals
  • ~ 600K exons
  • several conditions

Summary in DEXSeqDataSet variable :
HTML Code:
> dxd
class: DEXSeqDataSet 
dim: 588114 120 
exptData(0):
assays(1): counts
rownames(588114): ENSG00000000003:E001 ENSG00000000003:E002 ...
  ENSG00000273492:E006 ENSG00000273493:E001
rowData metadata column names(5): featureID groupID exonBaseMean
  exonBaseVar transcripts
colnames: NULL
colData names(5): sample AFcomp Flowcell countFiles exon
HTML Code:
formulaFullModel = ~ sample + exon + Flowcell:exon + AFcomp:exon
formulaReducedModel = ~ sample + exon + Flowcell:exon
I'm running it on a cluster with 12 workers and I never reached the end of the analysis (currently running for more than 20h). It is "stuck" on this step :

HTML Code:
dxd = estimateDispersions(
+     dxd,
+     formula = formulaFullModel,
+     BPPARAM = MulticoreParam(workers = 12) )
Am I doing something wrong ?
Should I only work on a subset of exons ?

Thanks !

Last edited by Yohann; 05-05-2014 at 11:18 AM.
Yohann is offline   Reply With Quote
Old 05-05-2014, 03:49 PM   #12
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi Yohann,

Thanks for reporting this!

I just committed DEXSeq version 1.10.1 to the release branch, the hang should be fixed there. You could either get it from the svn server right not or wait for one or two days so it is available via biocLite.

Alejandro
areyes is offline   Reply With Quote
Old 07-22-2014, 12:46 PM   #13
jtatknox
Junior Member
 
Location: New York

Join Date: Jul 2014
Posts: 1
Default

Alejandro,

Is this information still current?

Cheers,

J.T.
jtatknox is offline   Reply With Quote
Old 04-27-2015, 03:16 AM   #14
kittyl
Junior Member
 
Location: London

Join Date: Apr 2015
Posts: 1
Default

Hi,

I am using DEXSeq_1.12.2 and I have a similar problem to the original poster in that I have a large RNA-seq experiment with over 100 samples. The estimateDispersion() step is really slow and I would like to skip the sharing step.

I tried this step:
fData(ecs)$dispersion <- fData(pasillaExons)$dispBeforeSharing

But it doesn't seem to work at all. Error is:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'fData' for signature '"DEXSeqDataSet"'

I used rowData to look at the DexSeq object but $dispBeforeSharing is not one of the fields. Did something change in the new version of Dexseq and is there another way I can copy dispersion values?
kittyl is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:03 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO