Hello!
I am new to DEXSeq and hoping to use it with relatively large sample sizes (in the hundreds). Not surprisingly, the estimateDispersions() function is exceedingly slow at that scale, even when I use all 32 cores available to me on a fairly powerful machine. So I have a few questions about using DEXSeq with large sample sizes that I was hoping someone (perhaps Alejandro or Simon?) could answer:
1. If I am only interested in differential exon usage for a very small subset of genes, is it reasonable to skip the dispersion sharing step (where the dispersion is estimated for every exon counting bin, a regression is fitted, and the fitted dispersion at a given mean expression is assigned to each bin)? In other words, with 200 samples, can I trust the per-exon dispersion estimate on its own? If not, at what number of samples does the per-exon estimate become trustworthy? Skipping the genome-wide dispersion estimates when possible would certainly save time.
2. More generally, my vague understanding is that the Cox-Reid dispersion estimate is used mainly because of the small-sample case. With a lot of samples, is there a faster way to estimate the dispersion per exon that I could implement instead? (And how many samples would be enough for that?)
3. How does estimateDispersions() use the condition information? If I had a few different sets of conditions to compare, could I reuse the dispersions calculated with the first set of conditions when running the ANOVA for the second set? E.g., N=100 samples would be from condition A1 and N=100 from condition A2, while a different, random N=50 of the same samples would be from condition B1 and N=150 from condition B2.
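To make question 2 concrete, here is the kind of shortcut I have in mind, written in plain Python rather than R just to show the arithmetic. It is a method-of-moments estimate under the negative binomial assumption var = mu + alpha * mu^2, and it is not what DEXSeq does. The function name `moment_dispersion` and the optional size-factor argument are my own, and the sketch ignores the condition and exon terms of the model entirely.

```python
from statistics import mean, variance


def moment_dispersion(counts, size_factors=None):
    """Method-of-moments NB dispersion for one counting bin.

    counts       -- raw read counts, one per sample (at least two samples)
    size_factors -- hypothetical per-sample normalisation factors; defaults to 1
    Returns NaN when the bin has no reads, and clips negative estimates to 0.
    """
    if size_factors is None:
        size_factors = [1.0] * len(counts)

    # Put all samples on a common scale before taking moments.
    norm = [k / s for k, s in zip(counts, size_factors)]
    mu = mean(norm)
    if mu <= 0:
        return float("nan")

    var = variance(norm)  # unbiased sample variance (n - 1 denominator)

    # Poisson ("shot noise") part of the variance of the normalised counts.
    shot = mu * mean([1.0 / s for s in size_factors])

    # Solve var = shot + alpha * mu^2 for alpha.
    return max((var - shot) / mu ** 2, 0.0)


if __name__ == "__main__":
    # Toy counts for a single bin across eight samples (made-up numbers).
    print(moment_dispersion([12, 30, 7, 22, 18, 41, 9, 25]))
```

My worry is that an estimate like this treats any difference between conditions as extra dispersion, which is part of why I am asking whether something similar but model-aware exists.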
Thank you so much for any and all insight!