Hello all,
I am working on analyzing RNA-Seq samples for differential expression and am perplexed by the distribution of my data, which appears to be under- rather than overdispersed relative to a Poisson distribution.
For background:
1. I created a preliminary de novo assembly of my RNA-Seq transcripts using CLC workbench (reads were first trimmed, etc.; reads were high quality and looked good using the diagnostics in FastQC).
2. I mapped my reads to the de novo assembly using the RNA-Seq mapping function in CLC workbench.
3. I imported my mapped read counts into R to analyze differential expression using edgeR.
In edgeR, my common dispersion parameter is ~2.5. I checked the MA plots and noticed how 'odd' my data distributions appeared. The attached plot is an example of a comparison between lanes using an MA plot (all between-lane comparisons look similar regardless of experimental group).
Histograms similarly confirm that, aside from 'zeros', most of my count data (by lane) is in the thousands of reads per de novo contig.
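For anyone who wants to check the under- versus overdispersion question directly on the counts, here is a minimal sketch in Python. It compares the per-contig sample variance to the mean across replicate lanes: a variance/mean ratio near 1 is what a Poisson distribution would give, below 1 suggests underdispersion, and above 1 suggests overdispersion. It also computes a crude method-of-moments dispersion under the negative binomial relation variance = mean + dispersion * mean^2. This is not edgeR's estimator, and the function name, contig names, and counts below are all made up for illustration.

```python
from statistics import mean, variance


def dispersion_summary(counts):
    """Summarize dispersion for one contig across replicate lanes.

    Returns (mean, sample variance, variance/mean ratio,
    method-of-moments negative binomial dispersion).
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    if m == 0:
        # The ratio and dispersion are undefined for all-zero contigs.
        return m, v, float("nan"), float("nan")
    ratio = v / m                 # 1 for Poisson, <1 under, >1 over
    phi = (v - m) / (m * m)       # from variance = mean + phi * mean^2
    return m, v, ratio, phi


# Hypothetical counts for two contigs in four replicate lanes.
example = {
    "contig_A": [10, 12, 8, 10],
    "contig_B": [1000, 3000, 500, 2500],
}

for name, counts in example.items():
    m, v, ratio, phi = dispersion_summary(counts)
    label = "overdispersed" if ratio > 1 else "not overdispersed"
    print(f"{name}: mean={m:.1f} var={v:.1f} var/mean={ratio:.2f} "
          f"phi={phi:.3f} ({label})")
```

Only replicates within the same experimental group should be passed in together, and library sizes should be comparable (or the counts normalized first), otherwise group differences and sequencing depth inflate the variance.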
My questions are as follows:
1. Have others encountered similar data distributions?
2. If so, are there any programs to analyze underdispersed (or whatever this is) data? DEGseq seems like it may be 'closer' as it assumes a Poisson distribution, but this still isn't the correct distribution. I've also been reading about using GLM for generalized Poisson distributions.
3. Could this all be caused by a suboptimal de novo assembly? Or by RNA degradation prior to sequencing?
Any thoughts would be greatly appreciated! I am currently working to generate a 'best' assembly using multiple k-mer lengths in Oases-Velvet, and to annotate some of the very abundant contigs for biological relevance. Any other suggestions regarding issues related to contamination, assembly, etc.?
Thanks so much from a beginner!!
Celli