View Single Post
Old 03-12-2015, 08:01 AM   #1
Junior Member
Location: Ottawa, Ontario

Join Date: Sep 2014
Posts: 9
Default Questionable diagnostic plots from Cummerbund/DESeq2

Hi everyone,

I'm analyzing some RNA-seq data for a colleague and some of the results/diagnostic plots have raised some caution flags in my mind. I'll include the code I used to avoid going back and forth. Cliffnotes: Getting a large number of differentially expressed genes (which may or may not be normal?), large log2FCs, and diagnostic plots don't quite look typical.

Experimental Design:
RNA-seq before and after inducing differentiation(GNP -> GC). Two biological replicates for each condition.

Alignment & Counts (done for each of the four samples):
tophat -p 8 -G Mus.musculus.NCBIM37.65.gtf -o tophat_dir --no-novel-juncs mm9 GC2_R1.fastq GC2_R2.fastq
samtools sort -n accepted_hits.bam sorted.GC2
htseq-count -f bam sorted.GC2 Mus.musculus.NCBIM37.65.norRNA.gtf > GC2.counts.txt ##annotation lacks rRNA just in case
I then used DESeqDataSetFromHTSeqCount to get the summary and files set up:
sampleName        fileName condition
1  GC2.counts.txt  GC2.counts.txt        GC
2  GC3.counts.txt  GC3.counts.txt        GC
3 GNP2.counts.txt GNP2.counts.txt       GNP
4 GNP3.counts.txt GNP3.counts.txt       GNP

> dds
class: DESeqDataSet 
dim: 37651 4 
assays(3): counts mu cooks
rownames(37651): ENSMUSG00000000001 ENSMUSG00000000003 ... ENSMUSG00000093788
rowData metadata column names(27): baseMean baseVar ... deviance maxCooks
colnames(4): GC2.counts.txt GC3.counts.txt GNP2.counts.txt GNP3.counts.txt
colData names(2): condition sizeFactor
Ran DESeq and checked results:
dds <- DESeq(dds)
res <- results(dds)

> summary(res)

out of 24178 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 5752, 24%  ##~1000 LFC > 2
LFC < 0 (down)   : 5786, 24% ##~1000 LFC < 2
outliers [1]     : 0, 0% 
low counts [2]   : 4687, 19% 
(mean count < 1.7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

##MA plot
plotMA(res, main="DESeq2", ylim=c(-10,10)) ##Large number of significant genes with high mean exp value

##P value dist.
hist(res$pval, breaks=100) ##Very few genes with high p-value (looks similar with padj)

##Dispersion plot
plotDispEsts(dds) ##Not really sure if there's anything strange here

##Per gene standard deviation plots
#Script exactly as presented in DESeq2 vignette: shifted logarithm log2(n + 1) (left), the regularized log transformation(center), and the variance stabilizing transformation (right)
#Some really high SDs

##Euclidian distances
distsRL <- dist(t(assay(rld)))
What's reassuring is that the differentially expressed genes are enriched for relevant GO terms, but the plots are just looking different from others I have seen. Could anyone shed some light on whether there is anything to be alarmed about?

EDIT: Just realized I mentioned cuffdiff/cummeRbund in the title but didn't include anything from it. Volcano plot from there showed results consistent with the MA plot here.

David Cook
MSc. Candidate, University of OTtawa
Attached Files
File Type: pdf maplot.ylim10.pdf (1.20 MB, 12 views)
File Type: pdf pvalDist.pdf (5.1 KB, 7 views)
File Type: pdf dispersion.pdf (3.37 MB, 13 views)
File Type: pdf distplot.pdf (6.5 KB, 8 views)
File Type: pdf perGeneSD.pdf (230.3 KB, 6 views)

Last edited by DPCook; 03-12-2015 at 08:57 AM. Reason: Deceiving title, sorry
DPCook is offline   Reply With Quote