  • DPCook
    Member
    • Sep 2014
    • 10

    Questionable diagnostic plots from cummeRbund/DESeq2

    Hi everyone,

    I'm analyzing some RNA-seq data for a colleague, and some of the results and diagnostic plots have raised caution flags in my mind. I'll include the code I used to avoid going back and forth. In short: I'm getting a large number of differentially expressed genes (which may or may not be normal?), large log2 fold changes, and diagnostic plots that don't look quite typical.

    Experimental Design:
    RNA-seq before and after inducing differentiation (GNP -> GC). Two biological replicates for each condition.

    Alignment & Counts (done for each of the four samples):
    Code:
    tophat -p 8 -G Mus.musculus.NCBIM37.65.gtf -o tophat_dir --no-novel-juncs mm9 GC2_R1.fastq GC2_R2.fastq
    samtools sort -n accepted_hits.bam sorted.GC2
    htseq-count -f bam sorted.GC2.bam Mus.musculus.NCBIM37.65.norRNA.gtf > GC2.counts.txt ##annotation lacks rRNA just in case
    I then used DESeqDataSetFromHTSeqCount to get the summary and files set up:
    Code:
    sampleName        fileName condition
    1  GC2.counts.txt  GC2.counts.txt        GC
    2  GC3.counts.txt  GC3.counts.txt        GC
    3 GNP2.counts.txt GNP2.counts.txt       GNP
    4 GNP3.counts.txt GNP3.counts.txt       GNP
    
    > dds
    class: DESeqDataSet 
    dim: 37651 4 
    exptData(0):
    assays(3): counts mu cooks
    rownames(37651): ENSMUSG00000000001 ENSMUSG00000000003 ... ENSMUSG00000093788
      ENSMUSG00000093789
    rowData metadata column names(27): baseMean baseVar ... deviance maxCooks
    colnames(4): GC2.counts.txt GC3.counts.txt GNP2.counts.txt GNP3.counts.txt
    colData names(2): condition sizeFactor
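For anyone following along, here is a minimal sketch of how a sample table like the one above might be built and passed to DESeqDataSetFromHTSeqCount. The toy count files, the temporary directory, and the 50 made-up gene names are assumptions added only so the sketch runs on its own; real input would be the htseq-count output files.

```r
library(DESeq2)

# Hypothetical toy count files written to a temp directory (assumption:
# two tab-separated columns, gene ID and count, as htseq-count produces).
count_dir <- tempfile("counts_")
dir.create(count_dir)
samples <- c("GC2", "GC3", "GNP2", "GNP3")
genes <- sprintf("gene%03d", 1:50)
set.seed(1)
for (s in samples) {
  counts <- rnbinom(length(genes), mu = 100, size = 5)
  write.table(data.frame(genes, counts),
              file = file.path(count_dir, paste0(s, ".counts.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# First column: sample name, second: file name, remaining columns: covariates.
# The first factor level (GC, alphabetically) becomes the reference level.
sampleTable <- data.frame(
  sampleName = paste0(samples, ".counts.txt"),
  fileName   = paste0(samples, ".counts.txt"),
  condition  = factor(c("GC", "GC", "GNP", "GNP"))
)

dds_toy <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                      directory = count_dir,
                                      design = ~ condition)
dds_toy
```

With the default factor levels the reported log2 fold changes are GNP relative to GC; releveling the condition factor would flip the direction.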
    Ran DESeq and checked results:
    Code:
    dds <- DESeq(dds)
    res <- results(dds)
    
    > summary(res)
    
    out of 24178 with nonzero total read count
    adjusted p-value < 0.1
    LFC > 0 (up)     : 5752, 24%  ##~1000 LFC > 2
    LFC < 0 (down)   : 5786, 24% ##~1000 LFC < -2
    outliers [1]     : 0, 0% 
    low counts [2]   : 4687, 19% 
    (mean count < 1.7)
    [1] see 'cooksCutoff' argument of ?results
    [2] see 'independentFiltering' argument of ?results
    
    ##MA plot
    plotMA(res, main="DESeq2", ylim=c(-10,10)) ##Large number of significant genes with high mean exp value
    
    ##P value dist.
    hist(res$pvalue, breaks=100) ##Very few genes with high p-value (looks similar with padj)
    
    ##Dispersion plot
    plotDispEsts(dds) ##Not really sure if there's anything strange here
    
    ##Per gene standard deviation plots
    #Script exactly as presented in the DESeq2 vignette: shifted logarithm log2(n + 1) (left), the regularized log transformation (center), and the variance stabilizing transformation (right)
    #Some really high SDs
    
    ##Euclidean distances
    rld <- rlog(dds) ##regularized log transformation, as in the vignette
    distsRL <- dist(t(assay(rld)))
    What's reassuring is that the differentially expressed genes are enriched for relevant GO terms, but the plots are just looking different from others I have seen. Could anyone shed some light on whether there is anything to be alarmed about?

    EDIT: Just realized I mentioned cuffdiff/cummeRbund in the title but didn't include anything from it. Volcano plot from there showed results consistent with the MA plot here.

    David Cook
    MSc. Candidate, University of Ottawa
    Last edited by DPCook; 03-12-2015, 08:57 AM. Reason: Deceiving title, sorry
  • Michael Love
    Senior Member
    • Jul 2013
    • 333

    #2
    PCA plot is also nice for seeing sample distances. From the distplot these conditions look very distinct. Why don't you consider testing at a higher threshold than |LFC| > 0, as this seems to be achieved by many genes? See the lfcThreshold argument of ?results. The reasoning is described in the "Specifying minimum effect size" section of the paper.
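As a rough illustration of the lfcThreshold suggestion, the sketch below tests against |LFC| > 1 instead of the default |LFC| > 0. It runs on simulated data from makeExampleDESeqDataSet rather than the counts in this thread, and the threshold of 1 and the alpha of 0.05 are assumptions chosen only for illustration.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for the real data: 1000 genes, 2 vs 2 samples, with
# true log2 fold changes drawn with SD 2, so many genes truly differ.
dds_sim <- makeExampleDESeqDataSet(n = 1000, m = 4, betaSD = 2)
dds_sim <- DESeq(dds_sim)

# Default test: the null hypothesis is LFC == 0.
res_default <- results(dds_sim, alpha = 0.05)

# Thresholded test: the null hypothesis is |LFC| <= 1 (log2 scale).
res_lfc1 <- results(dds_sim, lfcThreshold = 1,
                    altHypothesis = "greaterAbs", alpha = 0.05)

n_default <- sum(res_default$padj < 0.05, na.rm = TRUE)
n_lfc1    <- sum(res_lfc1$padj < 0.05, na.rm = TRUE)
cat("padj < 0.05, LFC != 0 :", n_default, "\n")
cat("padj < 0.05, |LFC| > 1:", n_lfc1, "\n")
```

Note that this is not the same as filtering the default results by log2FoldChange afterwards: the thresholded call tests whether the fold change exceeds the threshold, so the p-values themselves change.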

    • DPCook
      Member
      • Sep 2014
      • 10

      #3
      Thanks for the reply Michael. I adjusted the lfc threshold and it certainly made the list a bit more manageable (LFC > 1.5, FDR=0.05 yielded about 800 DEGs in both directions). I guess I just wasn't expecting such large differences between the two conditions and was concerned that I was missing some obvious artefact that could cause it.

      I also ran the PCA following the regularized log transformation to look at distances. PC1 is apparently capturing 100% of the variance and splits the two conditions. I'm no expert, so correct me if I'm wrong, but I suppose this supports the idea that the results are just the product of very distinct conditions, because the variability between biological replicates is negligible (at least relative to the differences between conditions).
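On the "100% of the variance" point, here is a small base-R sketch of how the percentage per PC is computed from prcomp, which is the kind of calculation behind the axis labels on a PCA plot. The matrix is simulated, not the data from this thread: the gene count, effect size, and replicate noise are all assumptions picked to mimic two very distinct conditions with tight replicates.

```r
set.seed(1)
n_genes <- 1000
baseline <- rnorm(n_genes, mean = 8, sd = 2)   # log-scale expression per gene
effect   <- rnorm(n_genes, mean = 0, sd = 3)   # large GNP-vs-GC differences
condition <- c("GC", "GC", "GNP", "GNP")

# One column per sample: baseline, plus the condition effect for GNP samples,
# plus a small amount of replicate-to-replicate noise.
mat <- sapply(condition, function(cond) {
  baseline + (cond == "GNP") * effect + rnorm(n_genes, sd = 0.1)
})
colnames(mat) <- c("GC2", "GC3", "GNP2", "GNP3")

# PCA on samples (rows) by genes (columns); prcomp centers each gene.
pca <- prcomp(t(mat))
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(pct_var, 2))
```

When the between-condition differences dwarf the replicate noise, PC1 takes nearly everything and the label rounds to 100%. With real data, plotPCA(rld, intgroup="condition", returnData=TRUE) should return the coordinates with a percentVar attribute, so the unrounded values can be checked directly.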

      Thanks!