Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2: Why is my gene not differentially expressed?

    Dear all,

    I have an RNASeq experiment of 12 samples for 2 conditions (6 samples per condition).
    Reads are paired-end, 100 bp.

    I performed differential expression analysis using DESeq2 and was surprised not to find a gene as significantly differentially expressed (it has been reported in an array and another RNASeq experiments studying the same conditions-but different samples).

    Here is the DESeq2 procedure I followed:

    Code:
    library("DESeq2")
    DataFrame=data.frame(Nom,Fichier,HIST)
    
    ################################################
    ###              Model                      ###
    ################################################
    
    dds_raw=DESeqDataSetFromHTSeqCount(DataFrame,"/home/Results/Data/26022016", design= ~HIST)
    str(colData(dds_raw)$HIST)
    
    dds_raw$HIST = relevel(dds_raw$HIST, ref="Cond1")
    dds <- DESeq(dds_raw, minReplicatesForReplace=6)
    
    
    
    #####################################
    ### Differential expression Tests ###
    #####################################
    resMLE <- results(dds, addMLE=TRUE, cooksCutoff=FALSE,alpha=0.05)
    summary(resMLE)
    resMLEOrdered <- resMLE[order(resMLE$padj),]
    
    write.table(resMLEOrdered, file="ResMLETable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")
    write.table(counts(dds,normalized=TRUE), file="NormalizedTable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")
    and the sessionInfo

    Code:
    > sessionInfo()
    R version 3.2.0 (2015-04-16)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 14.04.3 LTS
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
     [1] DESeq2_1.10.1              RcppArmadillo_0.6.500.4.0  Rcpp_0.12.3                SummarizedExperiment_1.0.2 Biobase_2.30.0             GenomicRanges_1.22.3      
     [7] GenomeInfoDb_1.6.1         IRanges_2.4.6              S4Vectors_0.8.7            BiocGenerics_0.16.1       
    
    loaded via a namespace (and not attached):
     [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3           XVector_0.10.0       futile.options_1.0.0 tools_3.2.0          zlibbioc_1.16.0      rpart_4.1-10        
     [9] RSQLite_1.0.0        annotate_1.48.0      gtable_0.1.2         lattice_0.20-33      DBI_0.3.1            gridExtra_2.0.0      genefilter_1.52.1    cluster_2.0.3       
    [17] locfit_1.5-9.1       grid_3.2.0           nnet_7.3-11          AnnotationDbi_1.32.3 XML_3.98-1.3         survival_2.38-3      BiocParallel_1.4.3   foreign_0.8-66      
    [25] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.48.0   ggplot2_2.0.0        lambda.r_1.1.7       Hmisc_3.17-1         scales_0.3.0         splines_3.2.0       
    [33] xtable_1.8-0         colorspace_1.2-6     acepack_1.3-3.3      munsell_0.4.2
    Here are the normalized counts and stats for my gene of interest:
    Code:
    90.7033407596	0	0	2.3672116923	0	0		939.0444914676	404.8903299633	665.5848352517	0	1034.9196508454	438.2486305103
    
    baseMean	log2FoldChange	lfcMLE	lfcSE	stat	pvalue	padj
    297.9798742075	-1.0090565228	-5.2239343815	0.5743666929	-1.7568158725	0.0789491987	0.4351815966
    The estimated log2FC is just above 1 but the adjusted p-value is very high, especially compare to this other gene, which is similar to my gene of interest, but with less magnitude and with low counts:

    Code:
    4.1228791254	0	0.8533500773	0	0	0.7423895021		99.6303301923	89.0758725919	128.7791432768	0	69.6249931217	43.9908663202
    
    baseMean	log2FoldChange	lfcMLE	lfcSE	stat	pvalue	padj
    36.4016520173	-3.353879756	-6.1662095289	0.6214945118	-5.3964752585	0.000000068	0.000052119
    In condition 2, one value is 0, as in condition 1. But it is also the case for the second gene. It seems weird to me that this second gene is differential but not my gene of interest. I am probably missing something here.
    Could somebody explain me what is going on please?

    Thank you in advance for any help,
    Jane

  • #2
    I suspect that had you not disabled outlier removal that you would have gotten the expected results. The lfcMLE variance for the gene of interest is going to be pretty high, which is why there's such a discrepancy between the lfcMLE and the shrunken lfc.

    BTW, if you haven't already done so, do look at a PCA plot of that data. From the two examples you showed I suspect that that will be telling.

    Comment


    • #3
      Thank you dpryan for your answer.

      Originally posted by dpryan View Post
      I suspect that had you not disabled outlier removal that you would have gotten the expected results. The lfcMLE variance for the gene of interest is going to be pretty high, which is why there's such a discrepancy between the lfcMLE and the shrunken lfc.
      There is no outlier to me in the 12 expression values of my gene of interest. The 12 values are in the same range, with one "outsider" in the second group - in its group. I guess this value has no chance to be flagged as outlier, since the outlier detection is based on all the samples, if I am right.

      I reran the analysis without disabling outlier removal:

      library("DESeq2")
      DataFrame=data.frame(Nom,Fichier,HIST)

      ###############################################
      ### Model ###
      ###############################################
      dds_raw=DESeqDataSetFromHTSeqCount(DataFrame,"/home/Results/Data/26022016", design= ~HIST)
      str(colData(dds_raw)$HIST)

      dds_raw$HIST = relevel(dds_raw$HIST, ref="Cond1")
      dds <- DESeq(dds_raw)


      #####################################
      ### Differential expression Tests ###
      #####################################
      resMLE <- results(dds, addMLE=TRUE, cooksCutoff=TRUE,alpha=0.05)
      summary(resMLE)
      resMLEOrdered <- resMLE[order(resMLE$padj),]

      write.table(resMLEOrdered, file="ResMLETable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")
      write.table(counts(dds,normalized=TRUE), file="NormalizedTable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")
      I got the exact same statistics for both genes.


      Originally posted by dpryan View Post
      BTW, if you haven't already done so, do look at a PCA plot of that data. From the two examples you showed I suspect that that will be telling.
      I performed PCA on my 12 samples based on the 2000 genes showing highest variablity, as performed by default wih plotPCA(). I am not sure what you meant: PCA on this lonely gene?

      Another suggestion?

      Comment


      • #4
        No, not PCA on a single gene, that wouldn't make any sense. The results of plotPCA on 2000 genes should suffice.

        Your 10th sample appears to be an outlier in both examples you posted, presumably the default cutoff isn't catching it (I would expect this to be group-based). Does the 10th sample cluster with the others correctly?

        Comment


        • #5
          Originally posted by dpryan View Post
          Does the 10th sample cluster with the others correctly?
          On the default plotPCA(), this 10th sample is indeed the only outlier among its group of 6 samples. In the second condition, the 6 samples show more variance.

          Comment


          • #6
            OK, so if you remove that one do you get the expected results? It's then likely that the defaults for outlier removal just aren't working ideally for you.

            Comment


            • #7
              Originally posted by dpryan View Post
              OK, so if you remove that one do you get the expected results? It's then likely that the defaults for outlier removal just aren't working ideally for you.
              Thank you for your help dpryan.

              After removing this sample (that is 6 vs 5 samples):
              Code:
              dds <- DESeq(dds_raw)
              resMLE <- results(dds, addMLE=TRUE, cooksCutoff=FALSE,alpha=0.05)
              I got the following normalized counts and stats:
              91.6129056374 0 0 2.3943405175 0 0 949.8886211299 409.1870751597 674.1357056464 1044.6794670421 443.478082265

              baseMean log2FoldChange lfcMLE lfcSE stat pvalue padj
              328.6705633998 -1.1765941557 -5.4877708718 0.5332537169 -2.2064434215 0.0273529676 0.2734306821
              And using:
              Code:
              dds <- DESeq(dds_raw, minReplicatesForReplace=5)
              resMLE <- results(dds, addMLE=TRUE, cooksCutoff=TRUE,alpha=0.05)
              I got very similar statistics. From that, I am not surprised because to me, there is no outlier in this gene.
              Nevertheless, the gene is still not significant! I am very surprised

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                04-22-2024, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 08:47 AM
              0 responses
              14 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              60 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              60 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              54 views
              0 likes
              Last Post seqadmin  
              Working...
              X