SEQanswers

Old 05-02-2012, 06:41 PM   #1
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
DESeq not giving any differentially expressed genes in RNA-Seq

Hello All (and hopefully Simon Anders),

I've been using DESeq for testing for differential expression between samples. It's always worked great, but for some reason, I don't think it's working correctly for this one data set I have right now. Here's what I'm doing.

I have three biological samples (CD90, UtESC-, and UtESC+) with three replicates each. I align these to the mouse genome using TopHat, allowing only unique hits. I then use HTSeq with the Ensembl GTF file to build my gene count matrix. I then run this matrix through DESeq to find the size factors, variances, and p-values for each pairwise comparison: CD90 to UtESC-, CD90 to UtESC+, and UtESC- to UtESC+.

Here are the size factors (they vary more than most of my data sets, but I don't think it explains what is happening here):

> sizeFactors(cds)
CD901 CD902 CD903 UtESC.1 UtESC.2 UtESC.3 UtESC.1.1 UtESC.2.1
0.9780371 1.3965205 1.3044467 1.7915295 0.3420241 0.6416566 0.5929506 2.4919984
UtESC.3.1
1.0039902

Here is the plot of variance:

[dispersion plot image]

Seems okay so far. However, when I calculate the p-values and fold changes, I get very strange results. The fold changes seem wrong (they don't match the RPKM values I calculated), and the last comparison (UtESC+ vs. UtESC-) gives almost exclusively adjusted p-values of 1, with none below 0.05; even the unadjusted p-values include only about 200 below 0.01. I've asked around, searched SEQanswers, and looked over my code many times, but I can't find what I'm doing wrong (if anything). Does anybody have any troubleshooting suggestions? For reference, here is my code:

library("DESeq")
count_table <- read.table("UtESC_gene_array.txt", header=TRUE, row.names=1)
design <- data.frame(
  row.names = colnames(count_table),
  condition = c("CD901", "CD902", "CD903", "UtESC-1", "UtESC-2", "UtESC-3",
                "UtESC+1", "UtESC+2", "UtESC+3"),
  libType = rep("single-end", 9)
)
conds <- factor(c("CD", "CD", "CD", "UM", "UM", "UM", "UP", "UP", "UP"))
cds <- newCountDataSet(count_table, conds)
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds, fitType="local")
res <- cbind(
  nbinomTest(cds, "CD", "UM")[, c(1, 6, 8)],
  nbinomTest(cds, "CD", "UP")[, c(6, 8)],
  nbinomTest(cds, "UM", "UP")[, c(6, 8)]
)
write.table(res, file="UtESC_take2_pvals.txt", append=FALSE, quote=FALSE, sep="\t")

If you would like a copy of the data for yourself to try, you can grab it here: http://pellegrini.mcdb.ucla.edu/Artur/UtESC_gene_array.txt

Thank you all in advance for your help!

Best,
Artur

Last edited by Artur Jaroszewicz; 05-03-2012 at 12:00 PM.
Old 05-03-2012, 12:01 AM   #2
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 436
What do your MA plots look like? Those would reveal whether the normalization of library sizes went wonky.
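For readers wanting to try this, here is a minimal MA-plot sketch in base R. The toy matrix stands in for two columns of normalized counts (e.g. from counts(cds, normalized=TRUE) in DESeq); the object names are illustrative:

```r
# Toy stand-in for a two-column matrix of normalized counts (one column per sample).
set.seed(1)
nc <- matrix(rpois(2000, lambda = 50), ncol = 2)

# A = average log2 intensity, M = log2 ratio between the two samples.
A <- rowMeans(log2(nc + 1))
M <- log2(nc[, 2] + 1) - log2(nc[, 1] + 1)

plot(A, M, pch = ".", xlab = "A (mean log2 count)", ylab = "M (log2 ratio)")
abline(h = 0, col = "red")  # bulk of genes should center on zero if normalization worked
```

If the cloud of points is visibly offset from M = 0, the size-factor normalization is suspect.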

Sometimes when I think the output of a tool looks weird I'll try another tool. Give edgeR a shot and see what the results look like.
Old 05-03-2012, 12:16 PM   #3
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
It seems to look pretty good:

[MA plot image]

I'll give EdgeR a shot. Unfortunately, I have my entire pipeline set up with DESeq, so I'd rather keep it consistent across analyses.
Old 05-03-2012, 02:43 PM   #4
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Dear Artur,

for sure, "unfortunately" is not the right sentiment here!
Anyway, can you produce MA plots like the one you posted above for every pair of samples (i.e. for 9 choose 2 = 36 pairs)? The R function
Code:
 pairs(log2(exprs(cds)), pch=".")
already does most of that. If you are comfortable with R, you can add a function as the panel argument to do the 45-degree rotation and add the line at y=0.

To further explore data quality (and perhaps identify samples that went wrong, and are destroying your power to detect differential expression), try

Code:
library("arrayQualityMetrics")
vst = varianceStabilizingTransformation(cds)
arrayQualityMetrics(vst, intgroup="condition")
Please also try method = "per-condition" in the call to estimateDispersions, this could be less conservative if one of the conditions is (for some reason) a lot more noisy than the others.
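As a sketch, this suggestion amounts to replacing the estimateDispersions call from the first post with the following (fitType kept as before; illustrative, not tested on this dataset):

```r
# Estimate a separate dispersion-mean fit per condition instead of one pooled
# fit; can be less conservative when one condition is much noisier than the rest.
cds <- estimateDispersions(cds, method = "per-condition", fitType = "local")
```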

PS The URL http://pelegrini.mcdb.ucla.edu/Artur...gene_array.txt did unfortunately not work when I tried ("Browsername can't find the server at pelegrini.mcdb.ucla.edu"). Could it be within your intranet?
__________________
Wolfgang Huber
EMBL
Old 05-03-2012, 03:00 PM   #5
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 436
I thought I'd try those snippets myself and I get some errors. Here's my session info:

Code:
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
I do not have the function "varianceStabilizingTransformation" and "exprs(cds)" returns an error:

> exprs(cds)
Error in function (classes, fdef, mtable) :
unable to find an inherited method for function "exprs", for signature "CountDataSet"
Old 05-03-2012, 10:05 PM   #6
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
You could try SAMseq and limma voom()

Both are capable of handling multi-group designs.

SAMseq (from the samr R package):
Code:
library(samr)

x <- your_count_matrix
y <- c(rep(1, 3), rep(2, 3), rep(3, 3))
samfit <- SAMseq(x, y, resp.type = "Multiclass")
limma R package:
Code:
library(edgeR)   # for calcNormFactors
library(limma)

Counts <- your_count_matrix
nf <- calcNormFactors(Counts)

f <- factor(targets$Target, levels = c("RNA1", "RNA2", "RNA3"))
design <- model.matrix(~0 + f)
colnames(design) <- c("RNA1", "RNA2", "RNA3")
y <- voom(Counts, design, plot = TRUE, lib.size = colSums(Counts) * nf)

# pair-wise comparisons between the three groups
fit <- lmFit(y, design)
contrast.matrix <- makeContrasts(RNA2-RNA1, RNA3-RNA2, RNA3-RNA1, levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

# e.g. RNA2 versus RNA1
topTable(fit2, coef = 1, adjust = "BH")
Old 05-04-2012, 12:05 AM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Judging from your dispersion plot, you have many genes with really high variance. Remember that a dispersion of 0.1 means that the gene typically varies by sqrt(0.1) ≈ 30% between replicates, and then, to get significance, the change between conditions has to be much larger.

Take a few of these genes with strong log2FoldChange and look at the normalized counts. Do you really see something, or is the variation within groups as strong as between groups?

(You know, it is entirely possible, of course, that DESeq did not work well on your data, but it is as likely that it is simply right, and the signal-to-noise ratio of your experiment is too bad to allow inference, i.e., that you simply don't have anything.)
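Simon's rule of thumb, spelled out as a toy calculation (not part of the thread's analysis): the dispersion is roughly the squared coefficient of variation between replicates.

```r
alpha <- 0.1        # dispersion estimate for a gene
cv <- sqrt(alpha)   # coefficient of variation between replicates, ~0.316

# A gene averaging 100 normalized counts would then typically fluctuate
# by about 100 * cv, i.e. roughly 32 counts, between replicates; a condition
# effect must be considerably larger than that to reach significance.
round(100 * cv)
```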
Old 05-04-2012, 05:44 AM   #8
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 308
I am not an expert here, so take my word with a grain of salt, but I am currently looking at some cell lines that were sorted based on their expression of a single gene. There is a lot of variance between the cell lines, but there are also real differentially expressed genes between the two groups.

from the DESeq vignette
Quote:
As we estimated the dispersion from only two samples, we should expect the estimates to scatter with quite some sampling variance around their true values. Hence, DESeq should not use the per-gene estimates directly in the test, because using too low dispersion values leads to false positives. Many of the values below the red line are likely to be underestimates of the true dispersions, and hence, it is prudent to instead use the fitted value. On the other hand, not all of the values above the red line are overestimates, and hence, the conservative choice is to keep them instead of replacing them with their fitted values. If you do not like this default behaviour, you can change it with the option sharingMode of estimateDispersions. Note that DESeq originally (as described in [1]) only used the fitted values (sharingMode="fit-only"). The current default (sharingMode="maximum") is more conservative.
I was getting no differentially expressed genes, but when I changed to sharingMode="fit-only" I got some. We have 8 biological replicates for each condition, so I think this is a reasonable thing to do. My guess is that the whole analysis was being thrown off by the high variance of a lot of other genes. The risk is that the less stringent variance calculation is giving us false positives.
__________________
--------------
Ethan
Old 05-04-2012, 06:00 AM   #9
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Quote:
Originally Posted by ETHANol View Post
I was getting no differentially expressed genes but when I changed to sharingMode="fit-only" I got some. We have 8 biological replicates for each condition so I think this is a reasonable thing to do.
No, quite the opposite! If you have 8 replicates per condition, you have enough power to get a reliable variance estimate for each gene without having to rely on other genes. So, switch to sharingMode="gene-est-only", and then high-variance genes will not reduce your power for any of the other genes.
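A sketch of the call Simon describes (assuming cds is a CountDataSet with size factors already estimated; illustrative only):

```r
# With ~8 replicates per condition, the per-gene dispersion estimates are
# stable enough to use directly; high-variance genes then no longer cost
# power for the other genes.
cds <- estimateDispersions(cds, sharingMode = "gene-est-only")
```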
Old 05-04-2012, 06:19 AM   #10
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 308
Thanks, Simon, for the correction; that's what I was trying to say, but my cluelessness resulted in what I wrote.
__________________
--------------
Ethan
Old 05-04-2012, 09:15 AM   #11
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Ok, that's good. Now, you have 14 degrees of freedom for a pooled dispersion estimate (16 samples minus 2 conditions); Artur has only 6 DoF (9 samples minus 3 conditions). This is not quite enough to use sharingMode="gene-est-only", I would say. I'm afraid his problem really is that his signal-to-noise ratio is too bad. Artur, you may need to tell us more about your experiment.
Old 05-05-2012, 09:10 AM   #12
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Quote:
Originally Posted by sdriscoll View Post
I thought I'd try those snippets myself and I get some errors. I do not have the function "varianceStabilizingTransformation" and "exprs(cds)" returns an error:
Dear Sdriscoll

thank you for checking and reporting these errors. My apologies. That comes from sending untested code snippets late at night. Here's an example that works:

Code:
library("arrayQualityMetrics")
library("DESeq")
library("pasilla")

data("pasillaGenes")

pairs(log2(counts(pasillaGenes)+1),
  panel=function(x, y, ...) {
    points((x+y)/2, y-x, pch=".")
    abline(h=0, col="blue")
}, xlim=c(0, 16), ylim=c(-5,5))

cds = estimateDispersions(estimateSizeFactors(pasillaGenes), method="blind")
vst = DESeq:::varianceStabilizingTransformation(cds)
arrayQualityMetrics(vst, intgroup="condition",
    reporttitle="Quality metrics report for pasillaGenes", force=TRUE)



> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin10.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pasilla_0.2.11             DEXSeq_1.2.0              
[3] DESeq_1.9.5                locfit_1.5-8              
[5] Biobase_2.16.0             BiocGenerics_0.2.0        
[7] arrayQualityMetrics_3.12.0 fortunes_1.5-0            

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.18.0  BeadDataPackR_1.8.0   BiocInstaller_1.4.4  
 [4] Biostrings_2.24.1     Cairo_1.5-1           DBI_0.2-5            
 [7] Hmisc_3.9-3           IRanges_1.14.2        KernSmooth_2.23-7    
[10] RColorBrewer_1.0-5    RCurl_1.91-1          RSQLite_0.11.1       
[13] SVGAnnotation_0.9-0   XML_3.9-4             affy_1.34.0          
[16] affyPLM_1.32.0        affyio_1.24.0         annotate_1.34.0      
[19] beadarray_2.6.0       biomaRt_2.12.0        cluster_1.14.2       
[22] colorspace_1.1-1      genefilter_1.38.0     geneplotter_1.34.0   
[25] grid_2.15.0           hwriter_1.3           lattice_0.20-6       
[28] latticeExtra_0.6-19   limma_3.12.0          plyr_1.7.1           
[31] preprocessCore_1.18.0 reshape2_1.2.1        setRNG_2011.11-2     
[34] splines_2.15.0        statmod_1.4.14        stats4_2.15.0        
[37] stringr_0.6           survival_2.36-14      tools_2.15.0         
[40] vsn_3.24.0            xtable_1.7-0          zlibbioc_1.2.0
You will need DESeq>=1.9.4 for this (i.e. a version more recent than the last release of Bioconductor from April 2012). If you are using R 2.15, and are not normally subscribing to the devel branch of Bioconductor, then the best way to install DESeq is via

Code:
source("http://bioconductor.org/biocLite.R")
useDevel(TRUE)
biocLite("DESeq")
useDevel(old)
The above-mentioned prefix DESeq::: in the call to varianceStabilizingTransformation is only needed for DESeq <=1.9.5; in more recent versions the function is properly exported.

I hope this helps, best wishes
Wolfgang
__________________
Wolfgang Huber
EMBL
Old 05-06-2012, 09:28 AM   #13
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 436
Thanks for the explanation. Late-night code snippets... that's funny, and it happens to everyone who writes code.
Old 05-06-2012, 09:49 AM   #14
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 308
Delivery of code snippets after midnight on SEQanswers. You deserve an award for that, or at least to have your next grant approved.
__________________
--------------
Ethan
Old 05-07-2012, 01:49 PM   #15
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Hi everyone,

Thank you so much for your input so far. Simon, I am beginning to think that the replicates may indeed be too noisy to infer anything. Still, it's strange.

I have a question for you about the way you determine fold change. I see that you estimate the total abundance of a given gene by dividing the raw reads by the size factors, then take the unweighted average of the replicates to obtain a mean level for each sample. This is the only way to do the differential testing, from what I understand. However, it leads to some results that I'm not sure I fully grasp or agree with. You calculate fold change, etc., from these numbers. Wouldn't it be more accurate to take the weighted mean of these numbers? In my analysis, it has caused some problems. I calculate RPKM by taking the total counts per gene for each sample and dividing by the total number of reads in the sample (and the gene length). This gives replicates with more reads more weight, which I think should be the case. Just FYI, my size factors here are:
CD901 0.9780371
CD902 1.3965205
CD903 1.3044467
UtESC.1 1.7915295
UtESC.2 0.3420241
UtESC.3 0.6416566
UtESC.1.1 0.5929506
UtESC.2.1 2.4919984
UtESC.3.1 1.0039902
So in the last sample, replicates 1 and 2 have a five-fold size difference, yet the same weight in the differential test. Because of this discrepancy, I get strange artifacts in certain differentially expressed genes which have sample 1 higher than sample 2 in one method of estimation and lower in the other method. Do you have any explanations or suggestions for what I should do differently?

Wolfgang, I realized that I had the wrong link. I edited it, and now it should work. And you're right, I should feel very fortunate for having DESeq in my pipeline =]

Here's some background information sent from my colleague:

We study the mouse reproductive system, and have identified a signature for adult mouse endometrial epithelial cells. We have collected FACS sorted fractions of the epithelial stem cell fraction (UtESC+), the epithelial non-stem cell fraction (UtESC-), and the stromal fraction (CD90).

I re-ran the test using
> cds <- estimateDispersions( cds, method="per-condition" )
and looked at the p-values again. There are only 13 genes of ~30,000 with adjusted p-values of less than 0.01, and a more reasonable 309 genes with non-adjusted p-values of less than 0.01. These are much lower numbers than I usually get, but it may be the only thing I can do with this data. Do you agree?

Thank you for your continued help,
Artur
Old 05-07-2012, 02:07 PM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
1. Actually, libraries with higher size factors are given more weight in the test (i.e., when calculating the p-value). This is because we compare the sums of the unnormalized counts with what one should expect according to the size factors. (For details, please see the fine print in our paper.) For the fold change, we simply calculate the ratio of the averages of the normalized counts, which is straightforward. However, you may have a point that it would be more consistent to weight the sum according to size factors or fitted variances.

2. I don't think that this is the cause of the artifacts you see. However, as you are worried about the size factors, you may want to double check that they are good estimates. Make an MA plot, i.e. a log-log plot of means versus ratios of the normalized counts between all pairs of samples and check whether the bulk of the genes is centered around zero log fold change.

3. Controlling false discovery rate at 0.01 sounds extremely stringent to me. Remember that controlling FDR at x% means that your hit list can be expected to have at most x% false positives. It is common to cut adjusted p values at 5% or 10% because this is quite a reasonable FDR that one can usually well live with.
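To illustrate the FDR adjustment Simon is describing, here is a toy Benjamini-Hochberg example in R (made-up p-values, not from this dataset):

```r
# Five toy p-values, already sorted for readability.
pvals <- c(0.001, 0.004, 0.03, 0.2, 0.6)

# Benjamini-Hochberg adjustment: adjusted values are 0.005, 0.010, 0.050, 0.250, 0.600.
padj <- p.adjust(pvals, method = "BH")

sum(padj < 0.10)  # 3 genes survive a 10% FDR cutoff
```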
Old 05-07-2012, 03:12 PM   #17
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45

Excellent. Thank you so much for your input. I will calculate my own expression values and fold changes from now on. One of my earlier posts in this thread was an MA plot; you can see it above. I think it looks pretty good and centered around zero.

As far as the adjusted p-values go, I will start using higher values, and filtering with both adjusted and non-adjusted p-values. Sadly, even with a threshold of 10% for padj, I still get only 50 genes. This is, I'm sure, attributable to the variation in the samples.

Thank you again for your help!

Artur
Old 05-07-2012, 11:37 PM   #18
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Quote:
Originally Posted by Artur Jaroszewicz View Post
Excellent. Thank you so much for your input. I will calculate my own expression values and fold changes from now on. One of my earlier posts in this thread was an MA plot; you can see it above. I think it looks pretty good and centered around zero.
Yes, but it is an MA plot of averages across replicates. I suggested you do the same but comparing only two individual samples at a time (both of the same and of different conditions).

Quote:
As far as the adjusted p-values go, I will start using higher values, and filtering with both adjusted and non-adjusted p-values.
You should never use non-adjusted p-values for filtering. Choose an FDR and cut the adjusted p-values there.

Quote:
Sadly, even with a threshold of 10% for padj, I still get only 50 genes. This is, I'm sure, attributable to the variation in the samples.
Well, you cannot always have great data.

Two more things to check: try using only six of your nine samples. Maybe only one of the three FACS-sorted cell types has high variance, and you can still get good results for a comparison of the other two.

Second: Your data is not grouped, is it? If, say, samples CD90.1, UtESC-.1, and UtESC+.1 are from one mouse, CD90.2, UtESC-.2 and UtESC+.2 from a second mouse etc., a GLM will give you a lot of extra power.
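If the pairing Simon hypothesizes holds, the GLM route might look like the following sketch using DESeq's old GLM interface (the mouse labels are an assumption for illustration; adjust them to the real pairing):

```r
# Hypothetical paired design: mouse i contributes one sample of each cell type.
pData(cds)$mouse <- factor(rep(1:3, times = 3))  # assumed pairing, not from the thread

# Full model with the condition effect vs. reduced model with the pairing only.
fit1 <- fitNbinomGLMs(cds, count ~ mouse + condition)
fit0 <- fitNbinomGLMs(cds, count ~ mouse)

# Likelihood-ratio test for the condition effect, then BH adjustment.
pvals <- nbinomGLMTest(fit1, fit0)
padj  <- p.adjust(pvals, method = "BH")
```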
Old 06-29-2012, 05:35 PM   #19
Him26
Member
 
Location: Seoul, Kr

Join Date: Aug 2011
Posts: 20
Dispersion plot

Hi,

I am a complete newbie with regard to statistics. I've been following the SEQanswers forum and published papers on how to go about analyzing the data.
I have two normal vs. two drug-treated samples that we prepped for RNA-seq. After running the data through TopHat and HTSeq to get the read counts, I ran DESeq, but I don't see a difference between the two conditions in the DESeq results. When I look at the dispersion plot, I get very weird plots.
I can tell something is wrong, but I don't know what this plot means well enough to tell where the problem is coming from. Can anyone help?
Thank you.

Him26
Attached Images
File Type: jpeg Rplot.jpeg (61.9 KB, 140 views)
File Type: jpeg Rplot01.jpeg (62.5 KB, 111 views)
Old 06-30-2012, 02:37 PM   #20
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Hi,
Welcome. I am also somewhat of a newbie (I've been in bioinformatics since last September), but maybe I can help. How are you estimating your dispersions? Do you have replicates? Your plots do look a little weird. Are these log plots?
Artur

Tags
deseq, p values, rna seq, size factors
