Seqanswers Leaderboard Ad

**dpryan** · 05-15-2014, 09:52 AM

I don't see anything obviously wrong at least. The fold change of a covariate is the change per unit increase/decrease.

**sindrle** · 05-15-2014, 09:58 AM

Thats great.

The reason Im asking is because Im thinking of using cross-validation:

http://www.bioconductor.org/packages/release/bioc/vignettes/MLSeq/inst/doc/MLSeq.pdf

But thats an area Im again is unfamiliar with..

**Michael Love** · 05-16-2014, 07:10 AM

What have I actually answered and how to interpret the fold change?

In the DESeq2 vignette, we have:

If the variable of interest is continuous-valued, then the reported log2 fold change is per unit of change of that variable.

When you test with a continuous covariate, x, you are supposing the that counts follow the following formula:

counts proportional to 2^(a * x)

Where a is the log2FoldChange.

**sindrle** · 05-22-2014, 08:51 AM

Hi again! I have now tried the same approach, but included one more time point (paired data). This is my design (I did not copy in the "patients" term, since its big):

des <- model.matrix(~patients+timepoints+dGIR)

> des
(Intercept) timepoints2 dGIR
1 1 0 4.401988
2 1 0 4.959263
3 1 0 4.386282
4 1 0 1.530456
5 1 0 1.806429
6 1 0 4.335553
7 1 0 4.206755
8 1 0 5.913078
9 1 0 2.427184
10 1 0 4.695216
11 1 0 8.343508
12 1 0 3.179318
13 1 0 4.197272
14 1 1 7.033431
15 1 1 4.395280
16 1 1 4.867971
17 1 1 4.651163
18 1 1 3.273684
19 1 1 4.918936
20 1 1 4.308176
21 1 1 7.803321
22 1 1 4.884005
23 1 1 5.275498
24 1 1 9.759326
25 1 1 4.972376
26 1 1 4.102564
27 1 0 6.592866
28 1 0 8.607653
29 1 0 5.313869
30 1 0 6.416293
31 1 0 6.021505
32 1 0 5.291005
33 1 0 8.586633
34 1 0 9.803922
35 1 0 8.777473
36 1 0 8.312925
37 1 0 9.480470
38 1 0 8.782435
39 1 0 6.975749
40 1 1 9.494048
41 1 1 8.817365
42 1 1 6.958610
43 1 1 8.303265
44 1 1 10.467523
45 1 1 14.250726
46 1 1 11.707989
47 1 1 10.816993
48 1 1 10.332435
49 1 1 12.841037
50 1 1 14.824017
51 1 1 6.165730
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$timepoints
[1] "contr.treatment"

So, do I now get the genes that is related the change in GIR?

**Michael Love** · 05-22-2014, 01:10 PM

Yes.

I am writing more words so that i can post this post.

**sindrle** · 05-24-2014, 03:00 AM

Thank you for your time, if you don't mind I have two more questions:

1. Why does the result from only one time point not overlap at all with the test with two time points?

2. Why does the CPM values for a gene from the results not correlate with GIR using Spearman/Pearson correlation?

**Michael Love** · 05-25-2014, 07:01 AM

I recommend that you check more examples by eye, by plotting log2(normalized counts) over the continuous variable. Note that you need to split by group. The test is of linearity of log scale counts with your continuous variable, with the same slope in the two groups.

Here's a simple example with an artificial count matrix showing that the genes with smallest p-value for a test of the continuous variable 'x' are those the genes with high correlation of log2(norm. counts + 1) and that continuous variable 'x'.

Code:

> set.seed(1)
> m <- matrix(round(runif(1000,1,100)), ncol=10)
> dds <- DESeqDataSetFromMatrix(m, DataFrame(x=1:10), ~x)
> dds <- DESeq(dds, fitType="mean") # fitType=mean specified bc counts are very artificial
> res <- results(dds)
> order(res$pvalue)[1:5]
[1] 76 65 77 94 86
> order(-abs(cor(t(log2(counts(dds,normalized=TRUE) + 1)), dds$x)))[1:5]
[1]  1 65 76 77 94

**sindrle** · 05-26-2014, 12:36 AM

Thank you very much, most helpful.

Checking the test with 1 time point gave exactly what you said, but I did not check p-values of the correlations.

When testing with two time points I got only 15/20 when based on p-value of Pearson correlation:

#Getting counts
m <- log2(counts(dds,normalized=TRUE) + 1)

#Keeping only FDR < 10%
mt <- m[order(res$pvalue)[1:20],]

#Adding GIR values
mgir <- cbind(t(mt),GIR)

#Making delta values (time point 2 - time points 1)
delta <- (mgir[c(14:26, 39:50),]-(mgir[c(1:13, 27:38),]))

#Checking correlations, returns a matrix with gene vs GIR, rho and p-value
cor <- flattenSquareMatrix(cor.prob(delta))

#Of the 20 genes with FDR < 0.1, these have p < 0.05 running Pearson:
sum(cor[191:210,]$p < 0.05)
#[1] 15

That maybe ok, sad thing is the non existing overlap between the test with 1 time point and the test with 2 time points.

Im thinking of not testing log2(counts)+1 linearity against GIR, but rather CPM/FPKM vs GIR. Would maybe make more sense in regards to normalisation and linearity. Agree?

**Michael Love** · 05-26-2014, 05:22 AM

hi,

Just to be clear, I am not saying the negative binomial test is the same as Pearson correlation, or a test of Pearson correlation. I think you might be missing what is being tested for with DESeq2 with the multi-factor design. If you have ~ group + x, where x is continuous, you are looking for genes which have consistent slope in both groups. The naive correlation-based approach would be to subtract, for each group, the mean value of the log normalized counts. Then to look at plot(x, log-normalized-counts-centered-per-group). But you can also just look at plot(x, log-normalized-counts) and color the points by group. You should be able to see by eye what makes a gene get a low p-value.

Here is an example with Normal data:

Code:

> x <- c(1:10, 1:10)
> y <- rnorm(20, c(1:10, -(1:10)), 1)
> g <- rep(1:2,each=10)
> summary(lm(y ~ x)) # not small p-value for both groups
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.5693     3.0898   0.184    0.856
x            -0.1494     0.4980  -0.300    0.768

> summary(lm(y[1:10] ~ x[1:10])) # small p-value for one group
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.39201    0.61561   0.637    0.542    
x[1:10]      0.84588    0.09921   8.526 2.75e-05 ***

> plot(x,y,col=g) # this explains why

There is not a way to perform analysis of counts ~ variables with DESeq2. There are a couple issues with linear models on raw counts but here is an important one: suppose you observe data of (x, count) like the following: (2,100), (3,50). Now you want to predict what the count will be for x=5. The linear model prediction would be a count of -50. Performing analysis with the log link assures that expected counts based on the model are always non-negative.

**Michael Love** · 05-26-2014, 06:04 AM

One addition, if you use the following transformed variable:

log2x = log2(x)

Then you will get the following relation at the level of raw counts:

log2(counts) ~ a log2x
counts ~ 2^(a log2(x))
counts ~ x^a

So then the coefficient from DESeq2 for log2x would be the exponent for x on the raw count scale.

**sindrle** · 05-27-2014, 06:50 AM

Thank you very much for all your help. For many reasons I started over using FeatureCount.
But when I try to run DESeq2 now, I get an error:

DESeq2 error: inv(): matrix appears to be singular

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] edgeR_3.6.2 limma_3.20.4 DESeq2_1.4.5 RcppArmadillo_0.4.300.0 Rcpp_0.11.1
[6] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.7 BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] annotate_1.42.0 AnnotationDbi_1.26.0 Biobase_2.24.0 DBI_0.2-7 genefilter_1.46.1
[6] geneplotter_1.42.0 grid_3.1.0 lattice_0.20-29 locfit_1.5-9.1 RColorBrewer_1.0-5
[11] RSQLite_0.11.4 splines_3.1.0 stats4_3.1.0 survival_2.37-7 tools_3.1.0
[16] XML_3.98-1.1 xtable_1.7-3 XVector_0.4.0

It should be the same analysis with one time point as I showed earlier (not included continuous variable yet):

> dds
class: DESeqDataSet
dim: 23710 26
exptData(0):
assays(1): counts
rownames(23710): DDX11L1 WASH7P ... GOLGA2P3Y GOLGA2P2Y
rowData metadata column names(0):
colnames(26): D102-A1 D104-A1 ... N187-A1 N188-A1
colData names(4): Targets Timepoint Group Patients

> as.data.frame( colData(dds) )

Timepoint Group Patients
1 pT2D 2
1 pT2D 4
1 pT2D 17
1 pT2D 21
1 pT2D 53
1 pT2D 55
1 pT2D 61
1 pT2D 62
1 pT2D 67
1 pT2D 73
1 pT2D 76
1 pT2D 77
1 pT2D 79
1 Control 1
1 Control 8
1 Control 13
1 Control 70
1 Control 71
1 Control 72
1 Control 75
1 Control 81
1 Control 82
1 Control 83
1 Control 86
1 Control 87
1 Control 88

Sorry for begin "off-topic", but I can't continue with your suggestions when getting this error.

Thanks!

**Michael Love** · 05-27-2014, 06:55 AM

[new post below]

**Michael Love** · 05-27-2014, 07:13 AM

sorry, I just saw your note that you didn't include the continuous covariate. Can you please include all code which produced this error?

If you are only analyzing one timepoint and not including the covariate, then your design is just:

~ patients ?

Fitting this model is problematic, because patients is not a grouping factor. It appears there are no repeated patients in this dataset. It should have prompted an error about not having any replicates. Is this a factor or an integer variable in your colData?

I would recommend a design of ~ timepoint + GIR, and leave out the patient term.

**sindrle** · 05-27-2014, 07:40 AM

Originally posted by Michael Love View Post

~ patients ?

Haha!
Yes, I was trying to sort everything nicely, since Im going to try a lot of things later. But of course I made a stupid mistake.

It works now, re-doing the analysis with one time point:

ddsA1GIR <- DESeqDataSetFromMatrix(
countData = assay(dds),
colData = colData(dds),
design = ~GIR)

ddsA1GIR <- DESeq(ddsA1GIR, test="Wald")

I will continue this test later.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 17 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 46 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

DESeq2: Continuos variable in the design

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News