Seqanswers Leaderboard Ad

**Wolfgang Huber** · 11-23-2012, 01:43 PM

lmolokin,

Can you make sure you are using the most recent (release) version?
If so, what is the output of sessionInfo()?

Best wishes
Wolfgang

**lmolokin** · 12-13-2012, 07:49 AM

Wolfgang, thank you for your reply!

Here's the output:

R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq_1.10.1 lattice_0.20-10 locfit_1.5-8 Biobase_2.18.0 BiocGenerics_0.4.0

loaded via a namespace (and not attached):
[1] annotate_1.36.0 AnnotationDbi_1.20.2 DBI_0.2-5 genefilter_1.40.0
[5] geneplotter_1.36.0 grid_2.15.2 IRanges_1.16.4 MASS_7.3-22
[9] parallel_2.15.2 RColorBrewer_1.0-5 RSQLite_0.11.2 splines_2.15.2
[13] stats4_2.15.2 survival_2.36-14 tools_2.15.2 XML_3.95-0.1
[17] xtable_1.7-0

**Alex.Manuel** · 12-14-2012, 03:20 PM

Hi Community,

I'm sorry to disturb this thread, but I think my question fits to the topic...
Though I'm a rookie in informatics and thus in R, I have not the knowledge in scripting and can't take all its possibilities into advantage.
Does anyone know, if its possible to do something like a multivariate analysis, that take several factors (in my case gender=male/female, Parasite=n/a /allopatric/sympatric, Temperature=13°C/24°C, Infection_status=ctrl/not_Inf/Inf) that counts for every individual with a differing number of biol. replicates for each treatment into account?
So when I compare two specific treatments e.g. male_Inf_allo_13 vs male_Inf_allo_24, then they only differ in the factor Temp and I get DE Genes for that factor, but when I go down to this level I have only less replicates (e.g. 2 vs 3) and following less power. So when I do the analysis without care to gender and parasite population and pool them to Inf 13 vs Inf 24 than I don't know the influence of the gender and of the parasite population. I there a way to get DE sig. Gene tables for downstream analyses where all pooled factors are taken into account for the DE result?

Perhaps I didn't understand the Multifactor Analysis Chapter right, but as far as I can see, I just can visualize what effect one factor in a multifactor design has, but not take that into account for the DEsigGene calculation?!

Thanks in advance!
Alex

**dietmar13** · 12-15-2012, 12:03 AM

limma

the good old limma package (GNU R) can model your design, and will find significantly differentially expressed genes, taking into account all of your other factors together...

read the userguide from page 110, normalize the data with edgeR and use limma's voom() function to produce cpm values. build your design matrix and calculate the DEGs for your contrast.

and if you end with no significant genes, you can use limma's (m)roast function for a sensitive gene set analysis...

i have tried limma on RNA-seq data, and it was more powerfull than DEseq, edgeR, and NOIseq on a 12 vs 12 paired design, but less powerfull than samR.

dietmar

**Alex.Manuel** · 12-15-2012, 04:24 AM

Hi Dietmar,

thanks for your advice. I don't know if normalizing this dataset and then use it like micro-array data is the right way ?! I just read the thread http://seqanswers.com/forums/showthread.php?t=4349 where this topic was discussed and the opinions differ.
As I got in the whole bioinformatics NGS thing by chance,I didn't know before that this analysis would be that complicated... and so heterogeneous in its approaches. It is a real jungle. Thought it would more a straight forward thing... and for me its hard to understand all these theoretical tests background, since I'm no mathematician...

Did the DEG results differ in your case when you used the different analysis approaches?

And isn't there a way to get it with an approach in DESeq? Just get a DEG FDR corrected table of sig genes that where more than one factor is taken into account?

cheers
Alex

**kopi-o** · 12-15-2012, 04:29 AM

I don't see why you couldn't do it with DESeq (or edgeR) if you use the proper design matrix. I'm more familiar with edgeR and there are lots of illustrative "recipes" in its user guide to show you how to do it. Probably the same is true for DESeq.

**dietmar13** · 12-15-2012, 05:02 AM

voom (limma) does more

hi alex,

voom does more than only normalizing the counts...

Bioinformatics Seminar – Professor Gordon Smyth “Variance models for RNA-seq”

http://statsandgenomes.wordpress.com/2011/11/23/bioinformatics-seminar-professor-gordon-smyth-variance-models-for-rna-seq/

Gordon is well known for his development of the limma Bioconductor package for the analysis of differential gene expression using microarrays. More recently his group has led the way in the develop…

for my comparison, see the overlap of different methods in the attached png (numbers of DEGs (5% FDR) at the bottom of the graph: S1=SAMseq, S4=limma).

and also in the attachment you can see the gene count dependence of called genes for various methods (only bayseq seems rather robust against this effect, except - of course - at low sum counts) - this is a real problem for RNA-seq data, especially for all interpretation methods like gene set enrichment methods (pathway analysis like DAVID or GO analysis)...

Attached Files

**Alex.Manuel** · 12-15-2012, 07:00 AM

@kopi-o: Yes you're right there are examples for almost every case in edgeR, but its not the same with DESeq...and I just have the missing knowledge to get it right. So if I get a script that shows me how it works... :-)

@Dietmar: Thanks again. But first I want to try this with DEseq if possible, because I have less time to finish the analysis (its my master thesis, deadline in 1 month...) and I need a quick solution. I did all possible pairwise comparisons with DESeq, with counttables created with htqeq-count from tophat mapped reads. So am a bit familiar with DESeq, but apparently not enough...
One more thing to remark. Its an informative Venn Plot. A quite nice comparison of the different approaches. But I thought that its not always the best to use the programme that gives you the most DEGenes when there could be many false positives in the gain...but I'm just a rookie not fully understanding the principles of DE analysis.

cheers
Alex

**dietmar13** · 12-15-2012, 07:37 AM

yes,

of course, the number ist not all. i have also interpreted the DEGs and compared to microarray results and found no obvious hint that the larger number of genes from some methods are from higher numbers of false positives. furthermore, the 12 paired samples were from normal colon mucosa compared to colon carcinoma, and it will not be complete wrong to assume, that in reality (nearly) all genes are differentially expressed. therefore the DEG number found by RNA-seq are (only) limited by technical problems and biases and statistical power. this could also be inferred from the increasing percentage of called genes along higher gene count numbers (see picture two posts above).

and finally, samseq (written by robert tibshirani, a guy who usually knows what he does) uses permutation as method for calling of DEGs, probably the best statistical method if enough biological replicates are present -...

**Wolfgang Huber** · 01-25-2013, 12:25 AM

Originally posted by lmolokin View Post

Wolfgang, thank you for your reply!
Here's the output:

R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

[...]

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq_1.10.1 lattice_0.20-10 locfit_1.5-8 Biobase_2.18.0 BiocGenerics_0.4.0
[...]

Dear Aleksey,
the code you sent looks OK. But there are two issues:

1. in a private message, you mentioned "error" messages, whereas above there are only warnings. That is a difference. Can you clarify exactly what error or other problem you are getting, or why you think something is wrong?

2. what you are doing here is fitting a mathematical model to some data, and you are asking whether the fit is OK. This depends not only on the model, but on the combination of model and data. Can you provide the full information, i.e. a reproducible piece of code and input data that others can re-run to recreate the problem that you report? Only then it is possible to help.

Thank you and best wishes
Wolfgang

**Simon Anders** · 01-29-2013, 12:56 AM

Dear Alex

Originally posted by Alex.Manuel View Post

I'm sorry to disturb this thread, but I think my question fits to the topic...
Though I'm a rookie in informatics and thus in R, I have not the knowledge in scripting and can't take all its possibilities into advantage.
Does anyone know, if its possible to do something like a multivariate analysis, that take several factors (in my case gender=male/female, Parasite=n/a /allopatric/sympatric, Temperature=13°C/24°C, Infection_status=ctrl/not_Inf/Inf) that counts for every individual with a differing number of biol. replicates for each treatment into account?
So when I compare two specific treatments e.g. male_Inf_allo_13 vs male_Inf_allo_24, then they only differ in the factor Temp and I get DE Genes for that factor, but when I go down to this level I have only less replicates (e.g. 2 vs 3) and following less power. So when I do the analysis without care to gender and parasite population and pool them to Inf 13 vs Inf 24 than I don't know the influence of the gender and of the parasite population. I there a way to get DE sig. Gene tables for downstream analyses where all pooled factors are taken into account for the DE result?

Perhaps I didn't understand the Multifactor Analysis Chapter right, but as far as I can see, I just can visualize what effect one factor in a multifactor design has, but not take that into account for the DEsigGene calculation?!

Of course, your other factors can be taken into account for the DE analysis. For your example, proceed as follows:

Prepare a 'design table', i.e., a data frame with one row for each sample and one column for each factor, with column names 'gender', 'parasite', 'temperature', 'infection'. Then call 'newCountDataSet, passing your count table and the design table as arguments.

Then proceed as explained in the vignette section on multicariate designs: call 'estimateSizeFactors' and 'estimateDispersion' with 'methid="pooled-CR"'. To perform the test, you need to fit two models, the "full model" with all factors you want to account for, and the "reduced model", that has all the factors except for the factor you want to test for. For example:

Code:

fitFull <- fitNbinomGLMs( cds, count ~ gender + parasite + temperature + infection )
fitRed <- fitNbinomGLMs( cds, count ~ gender + parasite + infection )
pvals <- nbinomGLMTest( fitFull, fitRed )

This now checks for an influence of temperature on gene expression, controlling for all the other factors, i.e., you see only those differences between samples from different temperatures that cannot be explained by the samples differing in any of the other factors, too.

Of course, this specific result is maybe not biologically that useful: You are probably not interested in the effect of temperature alone, but in so-called interactions: For example, how does the effect of an interaction differ between high- and low-temperature samples? You can do this, too, maybe with

Code:

fitFull <- fitNbinomGLMs( cds, count ~ gender + parasite + temperature * infection )
fitRed <- fitNbinomGLMs( cds, count ~ gender + parasite + temperature + infection )
pvals <- nbinomGLMTest( fitFull, fitRed )

I warn you, however, against using this code without fully understanding what the "*" means.

You will need to read up on linear models, experimental design, and analysis of variance (anova). This is basic knowledge that you need for any of the analysis techniques suggested here (DESeq, edgeR, limma, voom, etc all use varaiants of ANOVA, and you need to understand how it works.)

If you want more specific advice here on designing your anova query, we need much more details on your experiment: What factor combinations do you have and how many samples for each, what is the biology behind the factors, and most importantly: What biological question are you addresisng?

**zaki** · 03-27-2013, 11:04 PM

Originally posted by Simon Anders View Post

Dear Alex
Of course, your other factors can be taken into account for the DE analysis. For your example, proceed as follows:

Prepare a 'design table', i.e., a data frame with one row for each sample and one column for each factor, with column names 'gender', 'parasite', 'temperature', 'infection'. Then call 'newCountDataSet, passing your count table and the design table as arguments.

Then proceed as explained in the vignette section on multicariate designs: call 'estimateSizeFactors' and 'estimateDispersion' with 'methid="pooled-CR"'. To perform the test, you need to fit two models, the "full model" with all factors you want to account for, and the "reduced model", that has all the factors except for the factor you want to test for. For example:

Dear Simon and kind members of the post,

Can I continue the discussion with my own question …It will be very helpful if members of the thread can share your opinions

I am interested in doing a multi-factor design, as I think this is most appropriate for my dataset (or at least I believe??,I am not too sure) My dataset is an RNA-seq count data (obtained by htseq-count) from 7 cell lines ; 4 are tumours which represents individual tumour cell culture (T100,T200,T300,T400) and 2 are normals which represents individual normal cell culture(N100, N200). I have no replicates for the cell lines (example - I have NO "T100_1" , "T100_2", T100 is one cell culture, T200 is one cell culture...etc. All individual cell lines were sequenced once using RNA-seq)

From prior data in the lab, we know certain cell lines (T100,T200) are sensitive to drug_A, and certain cell lines (T300,T400) are resistant to drug_A. A plausible research question (among others) I am trying to address – what are the genes differently expressed between the cell line which are sensitive to drug_A and cell line which are not sensitive to drug_A. This will inform us if expression can correlate drug response. I am also interested in other drugs (eg drug_B,C,D etc)

An example of my data would look as below

head(counts(cds))

Code:

                T200 T100 T300 T400 N100 N200
gene_1 1085  769  725  732  992  699
gene_2    0    0    0    0    0    0
gene_3 1250 1035 1969 2061 1718  766
gene_4  652  414  332  424  604  403
gene_5 1061  511  829 1116 1174  163
gene_6    8   11   22    0    8    2

And for my meta data ( I am not sure if this is an appropriate way to assign the meta values to this data)

pData(cds)

Code:

     sizeFactor sample        drug_A        drug_B
T200  0.9165466   T200     sensitive     sensitive
T100  1.0089820   T100     sensitive     sensitive
T300  1.1116448   T300 not_sensitive not_sensitive
T400  1.1357586   T400 not_sensitive        medium
N100  1.3424706 normal       unknown       unknown
N200  0.7023514 normal       unknown       unknown

some concerns about the meta data
1- I am grouping the two normal together (since I have no replicate in the tumour cell lines) - Is this justified?? I think yes - since the normal gene expression in normal samples should be the same (verified by PCA plot)
2 - No data is available on the normal cell line sensitivity to drug response therefore I just place it as unknown. Again - Is this ok??

Moving into Differential expression testing...

I proceeded with the generalized linear model (GLMs) for muli-factoral design (as described in vignette and the previous post)

Code:

cds = estimateSizeFactors( cds )
cds = estimateDispersions( cds )
fit1 = fitNbinomGLMs( cds, count ~ sample + drug_A )
fit2 = fitNbinomGLMs( cds, count ~ sample )

Some question I have in mind

1 Is the way I am assigning values to meta data correct?? In other word - To test my research question, should I conduct a multi-factor or a two-factor design??

Example of what I think two-factor design would be -

Code:

     sizeFactor sample
T200  0.9165466   T200
T100  1.0089820   T100
T300  1.1116448   T300
T400  1.1357586   T400
N100  1.3424706 normal
N200  0.7023514 normal

2 Question 2 assumes I am conducting a multi-factor design using the fit1 and fit2 function above. The question - Why do I get "NA" for $drug_Asensitve & $drug_Aunknown when I view the fit1 details? Also should there not be a $drug_Anot_sensitive parameter as well in the fit1 information??

str(fit1)

Code:

 'data.frame':	56299 obs. of  9 variables:
 $ (Intercept)    : num  9.76 NA 10.21 9 9.11 ...
 $ sampleT100     : num  -0.185 NA -0.209 -0.318 -0.129 ...
 $ sampleT200     : num  0.45 NA 0.202 0.476 1.064 ...
 $ sampleT300     : num  -0.41 NA 0.579 -0.776 0.429 ...
 $ sampleT400     : num  -0.427 NA 0.614 -0.454 0.827 ...
 $ drug_Asensitive: num  NA NA NA NA NA NA NA NA NA NA ...
 $ drug_Aunknown  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ deviance       : num  0.536 NA 0.171 0.303 2.231 ...
 $ converged      : logi  TRUE NA TRUE TRUE TRUE TRUE ...
 - attr(*, "df.residual")= num 1

Any...feedback that you can provide would be tremendously valuable!

Many thanks kind member

**Simon Anders** · 04-03-2013, 05:46 AM

The two-way design is wrong; you only have one factor, namely "drug". "Sample" is not useful as factor as you have only one sample for each of its levels.

The one-way analysis should not include the "unknown" samples if you only want to compare "responsive" with "not_responsive".

**dimi** · 06-12-2013, 06:36 AM

another multi-factor design

Dear Simon,

I have a quite complex design that I need to handle.
I have samples with three conditions, namely control, drug1 and drug2 and I want to see the effect of each drug in comparison to the control.

The problem is that because the people who designed the experiment did not know exactly when each drug reaches the maximum effect, they have samples with different treatment duration (12, 18, 24 and 48 hrs, 3 replicates each).

Not that being enough, they have thought of "simulating" the noise that could be caused by different effects during the day and the drugs have been applied in different times of the day (I call this application time). So for each of the 3 biological replicates above, I have 1 at 9am, 1 at 5pm and 1 at 9pm.

As far as I understood, I should first subset my count dataset into 2 separate ones; a. control + drug1 included and b. control + drug2 included.

I did so to test for the effect of drug1 (after estimateDispersions with method="pooled-CR"):

Code:

cdsMulti1 <- cdsMulti[ , pData(cdsMulti)$condition %in% c( "Ctrl", "Drug1" ) ]
fitFull1 <- fitNbinomGLMs(cdsMulti1, count ~ applicationTime + treatmentDuration * condition)
fitDrug1 <- fitNbinomGLMs(cdsMulti1, count ~ applicationTime + treatmentDuration + condition)

But when looking at the structure of fitFull1, everything that has to do with condition Drug1 is NA. Is this to be expected?? Or is my design wrong???

After that, when I continue with the test, I do the following:

Code:

pvalsGLM1 <- nbinomGLMTest(fitFull1, fitDrug1)
padjGLM1 <- p.adjust(pvalsGLM1, method="BH")
table(padjGLM1 <.1)
# FALSE  TRUE 
# 25696   239
degs1 <- rownames(counts(cdsMulti1))[ padjGLM1 < .1 ]

And when I have a look at the degs1, the non-NAs should be my "focus" gene IDs, is that right?

Thanks in advance,
Dimitra

Code:

sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] FactoMineR_1.25            leaps_2.9                 
 [3] scatterplot3d_0.3-33       cluster_1.14.4            
 [5] lattice_0.20-15            ellipse_0.3-8             
 [7] car_2.0-17                 nnet_7.3-6                
 [9] MASS_7.3-23                PerformanceAnalytics_1.1.0
[11] xts_0.9-3                  zoo_1.7-9                 
[13] DESeq_1.8.3                locfit_1.5-9.1            
[15] Biobase_2.18.0             BiocGenerics_0.4.0        
[17] plotrix_3.4-7              RColorBrewer_1.0-5        
[19] scales_0.2.3               reshape2_1.2.2            
[21] plyr_1.8                   ggplot2_0.9.3.1           

loaded via a namespace (and not attached):
 [1] annotate_1.36.0      AnnotationDbi_1.20.7 colorspace_1.2-2    
 [4] DBI_0.2-7            dichromat_2.0-0      digest_0.6.3        
 [7] genefilter_1.40.0    geneplotter_1.36.0   grid_2.15.3         
[10] gtable_0.1.2         IRanges_1.16.6       KernSmooth_2.23-10  
[13] labeling_0.1         munsell_0.4          parallel_2.15.3     
[16] proto_0.3-10         RSQLite_0.11.3       splines_2.15.3      
[19] stats4_2.15.3        stringr_0.6.2        survival_2.37-4     
[22] tools_2.15.3         XML_3.96-1.1         xtable_1.7-1

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 49 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

DESeq, experimental design

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News