Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq, experimental design

    Hi Everyone,
    I'm looking for some help to make sure my data frame is set up correctly. My experiment has 30 samples from 10 subjects, 3 time points each, and I want to tell the software to look for differential gene expression between 3 time points (or essentially baseline (t1) vs t2, and baseline vs t3 while grouping the samples that come from the same subject (see the attachment). The code I'm using is below. I get as far attempting to estimate the dispersion but run into greater than 50 "glm.fit: algorithm did not converge" warnings.

    Code:
    count_table <- read.csv("C:\\filepath...file.csv", header=TRUE, row.names=1)
    head(count_table)
    expt_design1 <- data.frame(row.names = colnames(count_table), subject = c(rep("subj01",3),rep("subj02",3),rep("subj03",3),rep("subj04",3),rep("subj05",3),rep("subj06",3),rep("subj07",3),rep("subj08",3),rep("subj09",3),rep("subj10",3)),time = c("t1","t2","t3","t1","t2","t3","t1","t2","t3","t1","t2","t3","t1","t2","t3","t1","t2","t3","t1","t2","t3","t1","t2","t3","t1","t2","t3","t1","t2","t3"))
    cds <- newCountDataSet(count_table, expt_design)
    head(counts(cds))
    cds = estimateSizeFactors(cds)
    sizeFactors(cds)
    head(counts(cds,normalized=TRUE))
    cds = estimateDispersions(cds, method = c("pooled-CR"), sharingMode = c("gene-est-only"),modelFormula = (count ~ time + subject))
    I'm not sure this is the right way to group samples by subject...

    I'm still very much in the process of learning how to use this package and would appreciate anyone's help on this immensely! Thanks in advance!
    Attached Files

  • #2
    lmolokin,

    Can you make sure you are using the most recent (release) version?
    If so, what is the output of sessionInfo()?

    Best wishes
    Wolfgang
    Wolfgang Huber
    EMBL

    Comment


    • #3
      Wolfgang, thank you for your reply!

      Here's the output:

      R version 2.15.2 (2012-10-26)
      Platform: x86_64-w64-mingw32/x64 (64-bit)

      locale:
      [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
      [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
      [5] LC_TIME=English_United States.1252

      attached base packages:
      [1] stats graphics grDevices utils datasets methods base

      other attached packages:
      [1] DESeq_1.10.1 lattice_0.20-10 locfit_1.5-8 Biobase_2.18.0 BiocGenerics_0.4.0

      loaded via a namespace (and not attached):
      [1] annotate_1.36.0 AnnotationDbi_1.20.2 DBI_0.2-5 genefilter_1.40.0
      [5] geneplotter_1.36.0 grid_2.15.2 IRanges_1.16.4 MASS_7.3-22
      [9] parallel_2.15.2 RColorBrewer_1.0-5 RSQLite_0.11.2 splines_2.15.2
      [13] stats4_2.15.2 survival_2.36-14 tools_2.15.2 XML_3.95-0.1
      [17] xtable_1.7-0

      Comment


      • #4
        Hi Community,

        I'm sorry to disturb this thread, but I think my question fits to the topic...
        Though I'm a rookie in informatics and thus in R, I have not the knowledge in scripting and can't take all its possibilities into advantage.
        Does anyone know, if its possible to do something like a multivariate analysis, that take several factors (in my case gender=male/female, Parasite=n/a /allopatric/sympatric, Temperature=13°C/24°C, Infection_status=ctrl/not_Inf/Inf) that counts for every individual with a differing number of biol. replicates for each treatment into account?
        So when I compare two specific treatments e.g. male_Inf_allo_13 vs male_Inf_allo_24, then they only differ in the factor Temp and I get DE Genes for that factor, but when I go down to this level I have only less replicates (e.g. 2 vs 3) and following less power. So when I do the analysis without care to gender and parasite population and pool them to Inf 13 vs Inf 24 than I don't know the influence of the gender and of the parasite population. I there a way to get DE sig. Gene tables for downstream analyses where all pooled factors are taken into account for the DE result?

        Perhaps I didn't understand the Multifactor Analysis Chapter right, but as far as I can see, I just can visualize what effect one factor in a multifactor design has, but not take that into account for the DEsigGene calculation?!

        Thanks in advance!
        Alex

        Comment


        • #5
          limma

          the good old limma package (GNU R) can model your design, and will find significantly differentially expressed genes, taking into account all of your other factors together...

          read the userguide from page 110, normalize the data with edgeR and use limma's voom() function to produce cpm values. build your design matrix and calculate the DEGs for your contrast.

          and if you end with no significant genes, you can use limma's (m)roast function for a sensitive gene set analysis...

          i have tried limma on RNA-seq data, and it was more powerfull than DEseq, edgeR, and NOIseq on a 12 vs 12 paired design, but less powerfull than samR.


          dietmar

          Comment


          • #6
            Hi Dietmar,

            thanks for your advice. I don't know if normalizing this dataset and then use it like micro-array data is the right way ?! I just read the thread http://seqanswers.com/forums/showthread.php?t=4349 where this topic was discussed and the opinions differ.
            As I got in the whole bioinformatics NGS thing by chance,I didn't know before that this analysis would be that complicated... and so heterogeneous in its approaches. It is a real jungle. Thought it would more a straight forward thing... and for me its hard to understand all these theoretical tests background, since I'm no mathematician...

            Did the DEG results differ in your case when you used the different analysis approaches?

            And isn't there a way to get it with an approach in DESeq? Just get a DEG FDR corrected table of sig genes that where more than one factor is taken into account?

            cheers
            Alex

            Comment


            • #7
              I don't see why you couldn't do it with DESeq (or edgeR) if you use the proper design matrix. I'm more familiar with edgeR and there are lots of illustrative "recipes" in its user guide to show you how to do it. Probably the same is true for DESeq.

              Comment


              • #8
                voom (limma) does more

                hi alex,

                voom does more than only normalizing the counts...

                Gordon is well known for his development of the limma Bioconductor package for the analysis of differential gene expression using microarrays. More recently his group has led the way in the develop…


                for my comparison, see the overlap of different methods in the attached png (numbers of DEGs (5% FDR) at the bottom of the graph: S1=SAMseq, S4=limma).

                and also in the attachment you can see the gene count dependence of called genes for various methods (only bayseq seems rather robust against this effect, except - of course - at low sum counts) - this is a real problem for RNA-seq data, especially for all interpretation methods like gene set enrichment methods (pathway analysis like DAVID or GO analysis)...
                Attached Files

                Comment


                • #9
                  @kopi-o: Yes you're right there are examples for almost every case in edgeR, but its not the same with DESeq...and I just have the missing knowledge to get it right. So if I get a script that shows me how it works... :-)

                  @Dietmar: Thanks again. But first I want to try this with DEseq if possible, because I have less time to finish the analysis (its my master thesis, deadline in 1 month...) and I need a quick solution. I did all possible pairwise comparisons with DESeq, with counttables created with htqeq-count from tophat mapped reads. So am a bit familiar with DESeq, but apparently not enough...
                  One more thing to remark. Its an informative Venn Plot. A quite nice comparison of the different approaches. But I thought that its not always the best to use the programme that gives you the most DEGenes when there could be many false positives in the gain...but I'm just a rookie not fully understanding the principles of DE analysis.

                  cheers
                  Alex

                  Comment


                  • #10
                    yes,

                    of course, the number ist not all. i have also interpreted the DEGs and compared to microarray results and found no obvious hint that the larger number of genes from some methods are from higher numbers of false positives. furthermore, the 12 paired samples were from normal colon mucosa compared to colon carcinoma, and it will not be complete wrong to assume, that in reality (nearly) all genes are differentially expressed. therefore the DEG number found by RNA-seq are (only) limited by technical problems and biases and statistical power. this could also be inferred from the increasing percentage of called genes along higher gene count numbers (see picture two posts above).

                    and finally, samseq (written by robert tibshirani, a guy who usually knows what he does) uses permutation as method for calling of DEGs, probably the best statistical method if enough biological replicates are present -...

                    Comment


                    • #11
                      Originally posted by lmolokin View Post
                      Wolfgang, thank you for your reply!
                      Here's the output:

                      R version 2.15.2 (2012-10-26)
                      Platform: x86_64-w64-mingw32/x64 (64-bit)

                      [...]

                      attached base packages:
                      [1] stats graphics grDevices utils datasets methods base

                      other attached packages:
                      [1] DESeq_1.10.1 lattice_0.20-10 locfit_1.5-8 Biobase_2.18.0 BiocGenerics_0.4.0
                      [...]
                      Dear Aleksey,
                      the code you sent looks OK. But there are two issues:

                      1. in a private message, you mentioned "error" messages, whereas above there are only warnings. That is a difference. Can you clarify exactly what error or other problem you are getting, or why you think something is wrong?

                      2. what you are doing here is fitting a mathematical model to some data, and you are asking whether the fit is OK. This depends not only on the model, but on the combination of model and data. Can you provide the full information, i.e. a reproducible piece of code and input data that others can re-run to recreate the problem that you report? Only then it is possible to help.

                      Thank you and best wishes
                      Wolfgang
                      Wolfgang Huber
                      EMBL

                      Comment


                      • #12
                        Dear Alex

                        Originally posted by Alex.Manuel View Post
                        I'm sorry to disturb this thread, but I think my question fits to the topic...
                        Though I'm a rookie in informatics and thus in R, I have not the knowledge in scripting and can't take all its possibilities into advantage.
                        Does anyone know, if its possible to do something like a multivariate analysis, that take several factors (in my case gender=male/female, Parasite=n/a /allopatric/sympatric, Temperature=13°C/24°C, Infection_status=ctrl/not_Inf/Inf) that counts for every individual with a differing number of biol. replicates for each treatment into account?
                        So when I compare two specific treatments e.g. male_Inf_allo_13 vs male_Inf_allo_24, then they only differ in the factor Temp and I get DE Genes for that factor, but when I go down to this level I have only less replicates (e.g. 2 vs 3) and following less power. So when I do the analysis without care to gender and parasite population and pool them to Inf 13 vs Inf 24 than I don't know the influence of the gender and of the parasite population. I there a way to get DE sig. Gene tables for downstream analyses where all pooled factors are taken into account for the DE result?

                        Perhaps I didn't understand the Multifactor Analysis Chapter right, but as far as I can see, I just can visualize what effect one factor in a multifactor design has, but not take that into account for the DEsigGene calculation?!
                        Of course, your other factors can be taken into account for the DE analysis. For your example, proceed as follows:

                        Prepare a 'design table', i.e., a data frame with one row for each sample and one column for each factor, with column names 'gender', 'parasite', 'temperature', 'infection'. Then call 'newCountDataSet, passing your count table and the design table as arguments.

                        Then proceed as explained in the vignette section on multicariate designs: call 'estimateSizeFactors' and 'estimateDispersion' with 'methid="pooled-CR"'. To perform the test, you need to fit two models, the "full model" with all factors you want to account for, and the "reduced model", that has all the factors except for the factor you want to test for. For example:

                        Code:
                        fitFull <- fitNbinomGLMs( cds, count ~ gender + parasite + temperature + infection )
                        fitRed <- fitNbinomGLMs( cds, count ~ gender + parasite + infection )
                        pvals <- nbinomGLMTest( fitFull, fitRed )
                        This now checks for an influence of temperature on gene expression, controlling for all the other factors, i.e., you see only those differences between samples from different temperatures that cannot be explained by the samples differing in any of the other factors, too.

                        Of course, this specific result is maybe not biologically that useful: You are probably not interested in the effect of temperature alone, but in so-called interactions: For example, how does the effect of an interaction differ between high- and low-temperature samples? You can do this, too, maybe with

                        Code:
                        fitFull <- fitNbinomGLMs( cds, count ~ gender + parasite + temperature * infection )
                        fitRed <- fitNbinomGLMs( cds, count ~ gender + parasite + temperature + infection )
                        pvals <- nbinomGLMTest( fitFull, fitRed )
                        I warn you, however, against using this code without fully understanding what the "*" means.

                        You will need to read up on linear models, experimental design, and analysis of variance (anova). This is basic knowledge that you need for any of the analysis techniques suggested here (DESeq, edgeR, limma, voom, etc all use varaiants of ANOVA, and you need to understand how it works.)

                        If you want more specific advice here on designing your anova query, we need much more details on your experiment: What factor combinations do you have and how many samples for each, what is the biology behind the factors, and most importantly: What biological question are you addresisng?

                        Comment


                        • #13
                          Originally posted by Simon Anders View Post
                          Dear Alex
                          Of course, your other factors can be taken into account for the DE analysis. For your example, proceed as follows:

                          Prepare a 'design table', i.e., a data frame with one row for each sample and one column for each factor, with column names 'gender', 'parasite', 'temperature', 'infection'. Then call 'newCountDataSet, passing your count table and the design table as arguments.

                          Then proceed as explained in the vignette section on multicariate designs: call 'estimateSizeFactors' and 'estimateDispersion' with 'methid="pooled-CR"'. To perform the test, you need to fit two models, the "full model" with all factors you want to account for, and the "reduced model", that has all the factors except for the factor you want to test for. For example:
                          Dear Simon and kind members of the post,

                          Can I continue the discussion with my own question …It will be very helpful if members of the thread can share your opinions

                          I am interested in doing a multi-factor design, as I think this is most appropriate for my dataset (or at least I believe??,I am not too sure) My dataset is an RNA-seq count data (obtained by htseq-count) from 7 cell lines ; 4 are tumours which represents individual tumour cell culture (T100,T200,T300,T400) and 2 are normals which represents individual normal cell culture(N100, N200). I have no replicates for the cell lines (example - I have NO "T100_1" , "T100_2", T100 is one cell culture, T200 is one cell culture...etc. All individual cell lines were sequenced once using RNA-seq)

                          From prior data in the lab, we know certain cell lines (T100,T200) are sensitive to drug_A, and certain cell lines (T300,T400) are resistant to drug_A. A plausible research question (among others) I am trying to address – what are the genes differently expressed between the cell line which are sensitive to drug_A and cell line which are not sensitive to drug_A. This will inform us if expression can correlate drug response. I am also interested in other drugs (eg drug_B,C,D etc)

                          An example of my data would look as below

                          head(counts(cds))
                          Code:
                                          T200 T100 T300 T400 N100 N200
                          gene_1 1085  769  725  732  992  699
                          gene_2    0    0    0    0    0    0
                          gene_3 1250 1035 1969 2061 1718  766
                          gene_4  652  414  332  424  604  403
                          gene_5 1061  511  829 1116 1174  163
                          gene_6    8   11   22    0    8    2
                          And for my meta data ( I am not sure if this is an appropriate way to assign the meta values to this data)

                          pData(cds)
                          Code:
                               sizeFactor sample        drug_A        drug_B
                          T200  0.9165466   T200     sensitive     sensitive
                          T100  1.0089820   T100     sensitive     sensitive
                          T300  1.1116448   T300 not_sensitive not_sensitive
                          T400  1.1357586   T400 not_sensitive        medium
                          N100  1.3424706 normal       unknown       unknown
                          N200  0.7023514 normal       unknown       unknown
                          some concerns about the meta data
                          1- I am grouping the two normal together (since I have no replicate in the tumour cell lines) - Is this justified?? I think yes - since the normal gene expression in normal samples should be the same (verified by PCA plot)
                          2 - No data is available on the normal cell line sensitivity to drug response therefore I just place it as unknown. Again - Is this ok??

                          Moving into Differential expression testing...
                          I proceeded with the generalized linear model (GLMs) for muli-factoral design (as described in vignette and the previous post)
                          Code:
                          cds = estimateSizeFactors( cds )
                          cds = estimateDispersions( cds )
                          fit1 = fitNbinomGLMs( cds, count ~ sample + drug_A )
                          fit2 = fitNbinomGLMs( cds, count ~ sample )
                          Some question I have in mind

                          1 Is the way I am assigning values to meta data correct?? In other word - To test my research question, should I conduct a multi-factor or a two-factor design??

                          Example of what I think two-factor design would be -
                          Code:
                               sizeFactor sample
                          T200  0.9165466   T200
                          T100  1.0089820   T100
                          T300  1.1116448   T300
                          T400  1.1357586   T400
                          N100  1.3424706 normal
                          N200  0.7023514 normal
                          2 Question 2 assumes I am conducting a multi-factor design using the fit1 and fit2 function above. The question - Why do I get "NA" for $drug_Asensitve & $drug_Aunknown when I view the fit1 details? Also should there not be a $drug_Anot_sensitive parameter as well in the fit1 information??

                          str(fit1)
                          Code:
                           'data.frame':	56299 obs. of  9 variables:
                           $ (Intercept)    : num  9.76 NA 10.21 9 9.11 ...
                           $ sampleT100     : num  -0.185 NA -0.209 -0.318 -0.129 ...
                           $ sampleT200     : num  0.45 NA 0.202 0.476 1.064 ...
                           $ sampleT300     : num  -0.41 NA 0.579 -0.776 0.429 ...
                           $ sampleT400     : num  -0.427 NA 0.614 -0.454 0.827 ...
                           $ drug_Asensitive: num  NA NA NA NA NA NA NA NA NA NA ...
                           $ drug_Aunknown  : num  NA NA NA NA NA NA NA NA NA NA ...
                           $ deviance       : num  0.536 NA 0.171 0.303 2.231 ...
                           $ converged      : logi  TRUE NA TRUE TRUE TRUE TRUE ...
                           - attr(*, "df.residual")= num 1

                          Any...feedback that you can provide would be tremendously valuable!

                          Many thanks kind member

                          Comment


                          • #14
                            The two-way design is wrong; you only have one factor, namely "drug". "Sample" is not useful as factor as you have only one sample for each of its levels.

                            The one-way analysis should not include the "unknown" samples if you only want to compare "responsive" with "not_responsive".

                            Comment


                            • #15
                              another multi-factor design

                              Dear Simon,

                              I have a quite complex design that I need to handle.
                              I have samples with three conditions, namely control, drug1 and drug2 and I want to see the effect of each drug in comparison to the control.

                              The problem is that because the people who designed the experiment did not know exactly when each drug reaches the maximum effect, they have samples with different treatment duration (12, 18, 24 and 48 hrs, 3 replicates each).

                              Not that being enough, they have thought of "simulating" the noise that could be caused by different effects during the day and the drugs have been applied in different times of the day (I call this application time). So for each of the 3 biological replicates above, I have 1 at 9am, 1 at 5pm and 1 at 9pm.

                              As far as I understood, I should first subset my count dataset into 2 separate ones; a. control + drug1 included and b. control + drug2 included.

                              I did so to test for the effect of drug1 (after estimateDispersions with method="pooled-CR"):
                              Code:
                              cdsMulti1 <- cdsMulti[ , pData(cdsMulti)$condition %in% c( "Ctrl", "Drug1" ) ]
                              fitFull1 <- fitNbinomGLMs(cdsMulti1, count ~ applicationTime + treatmentDuration * condition)
                              fitDrug1 <- fitNbinomGLMs(cdsMulti1, count ~ applicationTime + treatmentDuration + condition)
                              But when looking at the structure of fitFull1, everything that has to do with condition Drug1 is NA. Is this to be expected?? Or is my design wrong???

                              After that, when I continue with the test, I do the following:
                              Code:
                              pvalsGLM1 <- nbinomGLMTest(fitFull1, fitDrug1)
                              padjGLM1 <- p.adjust(pvalsGLM1, method="BH")
                              table(padjGLM1 <.1)
                              # FALSE  TRUE 
                              # 25696   239
                              degs1 <- rownames(counts(cdsMulti1))[ padjGLM1 < .1 ]
                              And when I have a look at the degs1, the non-NAs should be my "focus" gene IDs, is that right?

                              Thanks in advance,
                              Dimitra

                              Code:
                              sessionInfo()
                              R version 2.15.3 (2013-03-01)
                              Platform: x86_64-pc-linux-gnu (64-bit)
                              
                              locale:
                               [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
                               [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
                               [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
                               [7] LC_PAPER=C                 LC_NAME=C                 
                               [9] LC_ADDRESS=C               LC_TELEPHONE=C            
                              [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
                              
                              attached base packages:
                              [1] stats     graphics  grDevices utils     datasets  methods   base     
                              
                              other attached packages:
                               [1] FactoMineR_1.25            leaps_2.9                 
                               [3] scatterplot3d_0.3-33       cluster_1.14.4            
                               [5] lattice_0.20-15            ellipse_0.3-8             
                               [7] car_2.0-17                 nnet_7.3-6                
                               [9] MASS_7.3-23                PerformanceAnalytics_1.1.0
                              [11] xts_0.9-3                  zoo_1.7-9                 
                              [13] DESeq_1.8.3                locfit_1.5-9.1            
                              [15] Biobase_2.18.0             BiocGenerics_0.4.0        
                              [17] plotrix_3.4-7              RColorBrewer_1.0-5        
                              [19] scales_0.2.3               reshape2_1.2.2            
                              [21] plyr_1.8                   ggplot2_0.9.3.1           
                              
                              loaded via a namespace (and not attached):
                               [1] annotate_1.36.0      AnnotationDbi_1.20.7 colorspace_1.2-2    
                               [4] DBI_0.2-7            dichromat_2.0-0      digest_0.6.3        
                               [7] genefilter_1.40.0    geneplotter_1.36.0   grid_2.15.3         
                              [10] gtable_0.1.2         IRanges_1.16.6       KernSmooth_2.23-10  
                              [13] labeling_0.1         munsell_0.4          parallel_2.15.3     
                              [16] proto_0.3-10         RSQLite_0.11.3       splines_2.15.3      
                              [19] stats4_2.15.3        stringr_0.6.2        survival_2.37-4     
                              [22] tools_2.15.3         XML_3.96-1.1         xtable_1.7-1
                              Last edited by dimi; 06-12-2013, 06:59 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X