Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2 question in dealing with time course data.

    Dear all,
    I have a general question regarding my time course RNAseq data analysis. I am using a kind of worm which can regenerate the missing body part, and I would like to know the significantly changing genes and expression profiles during regeneration.

    I have 11 time points: 0hr (as control), 1hr post injury, 2hr, 4hr, 6hr,....., until 48hr. The strategy in my mind is to do the pairwise comparison between "0hr and 1hr", "0hr and 2hr", "0hr and 4hr"......, "0hr and 48hr" using DESeq2 , and filter out those genes which don't show any change across all time points compared to 0hr.


    My questions are:

    1. Should I individually pre-extract the pairwise datasets (raw read count data) and do the DESeq2 test (dds function) separately? or I can do the DESeq2 test first based on whole raw read count data containing all time points and later extract the pairs I want (0hr vs 1hr, 0hr vs 2hr......). The confusing point is that since dds function estimates the size factor, calculate gene dispersion, create the model and test the model fitting, I wonder if using pre-extracted pair of raw read count data and using whole time points can result in different estimation and calculation (and maybe also padj due to the different testing sample size).


    2. If I also want to analyze the post-calculating fold-change relatively to time 0 of differentially expressed genes from DESeq2 results (e.g. to do the clustering), should I use the providing fold-changes from dds&res function or use the variance-stabilizing transformation?


    My goal to deal with those data is to first extract genes with at least one differential expression (e.g. use padj < 0.05) across all time points (filter out those showing no changes relative to control). Second, I want to apply the fold-change threshold (e.g. FC > 1.5 or FC < 0.5) to further extract highly changing gene profiles. Third, I want to use the padj&FC-extracted subset of profiles for clustering, GO-term enrichment test, and other analyses.


    I am really new. So any opinion would be helpful.
    Thank you.

  • #2
    Dear Luise

    my advice here would be to put less emphasis on the testing (your point 1.) and move straight to point 2. (clustering), using one of the transformations described in the DESeq2 vignette to bring the data to a 'well-behaved' (log-like) scale.

    To filter out the genes that vary not much, use the range (max-min) or IQR and a subjective cutoff (e.g. retain the top 20% of genes), then use standard clustering functions (e.g. pam from the cluster package), explore the temporal profiles of the clusters and other exploratory data analyses (e.g. PCA) to see the types of behaviours.

    The cutoff choice is the one variable in here where you may be a bit unsure, but the same problem arises also with testing (you need e.g. an FDR cutoff), and in fact the best to overcome it is to just do it for a couple of different choices and see that the major temporal profiles are the same.

    Hope this helps
    Wolfgang
    Wolfgang Huber
    EMBL

    Comment


    • #3
      Thank you very much for the reply Wolfgang.

      BTW, I just realized I forgot to mention that I have biological duplicate data for each time point. If taking this into consideration, do you still think that I should not use the pairwise DESeq2 test (my point 1)?

      If yes, could you maybe tell me more about why, or what you think that direct going into clustering is better?

      I actually have done the individual pairwise test by
      1)extract raw read counts of one pair (e.g. 0hr vs 1hr)
      2)use DESeq2 differentially expressed test for each pair
      3)set padj threshold to 0.05, and fold-change > 1.5 or < 0.66 for filtering
      4)repeat step 1~3 for other pairs of data (e.g. 0hr vs 2hr; 0hr vs 4hr.......)
      5)collect genes at least having one differentially expressed value (fold-change)
      6)post-analyze those collected genes (I used the fold-changes provided from DESeq2 results to make the gene expression profile across time)

      By doing these roughly 4k genes are remained (out of 25k from de novo assembled transcriptome) for post-analyses, which makes me think that most of the genes don't change so much (amplitude too low) or the changes are not reliable (padj too high). That's the major reason why I might want to keep using DESeq2 in my analytical pipeline.

      Well, if you agree with my above points, then the question comes back to "how I should conduct the DESeq2 test to get proper analysis". Is the pre-extracting data strategy followed by DESeq2 test OK, or should I use whole data containing all time points for dds() test and then extract the paired results later? Can I use the fold-changes from DESeq2 for following shape-based analyses?


      Really thank you for your patience.

      Best,
      Shang-Yun

      Comment


      • #4
        hi Shang-Yun,

        The choice of clustering and/or testing is up to you, there is not always one 'proper analysis'.

        From your goals it sounds like clustering and looking for profiles was the main goal, and you don't need to remove genes which don't vary across time in order to perform clustering. However, it is a good idea to stabilize the variance as Wolfgang points out.

        If you do want to do testing to find genes which change over time, I would recommend the likelihood ratio test which is explained in the vignette. This would test whether a design of '~ time' explains more than expected by chance than '~ 1', just using an intercept. In other words, this gives you the genes which have some statistically significant differences at any time point.

        Comment


        • #5
          Hi Michael,
          many thanks for your reply and suggestion. I'll for sure try the likelihood ratio test to see what I can get after that.

          The purpose why I removed those genes which show no differential expression at any time point is because I do observe lots of noisy expression profiles, even after variance-stabilized transformation. If I keep those noisy profiles in the clustering analysis and GO-term enrichment, they clearly make the clustering result terrible, and increase the difficulty in GO-term enrichment test.

          Thanks!!

          Comment


          • #6
            Back to the time course analysis question

            Hi Michael and Wolfgang,

            Apologies for going back to this old point but I have the same question as liuse. I have RNA-Seq time course data with controls for each time point and 3-4 biological replicates per group. I have already done the differential expression analysis and I got great results with biologically sensible patterns clearly evident across the time course.
            My concern is that like liuse, I did my analysis by subsetting the different time points prior to calculating dispersions, and only looked at pairwise comparisons between control and treated samples from the same time period. One reason I felt this was justified was because there seemed to be an effect of time on gene expression as evinced by the attached image showing clustering of just the control samples.
            However, I now have metadata on the dataset and I would like to see the effect of some of these other variables on gene expression using the likelihood ration test function. Unfortunately, if I want to be consistent I would have to run the LRT for each time point and then perhaps compare to find which genes are significant across all time points versus which genes which are time point-specific. The other times I have used the LRT, it has been with dispersion calculated on the whole dataset (I had no time point-specific controls at the time) so I'm not sure if doing individual time points is a valid approach with the LRT or if it would be better to calculate the dispersion on the whole dataset and compare relevant time points using contrasts and then proceed from there. What do you recommend?

            Regards,
            Kwasi
            Attached Files

            Comment


            • #7
              We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

              Here we walk through an end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.

              Comment


              • #8
                Many thanks for the prompt response Michael!

                Comment


                • #9
                  DESeq2

                  I read the example of a basic analysis of a time-series experiment at the bottom of this workflow:
                  Here we walk through an end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.


                  the fission data had replicates for both mutant and wild type, and was wondering in experiments with no replicates, how the script changes if we want to answer the same question?

                  thanks




                  Originally posted by Michael Love View Post
                  We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

                  http://bioconductor.org/help/workflows/rnaseqGene/

                  Comment


                  • #10
                    It is possible to test for condition-specific changes over time without replicates with DESeq2 (or edgeR or limma). Wolfgang wrote up a nice description here: https://support.bioconductor.org/p/61884/#61895

                    For example, one can use the bs function from the splines library to form splines using the time variable: https://stat.ethz.ch/R-manual/R-deve...s/html/bs.html, and one can include an interaction of the spline and the condition, and use the likelihood ratio test with a reduced design which does not include the interaction term.

                    This requires additional R skills on the part of the user, so we recommend only for users who are familiar with time series analysis in R and fitting splines.
                    Last edited by Michael Love; 01-02-2015, 03:25 AM.

                    Comment


                    • #11
                      Originally posted by Michael Love View Post
                      We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

                      http://bioconductor.org/help/workflows/rnaseqGene/
                      Hi Michael
                      Thanks for your suggestions! I am dealing with multiple conditions of RNAseq data and would like to identify differential expressed genes between four conditions. I used the contrast function as following:

                      results_condition1v2 <- results(ddsTime, contrast=c("condition","condition1","condition2"))

                      results_condition2v3 <- results(ddsTime, contrast=c("condition","condition2","condition3"))
                      ......
                      However, I got the same baseMean, pvalue and padj values for different comparisons. Should I perform pair-wised comparisons and LRT separately for each pair?

                      Comment


                      • #12
                        hi jutos,

                        Can you post in a new thread, as your question doesn't seem to be about time series? It's helpful for the answer-ers to separate questions on different topics.

                        Can you say exactly what you want to compare? Do you want to compare all pairs of the 4, so B vs A, C vs A, etc.? If you can post the top of the results table, including the LFC information printed above the results table, that would help to explain things.

                        Comment


                        • #13
                          visualisation of time-series exp.

                          Hi,

                          I am new to the forum and was wondering if i could ask additional question regarding time-series RNA-seq data.

                          I was just wondering if anyone could explain, or redirect me to where I can get more information about using spline to model the counts as a smooth function over time.

                          It is mentioned in http://www.bioconductor.org/help/workflows/rnaseqGene/ mentioned above.

                          "Another option would be to model the counts as a smooth function of time, and to include an interaction term of the condition with the smooth function. It is possible to build such a model using spline basis functions within R."

                          I looked briefly but could not find much information about it.

                          I am looking at change in transcripts over a day, with no baseline time point (although i use my first sampling point at 8am as time 0) and no control vs treatment either.

                          Thanks in advance

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          22 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          24 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          20 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          52 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X