SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DESeq2 DataSetFromMatrix question MDonlin RNA Sequencing 12 02-16-2016 11:41 PM
RNAseq time series data w/ controls for each time point Mocca RNA Sequencing 2 08-07-2013 11:12 PM
CASIM: Variants Dealing with continual addition of data casim UK - Cambridge 0 04-18-2013 12:35 PM
Dealing with color-space RNA-seq data (SOLiD) luciaUY Bioinformatics 0 06-21-2012 09:15 AM
Dealing with Next Generation Data: AAAS Joann Literature Watch 5 02-17-2011 09:27 PM

Reply
 
Thread Tools
Old 03-05-2014, 12:31 AM   #1
liuse
Junior Member
 
Location: Dresden, Germany

Join Date: Mar 2014
Posts: 3
Default DESeq2 question in dealing with time course data.

Dear all,
I have a general question regarding my time course RNAseq data analysis. I am using a kind of worm which can regenerate the missing body part, and I would like to know the significantly changing genes and expression profiles during regeneration.

I have 11 time points: 0hr (as control), 1hr post injury, 2hr, 4hr, 6hr,....., until 48hr. The strategy in my mind is to do the pairwise comparison between "0hr and 1hr", "0hr and 2hr", "0hr and 4hr"......, "0hr and 48hr" using DESeq2 , and filter out those genes which don't show any change across all time points compared to 0hr.


My questions are:

1. Should I individually pre-extract the pairwise datasets (raw read count data) and do the DESeq2 test (dds function) separately? or I can do the DESeq2 test first based on whole raw read count data containing all time points and later extract the pairs I want (0hr vs 1hr, 0hr vs 2hr......). The confusing point is that since dds function estimates the size factor, calculate gene dispersion, create the model and test the model fitting, I wonder if using pre-extracted pair of raw read count data and using whole time points can result in different estimation and calculation (and maybe also padj due to the different testing sample size).


2. If I also want to analyze the post-calculating fold-change relatively to time 0 of differentially expressed genes from DESeq2 results (e.g. to do the clustering), should I use the providing fold-changes from dds&res function or use the variance-stabilizing transformation?


My goal to deal with those data is to first extract genes with at least one differential expression (e.g. use padj < 0.05) across all time points (filter out those showing no changes relative to control). Second, I want to apply the fold-change threshold (e.g. FC > 1.5 or FC < 0.5) to further extract highly changing gene profiles. Third, I want to use the padj&FC-extracted subset of profiles for clustering, GO-term enrichment test, and other analyses.


I am really new. So any opinion would be helpful.
Thank you.
liuse is offline   Reply With Quote
Old 03-06-2014, 10:26 AM   #2
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Default

Dear Luise

my advice here would be to put less emphasis on the testing (your point 1.) and move straight to point 2. (clustering), using one of the transformations described in the DESeq2 vignette to bring the data to a 'well-behaved' (log-like) scale.

To filter out the genes that vary not much, use the range (max-min) or IQR and a subjective cutoff (e.g. retain the top 20% of genes), then use standard clustering functions (e.g. pam from the cluster package), explore the temporal profiles of the clusters and other exploratory data analyses (e.g. PCA) to see the types of behaviours.

The cutoff choice is the one variable in here where you may be a bit unsure, but the same problem arises also with testing (you need e.g. an FDR cutoff), and in fact the best to overcome it is to just do it for a couple of different choices and see that the major temporal profiles are the same.

Hope this helps
Wolfgang
__________________
Wolfgang Huber
EMBL
Wolfgang Huber is offline   Reply With Quote
Old 03-06-2014, 12:41 PM   #3
liuse
Junior Member
 
Location: Dresden, Germany

Join Date: Mar 2014
Posts: 3
Default

Thank you very much for the reply Wolfgang.

BTW, I just realized I forgot to mention that I have biological duplicate data for each time point. If taking this into consideration, do you still think that I should not use the pairwise DESeq2 test (my point 1)?

If yes, could you maybe tell me more about why, or what you think that direct going into clustering is better?

I actually have done the individual pairwise test by
1)extract raw read counts of one pair (e.g. 0hr vs 1hr)
2)use DESeq2 differentially expressed test for each pair
3)set padj threshold to 0.05, and fold-change > 1.5 or < 0.66 for filtering
4)repeat step 1~3 for other pairs of data (e.g. 0hr vs 2hr; 0hr vs 4hr.......)
5)collect genes at least having one differentially expressed value (fold-change)
6)post-analyze those collected genes (I used the fold-changes provided from DESeq2 results to make the gene expression profile across time)

By doing these roughly 4k genes are remained (out of 25k from de novo assembled transcriptome) for post-analyses, which makes me think that most of the genes don't change so much (amplitude too low) or the changes are not reliable (padj too high). That's the major reason why I might want to keep using DESeq2 in my analytical pipeline.

Well, if you agree with my above points, then the question comes back to "how I should conduct the DESeq2 test to get proper analysis". Is the pre-extracting data strategy followed by DESeq2 test OK, or should I use whole data containing all time points for dds() test and then extract the paired results later? Can I use the fold-changes from DESeq2 for following shape-based analyses?


Really thank you for your patience.

Best,
Shang-Yun
liuse is offline   Reply With Quote
Old 03-07-2014, 05:33 AM   #4
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi Shang-Yun,

The choice of clustering and/or testing is up to you, there is not always one 'proper analysis'.

From your goals it sounds like clustering and looking for profiles was the main goal, and you don't need to remove genes which don't vary across time in order to perform clustering. However, it is a good idea to stabilize the variance as Wolfgang points out.

If you do want to do testing to find genes which change over time, I would recommend the likelihood ratio test which is explained in the vignette. This would test whether a design of '~ time' explains more than expected by chance than '~ 1', just using an intercept. In other words, this gives you the genes which have some statistically significant differences at any time point.
Michael Love is offline   Reply With Quote
Old 03-07-2014, 05:56 AM   #5
liuse
Junior Member
 
Location: Dresden, Germany

Join Date: Mar 2014
Posts: 3
Default

Hi Michael,
many thanks for your reply and suggestion. I'll for sure try the likelihood ratio test to see what I can get after that.

The purpose why I removed those genes which show no differential expression at any time point is because I do observe lots of noisy expression profiles, even after variance-stabilized transformation. If I keep those noisy profiles in the clustering analysis and GO-term enrichment, they clearly make the clustering result terrible, and increase the difficulty in GO-term enrichment test.

Thanks!!
liuse is offline   Reply With Quote
Old 11-21-2014, 05:52 AM   #6
Kwasi
Junior Member
 
Location: Cambridge, UK

Join Date: May 2014
Posts: 4
Default Back to the time course analysis question

Hi Michael and Wolfgang,

Apologies for going back to this old point but I have the same question as liuse. I have RNA-Seq time course data with controls for each time point and 3-4 biological replicates per group. I have already done the differential expression analysis and I got great results with biologically sensible patterns clearly evident across the time course.
My concern is that like liuse, I did my analysis by subsetting the different time points prior to calculating dispersions, and only looked at pairwise comparisons between control and treated samples from the same time period. One reason I felt this was justified was because there seemed to be an effect of time on gene expression as evinced by the attached image showing clustering of just the control samples.
However, I now have metadata on the dataset and I would like to see the effect of some of these other variables on gene expression using the likelihood ration test function. Unfortunately, if I want to be consistent I would have to run the LRT for each time point and then perhaps compare to find which genes are significant across all time points versus which genes which are time point-specific. The other times I have used the LRT, it has been with dispersion calculated on the whole dataset (I had no time point-specific controls at the time) so I'm not sure if doing individual time points is a valid approach with the LRT or if it would be better to calculate the dispersion on the whole dataset and compare relevant time points using contrasts and then proceed from there. What do you recommend?

Regards,
Kwasi
Attached Files
File Type: pdf PCA_controls_only.pdf (6.5 KB, 45 views)
Kwasi is offline   Reply With Quote
Old 11-21-2014, 05:58 AM   #7
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

http://bioconductor.org/help/workflows/rnaseqGene/
Michael Love is offline   Reply With Quote
Old 11-21-2014, 07:38 AM   #8
Kwasi
Junior Member
 
Location: Cambridge, UK

Join Date: May 2014
Posts: 4
Default

Many thanks for the prompt response Michael!
Kwasi is offline   Reply With Quote
Old 12-29-2014, 12:54 PM   #9
kjaja
Member
 
Location: NY

Join Date: Aug 2011
Posts: 55
Default DESeq2

I read the example of a basic analysis of a time-series experiment at the bottom of this workflow:
http://bioconductor.org/help/workflows/rnaseqGene/

the fission data had replicates for both mutant and wild type, and was wondering in experiments with no replicates, how the script changes if we want to answer the same question?

thanks




Quote:
Originally Posted by Michael Love View Post
We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

http://bioconductor.org/help/workflows/rnaseqGene/
kjaja is offline   Reply With Quote
Old 01-02-2015, 01:09 AM   #10
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

It is possible to test for condition-specific changes over time without replicates with DESeq2 (or edgeR or limma). Wolfgang wrote up a nice description here: https://support.bioconductor.org/p/61884/#61895

For example, one can use the bs function from the splines library to form splines using the time variable: https://stat.ethz.ch/R-manual/R-deve...s/html/bs.html, and one can include an interaction of the spline and the condition, and use the likelihood ratio test with a reduced design which does not include the interaction term.

This requires additional R skills on the part of the user, so we recommend only for users who are familiar with time series analysis in R and fitting splines.

Last edited by Michael Love; 01-02-2015 at 02:25 AM.
Michael Love is offline   Reply With Quote
Old 01-13-2015, 12:40 PM   #11
jutos
Junior Member
 
Location: canada

Join Date: Jan 2010
Posts: 5
Default

Quote:
Originally Posted by Michael Love View Post
We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

http://bioconductor.org/help/workflows/rnaseqGene/
Hi Michael
Thanks for your suggestions! I am dealing with multiple conditions of RNAseq data and would like to identify differential expressed genes between four conditions. I used the contrast function as following:

results_condition1v2 <- results(ddsTime, contrast=c("condition","condition1","condition2"))

results_condition2v3 <- results(ddsTime, contrast=c("condition","condition2","condition3"))
......
However, I got the same baseMean, pvalue and padj values for different comparisons. Should I perform pair-wised comparisons and LRT separately for each pair?
jutos is offline   Reply With Quote
Old 01-14-2015, 05:49 AM   #12
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi jutos,

Can you post in a new thread, as your question doesn't seem to be about time series? It's helpful for the answer-ers to separate questions on different topics.

Can you say exactly what you want to compare? Do you want to compare all pairs of the 4, so B vs A, C vs A, etc.? If you can post the top of the results table, including the LFC information printed above the results table, that would help to explain things.
Michael Love is offline   Reply With Quote
Old 03-19-2015, 03:10 AM   #13
somi06
Junior Member
 
Location: Middle East

Join Date: Nov 2012
Posts: 3
Default visualisation of time-series exp.

Hi,

I am new to the forum and was wondering if i could ask additional question regarding time-series RNA-seq data.

I was just wondering if anyone could explain, or redirect me to where I can get more information about using spline to model the counts as a smooth function over time.

It is mentioned in http://www.bioconductor.org/help/workflows/rnaseqGene/ mentioned above.

"Another option would be to model the counts as a smooth function of time, and to include an interaction term of the condition with the smooth function. It is possible to build such a model using spline basis functions within R."

I looked briefly but could not find much information about it.

I am looking at change in transcripts over a day, with no baseline time point (although i use my first sampling point at 8am as time 0) and no control vs treatment either.

Thanks in advance
somi06 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:18 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO