Seqanswers Leaderboard Ad

**Wolfgang Huber** · 03-06-2014, 11:26 AM

Dear Luise

my advice here would be to put less emphasis on the testing (your point 1.) and move straight to point 2. (clustering), using one of the transformations described in the DESeq2 vignette to bring the data to a 'well-behaved' (log-like) scale.

To filter out the genes that vary not much, use the range (max-min) or IQR and a subjective cutoff (e.g. retain the top 20% of genes), then use standard clustering functions (e.g. pam from the cluster package), explore the temporal profiles of the clusters and other exploratory data analyses (e.g. PCA) to see the types of behaviours.

The cutoff choice is the one variable in here where you may be a bit unsure, but the same problem arises also with testing (you need e.g. an FDR cutoff), and in fact the best to overcome it is to just do it for a couple of different choices and see that the major temporal profiles are the same.

Hope this helps
Wolfgang

**liuse** · 03-06-2014, 01:41 PM

Thank you very much for the reply Wolfgang.

BTW, I just realized I forgot to mention that I have biological duplicate data for each time point. If taking this into consideration, do you still think that I should not use the pairwise DESeq2 test (my point 1)?

If yes, could you maybe tell me more about why, or what you think that direct going into clustering is better?

I actually have done the individual pairwise test by
1)extract raw read counts of one pair (e.g. 0hr vs 1hr)
2)use DESeq2 differentially expressed test for each pair
3)set padj threshold to 0.05, and fold-change > 1.5 or < 0.66 for filtering
4)repeat step 1~3 for other pairs of data (e.g. 0hr vs 2hr; 0hr vs 4hr.......)
5)collect genes at least having one differentially expressed value (fold-change)
6)post-analyze those collected genes (I used the fold-changes provided from DESeq2 results to make the gene expression profile across time)

By doing these roughly 4k genes are remained (out of 25k from de novo assembled transcriptome) for post-analyses, which makes me think that most of the genes don't change so much (amplitude too low) or the changes are not reliable (padj too high). That's the major reason why I might want to keep using DESeq2 in my analytical pipeline.

Well, if you agree with my above points, then the question comes back to "how I should conduct the DESeq2 test to get proper analysis". Is the pre-extracting data strategy followed by DESeq2 test OK, or should I use whole data containing all time points for dds() test and then extract the paired results later? Can I use the fold-changes from DESeq2 for following shape-based analyses?

Really thank you for your patience.

Best,
Shang-Yun

**Michael Love** · 03-07-2014, 06:33 AM

hi Shang-Yun,

The choice of clustering and/or testing is up to you, there is not always one 'proper analysis'.

From your goals it sounds like clustering and looking for profiles was the main goal, and you don't need to remove genes which don't vary across time in order to perform clustering. However, it is a good idea to stabilize the variance as Wolfgang points out.

If you do want to do testing to find genes which change over time, I would recommend the likelihood ratio test which is explained in the vignette. This would test whether a design of '~ time' explains more than expected by chance than '~ 1', just using an intercept. In other words, this gives you the genes which have some statistically significant differences at any time point.

**liuse** · 03-07-2014, 06:56 AM

Hi Michael,
many thanks for your reply and suggestion. I'll for sure try the likelihood ratio test to see what I can get after that.

The purpose why I removed those genes which show no differential expression at any time point is because I do observe lots of noisy expression profiles, even after variance-stabilized transformation. If I keep those noisy profiles in the clustering analysis and GO-term enrichment, they clearly make the clustering result terrible, and increase the difficulty in GO-term enrichment test.

Thanks!!

**Kwasi** · 11-21-2014, 06:52 AM

Back to the time course analysis question

Hi Michael and Wolfgang,

Apologies for going back to this old point but I have the same question as liuse. I have RNA-Seq time course data with controls for each time point and 3-4 biological replicates per group. I have already done the differential expression analysis and I got great results with biologically sensible patterns clearly evident across the time course.
My concern is that like liuse, I did my analysis by subsetting the different time points prior to calculating dispersions, and only looked at pairwise comparisons between control and treated samples from the same time period. One reason I felt this was justified was because there seemed to be an effect of time on gene expression as evinced by the attached image showing clustering of just the control samples.
However, I now have metadata on the dataset and I would like to see the effect of some of these other variables on gene expression using the likelihood ration test function. Unfortunately, if I want to be consistent I would have to run the LRT for each time point and then perhaps compare to find which genes are significant across all time points versus which genes which are time point-specific. The other times I have used the LRT, it has been with dispersion calculated on the whole dataset (I had no time point-specific controls at the time) so I'm not sure if doing individual time points is a valid approach with the LRT or if it would be better to calculate the dispersion on the whole dataset and compare relevant time points using contrasts and then proceed from there. What do you recommend?

Regards,
Kwasi

Attached Files

PCA_controls_only.pdf (6.5 KB, 45 views)

**Michael Love** · 11-21-2014, 06:58 AM

We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

rnaseqGene

http://bioconductor.org/help/workflows/rnaseqGene/

Here we walk through an end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.

**Kwasi** · 11-21-2014, 08:38 AM

Many thanks for the prompt response Michael!

**kjaja** · 12-29-2014, 01:54 PM

DESeq2

I read the example of a basic analysis of a time-series experiment at the bottom of this workflow:

rnaseqGene

http://bioconductor.org/help/workflows/rnaseqGene/

Here we walk through an end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.

the fission data had replicates for both mutant and wild type, and was wondering in experiments with no replicates, how the script changes if we want to answer the same question?

thanks

Originally posted by Michael Love View Post

We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

http://bioconductor.org/help/workflows/rnaseqGene/

**Michael Love** · 01-02-2015, 02:09 AM

It is possible to test for condition-specific changes over time without replicates with DESeq2 (or edgeR or limma). Wolfgang wrote up a nice description here: https://support.bioconductor.org/p/61884/#61895

For example, one can use the bs function from the splines library to form splines using the time variable: https://stat.ethz.ch/R-manual/R-deve...s/html/bs.html, and one can include an interaction of the spline and the condition, and use the likelihood ratio test with a reduced design which does not include the interaction term.

This requires additional R skills on the part of the user, so we recommend only for users who are familiar with time series analysis in R and fitting splines.

**jutos** · 01-13-2015, 01:40 PM

Originally posted by Michael Love View Post

We recommend running DESeq() on the whole dataset, and then using the contrast argument of results() to build specific results tables. We have an example of a basic analysis of a time-series experiment at the bottom of this workflow:

http://bioconductor.org/help/workflows/rnaseqGene/

Hi Michael
Thanks for your suggestions! I am dealing with multiple conditions of RNAseq data and would like to identify differential expressed genes between four conditions. I used the contrast function as following:

results_condition1v2 <- results(ddsTime, contrast=c("condition","condition1","condition2"))

results_condition2v3 <- results(ddsTime, contrast=c("condition","condition2","condition3"))
......
However, I got the same baseMean, pvalue and padj values for different comparisons. Should I perform pair-wised comparisons and LRT separately for each pair?

**Michael Love** · 01-14-2015, 06:49 AM

hi jutos,

Can you post in a new thread, as your question doesn't seem to be about time series? It's helpful for the answer-ers to separate questions on different topics.

Can you say exactly what you want to compare? Do you want to compare all pairs of the 4, so B vs A, C vs A, etc.? If you can post the top of the results table, including the LFC information printed above the results table, that would help to explain things.

**somi06** · 03-19-2015, 03:10 AM

visualisation of time-series exp.

Hi,

I am new to the forum and was wondering if i could ask additional question regarding time-series RNA-seq data.

I was just wondering if anyone could explain, or redirect me to where I can get more information about using spline to model the counts as a smooth function over time.

It is mentioned in http://www.bioconductor.org/help/workflows/rnaseqGene/ mentioned above.

"Another option would be to model the counts as a smooth function of time, and to include an interaction term of the condition with the smooth function. It is possible to build such a model using spline basis functions within R."

I looked briefly but could not find much information about it.

I am looking at change in transcripts over a day, with no baseline time point (although i use my first sampling point at 8am as time 0) and no control vs treatment either.

Thanks in advance

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

DESeq2 question in dealing with time course data.

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News