amcloon 04-29-2013 08:46 AM

differential gene expression and variance issues
I have 5 time points with 2 biological replicates (collected and prepared on separate days following exactly the same protocol) of bacteria during starvation-induced development. I've analyzed the data using CLC Genomics Workbench and tophat-cufflinks-cuffdiff (yes, I realize I probably only need bowtie for bacteria, but I figured that looking for nonexistent splice junctions would only cost computational time and shouldn't change anything).

My problem is this: there are a number of genes that I know are differentially regulated (previously published, and validated by me by qPCR) that go up many fold (one example goes from 50 RPKM to about 4000), but both programs say they are not statistically significantly regulated because there is high variability between replicates.

Instead, the genes reported as statistically significantly regulated are expressed at very low levels and have neither as much variability nor a very large fold change in either direction (from 20 to 2 RPKM, for example). These seem less likely to be interesting biologically.

So my question is, am I going to be able to get anything statistically valid out of this data, or if there's a lot of variation am I just out of luck? I am sure I could just cherry-pick genes for future work, but that seems like a waste of data.

If I try DESeq, will I just have the same problem in a different format, or might the different ways the programs analyze the data change the way statistics are calculated?


Simon Anders 04-30-2013 04:00 AM

If you want to know whether DESeq will give you the same answer, you will just have to try.

As for the qPCR validation: Have you only validated that the gene goes up in one replicate, or have you also validated that the variance is low by performing your qPCR on the time points of the second replicate, too?

amcloon 04-30-2013 06:15 AM

I didn't do qPCR validation of the second data set, but if I do parallel analyses for each set of replicates (at least in the CLC software), I do see up-regulation of a number of known genes within each replicate set of timepoints. There is a bit of variation in timing, etc., but the genes I expect to go up do go up.

The problem comes when I try to do statistics: the large variance in levels between the replicates makes the p-values really big for most of my "known" up-regulated genes.
I'm considering whether I need to do some sort of paired comparison, but then I'm not sure whether I'll have to do separate analyses for each timepoint, comparing each timepoint to 0 h. And if I do that, do I have to apply an even more severe significance correction, since I'm effectively doing 4 separate tests? I wish I'd taken statistics more recently than 10 years ago.
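
On the worry about doing 4 separate tests: one option, sketched here as an assumption rather than as what any of the tools above does for this design, is to pool the p-values from all four comparisons against 0 h and apply a single Benjamini-Hochberg adjustment across the pooled list. The gene names, time labels, and p-values below are made up purely for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values: 2 genes x 4 comparisons, each against 0 h.
raw = {
    ("geneA", "t1_vs_0h"): 0.001,
    ("geneA", "t2_vs_0h"): 0.004,
    ("geneA", "t3_vs_0h"): 0.020,
    ("geneA", "t4_vs_0h"): 0.030,
    ("geneB", "t1_vs_0h"): 0.400,
    ("geneB", "t2_vs_0h"): 0.250,
    ("geneB", "t3_vs_0h"): 0.060,
    ("geneB", "t4_vs_0h"): 0.700,
}

keys = list(raw)
adjusted = dict(zip(keys, benjamini_hochberg([raw[k] for k in keys])))

for key in keys:
    print(key, round(raw[key], 4), round(adjusted[key], 4))
```

Pooling before adjusting means the correction already accounts for every gene-by-timepoint test, so no extra penalty is stacked on top for having four comparisons.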

On a partly unrelated note, the more I look through my data, the more I feel like cufflinks/cuffdiff is just not ideal for bacterial genomes. I feel like it doesn't deal well with the whole "many genes are in operons" issue. Has anyone else had experience with this and did you find something better?

And are there any programs that don't lump sense and antisense transcripts when counting reads mapping to a particular genomic region (also a somewhat bacteria-specific problem, I think)?
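
To make the sense/antisense question concrete, here is a toy sketch of strand-aware counting. It assumes a strand-specific library prep, represents each read by a single position and strand, and uses a made-up two-gene annotation; it is not how any particular program does it. (htseq-count, for one, has a `--stranded` option that works along these lines.)

```python
from collections import Counter

# Hypothetical annotation: (gene_id, start, end, strand), 1-based inclusive.
GENES = [
    ("geneA", 100, 500, "+"),
    ("geneB", 450, 900, "-"),
]


def count_stranded(reads, genes=GENES):
    """Count reads per gene, keeping sense and antisense reads apart.

    Each read is a (position, strand) tuple; a read overlapping two genes
    is counted once for each of them (a simplification).
    """
    counts = Counter()
    for pos, strand in reads:
        for gene_id, start, end, gene_strand in genes:
            if start <= pos <= end:
                kind = "sense" if strand == gene_strand else "antisense"
                counts[(gene_id, kind)] += 1
    return counts


# Made-up reads; the two at position 470 fall in the geneA/geneB overlap.
reads = [(120, "+"), (130, "-"), (470, "+"), (470, "-"), (800, "-")]
counts = count_stranded(reads)

for key in sorted(counts):
    print(key, counts[key])
```

Keeping the two categories separate means antisense transcription over a gene cannot inflate that gene's apparent expression.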

Simon Anders 04-30-2013 06:32 AM

Yes, when a paired analysis is warranted, it can have much more power than a naive one. Then, you have to use a tool like DESeq, because cuffdiff does not offer functionality for designs that go beyond a two-group comparison.
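
A toy illustration of why pairing helps, using plain t statistics on made-up log2 expression values for one gene. This is only a sketch of the principle: it is not what DESeq computes internally, and with two replicates a real t-test would have almost no degrees of freedom. The numbers are chosen so that the two replicates differ a lot in overall level but agree closely on the change over time.

```python
import math
from statistics import mean, stdev


def unpaired_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def paired_t(a, b):
    """Paired t statistic on within-replicate differences.

    Fails with a division by zero if all differences are identical.
    """
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical log2 expression of one gene; index 0 = replicate 1, 1 = replicate 2.
log2_start = [5.6, 7.6]    # 0 h
log2_late = [11.0, 13.1]   # a later time point

print("unpaired t:", round(unpaired_t(log2_start, log2_late), 2))
print("paired t:  ", round(paired_t(log2_start, log2_late), 2))
```

The unpaired statistic is held down by the replicate-to-replicate offset, while the paired one only sees the spread of the within-replicate changes, which is small here.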

amcloon 04-30-2013 06:44 AM

Thanks, Simon, I'll give DESeq a try.

Illuminoid 05-07-2013 03:07 AM

Hi amcloon

I am interested in the outcome of your analysis with DESeq, since I have a similar issue with multiple-timepoint analysis and variability between samples.

Did you end up using the paired analysis, or staying with single analyses comparing everything to time zero?



