SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
infinite fold change RNAseq biofreak Bioinformatics 6 07-02-2012 10:02 AM
DESeq DiffExpress Vs. Fold Change KellerMac Bioinformatics 3 06-10-2011 10:49 AM
fold change value-cuffdiff madsaan Bioinformatics 4 02-10-2011 06:51 AM
computing FOLD CHANGES bogdan RNA Sequencing 1 01-31-2011 01:50 PM
fold change for genes with 0 reads rnaseq General 2 11-04-2010 07:44 PM

Reply
 
Thread Tools
Old 11-03-2010, 11:36 AM   #1
rnaseq
Junior Member
 
Location: washington

Join Date: Dec 2009
Posts: 7
Default fold change for genes with 0 reads

I am comparing Illumina RNA-seq data sets for differentially expressed genes. A significant number of genes have 0 reads in one of the two conditions. What is the best way to include these genes? Is it acceptable to subsitute the 0 in the numerator or denominator with another number to get an approximate idea of the fold change?

Thanks!
rnaseq is offline   Reply With Quote
Old 11-03-2010, 01:13 PM   #2
lmf_bill
Member
 
Location: New Haven

Join Date: Jul 2008
Posts: 36
Default

you should be careful of these genes. In my points, you do not need calculate the fold change. You can split these cases into two situations: one condition is larger or smaller than threshold, e.g. gene RPKM>=5 (one Nature paper uses this scale). For the smaller, it is nothing, while the larger is significant different.
lmf_bill is offline   Reply With Quote
Old 11-03-2010, 01:46 PM   #3
rnaseq
Junior Member
 
Location: washington

Join Date: Dec 2009
Posts: 7
Default

Thanks for the reply lmf_bill!

I am doing a time-course comparison and typically the raw reads under one condition are 0 and in another condition are about 300 or higher so I know that there is at least a 300 fold increase in expression which is significant. I was wondering if there was a way to report this (approximate) numerical value rather than binning it as larger or smaller.

I appreciate any additional feedback!
rnaseq is offline   Reply With Quote
Old 11-03-2010, 03:51 PM   #4
jdanderson
Member
 
Location: Davis, CA

Join Date: Sep 2010
Posts: 45
Default

Hello all,

I am also very interested in everyone thinks about this as well. I've been debating on changing the zeros to ones to get some usable output, especially in cases where the denominator is zero (but not the numerator) because the output is undefined.

Any suggestions?

Cheers,
Johnathon
jdanderson is offline   Reply With Quote
Old 11-03-2010, 04:06 PM   #5
zhengz
Member
 
Location: sweden

Join Date: Aug 2010
Posts: 23
Default

What about zero-inflated count models?

I once checked the histograms of many variables and many did not look like follow Poisson distribution or negative binomial distribution, as there were excessive number of zoers.
zhengz is offline   Reply With Quote
Old 11-03-2010, 04:27 PM   #6
jdanderson
Member
 
Location: Davis, CA

Join Date: Sep 2010
Posts: 45
Default

Hello Zhengz,

Thank you for your reply. I think part of the problem with my approach to this matter is my general lack of statistical savvy. Could you perhaps provide a good starting point or primer (or a useful open source tool) that could help explain this type of modeling to those of us who are naive in this area? "Zero-inflated count models" sounds very interesting.

Cheers,
Johnathon
jdanderson is offline   Reply With Quote
Old 11-03-2010, 11:58 PM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

You don't need a "zero-inflated count model". The negative binomial (NB) model used, e.g., by edgeR, DESeq, BaySeq, is just fine. "Zero-inflated" means that you see zeroes more often than an NB (or Poisson) distribution would predict. This is an issue in some applications of statistics but I don't think we need it here.

For the question of reporting it: Well ,a fold change estimate of 300:0 is infinite, but if you don't like this, you can still say, it is ">300". In the end, what you want is some kind of confidence interval: If you have few counts, an observed count ratio of, say, 5:1, can be caused by a real expression ratio of anything from, say, 2:1 to 10:1, while, if the count ratio was 500:100, you can be way more sure that the estimate of 5:1 is close to the real value. Hence, a fold-change estimate should be supplemented by an indication of its precision (which, in RNA-Seq, would stringly depend on the number of reads the estimate is based on).

There have been a few methods to get such confidence intervals for microarrays, but, to my knowledge, nothing is available yet for RNA-Seq data.

Simon
Simon Anders is offline   Reply With Quote
Old 11-04-2010, 12:53 AM   #8
zhengz
Member
 
Location: sweden

Join Date: Aug 2010
Posts: 23
Default

It would be nice to see some references addressing when to use zero-inflated NB and when ordinary NB.

UCLA Statistical Computing is a very nice stitistics source (http://www.ats.ucla.edu/stat/stata/dae/zinb.htm).

"Some Strategies You Might Be Tempted To Try
Before we show how you can analyze this with a zero-inflated negative binomial analysis, let's consider some other methods that you might use.
OLS Regression - You could try to analyze these data using OLS regression. However, count data are highly non-normal and are not well estimated by OLS regression.
Zero-inflated Poisson Regression - Zero-inflated Poisson regression does better when the data is not overdispersed, i.e. variance much larger than the mean.
Ordinary Count Models - Poisson or negative binomial models might be more appropriate if there are not excess zeros. "


Hi Johnathon, unfortunately, I have not seen an R package for this. It may be "just fine" to use ordinary NB for some reviewers. I am definetely looking forward to seeing it be integreated in one of the open source packages. It would be even nicer to have the generlized estimated equation (GEE) or robust cluster variance function integreated with zero-inflated NB/Poisson models.
zhengz is offline   Reply With Quote
Old 11-04-2010, 01:33 AM   #9
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi

Quote:
Originally Posted by zhengz View Post
What about zero-inflated count models?

I once checked the histograms of many variables and many did not look like follow Poisson distribution or negative binomial distribution, as there were excessive number of zoers.
What kind of histogram did you look at?

Just to make sure: If you take all the count values for the different genes in one sample and plot a histogram of these, there is no reason for it to look like NB, and many zeroes are totally expected.

What I am talking about is the distribution of count values of one and the same gene in many different samples. This is what is expected to follow an NB. Unfortunately, this claim is made from theoretical reasoning, because no one has yet performed RNA-Seq experiments with hundreds of replicates, and with only a handful of numbers for each gene, you cannot estimate a distribution. Consequently, it would be hard to check whether there is an excess of zeroes, and hence, I suggest to assume there is not -- not only to keep thing simple but also because it is hard to see a mechanistic cause for a zero excess.

Simon
Simon Anders is offline   Reply With Quote
Old 11-04-2010, 07:43 AM   #10
mrawlins
Member
 
Location: Retirement - Not working with bioinformatics anymore.

Join Date: Apr 2010
Posts: 63
Default

What we've used for time-course models is doing a pair-wise comparison of each time point with the control (time 0) after quantile normalization. We've had a lot of success with doing a Fisher Exact Test for our pair-wise comparisons (it's faster to do a chi-square approximation of the FET, but it isn't quite as good). For Illumina and SOLiD reads the hypergeometric distribution that the FET is based on describes the data really well after quantile normalization. Before normalization it doesn't work as well, and more length bias remains. In some samples we've seen a complete elimination of the length bias using this method, FWIW. The method has no problems determining if a zero-count sample is different from an n-count sample.

We also find this method needs some multiple hypothesis testing correction. We've had pretty good luck with the Bonferroni (Sidak) correction despite the fact it's really conservative. FDRs are commonly used in microarray data, and would probably work better than Bonferroni, but are a little more involved to calculate.

This and other methods for pair-wise comparison can be found in Bullard et al. (2010) in BMC Bioinformatics. I only tweaked one of their methods to get ours.
mrawlins is offline   Reply With Quote
Old 11-04-2010, 08:56 AM   #11
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by mrawlins View Post
This and other methods for pair-wise comparison can be found in Bullard et al. (2010) in BMC Bioinformatics. I only tweaked one of their methods to get ours.
The problem with this paper is that they don't discuss whether the compared methods control type-I error correctly, or to be more precise: The paper assumes that you want to test whether the expression strength in two samples is different, but usually, you rather want to test whether the change may be attributed to the difference in experimental treatment. (Ok, I know, I'm repeating myself.)

Simon
Simon Anders is offline   Reply With Quote
Old 11-04-2010, 08:59 AM   #12
chqilin
Junior Member
 
Location: Canada

Join Date: Jul 2010
Posts: 1
Default

I am fascinated by the comments posted here. I would thank everybody who helps later when I have question about NGS, because you are the professionals.
chqilin is offline   Reply With Quote
Old 11-04-2010, 09:15 AM   #13
rnaseq
Junior Member
 
Location: washington

Join Date: Dec 2009
Posts: 7
Default

Hi Simon Anders,

Thank you very much for the insight and for clarifying things. I think you've addressed all the follow up questions I had about the zero-inflated model. My inclination was to report these genes as >300 (0 reads vs 300 reads). I agree it might be better to indicate the precision as well.

I appreciate all the feedback.
rnaseq is offline   Reply With Quote
Old 11-04-2010, 12:34 PM   #14
zhengz
Member
 
Location: sweden

Join Date: Aug 2010
Posts: 23
Default

It was microbiota composition data - histograms of each OTU (suppose this is the same as what you referred to as each gene) for groups of tens of samples. (OK, not RNA-seq data) But I think for people doing 16S rRNA, an R package with zero-inflated model would be useful.

In cases of just a few replicates, certainly no way to check zero excess. But why use NB then (if this was what you meant to do)? Isn't there a simpler way?
zhengz is offline   Reply With Quote
Old 11-04-2010, 07:20 PM   #15
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 117
Default

Dealing with 0 counts in comparisons across conditions was a very common topic of debate in the good old days of SAGE analysis. Perhaps there are some insights in that literature. Thankfully I didn't have to analyze that data but perhaps a SAGE aficionado will jump in here...
malachig is offline   Reply With Quote
Old 11-05-2010, 01:23 AM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by zhengz View Post
In cases of just a few replicates, certainly no way to check zero excess. But why use NB then (if this was what you meant to do)? Isn't there a simpler way?
Not really. A lot of people assume a Poisson distribution (implicitly, by using Fisher's exact test) and then, for genes with high count rates, everything looks significant. You simply need to assess the biological variance (and for few replicates, you can only do so by pooling information over genes) and add it to the shot noise from the Poisson. NB is certainly the simplest way to to that.

Simon
Simon Anders is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:54 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO