Gene expression levels – underlying distribution
Hi there,
I am trying to model the expression level of genes using RNA-seq. I was wondering if there is any statistical analysis that identified the underlying distribution of expression levels for next-generation sequencing. In the microarray days, it was assumed that they are Normally distributed; in other words, expression levels could be synthetically generated using a Gaussian distribution. Is there any study out there that establishes the same thing for next-generation sequencing? I'd appreciate any help, links to related papers, or hints. Thanks!
I am not aware of any software, but I thought you could start off with QQ plots against various common distributions. Then I would fit each candidate distribution and look at the chi-square statistic to see whether your expression values follow a pattern, and at what significance level.
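For example, a bare-bones version of the QQ idea might look like this (the data here are simulated purely as a stand-in for real expression values; the choice of candidate distribution is up to you):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical expression values for one gene across many samples
# (negative binomial here purely as a stand-in for real data)
x = rng.negative_binomial(10, 0.3, size=500)

# pair sample quantiles with quantiles of a fitted candidate
# distribution (a normal here); other candidates go in the same slot
probs = (np.arange(1, x.size + 1) - 0.5) / x.size
sample_q = np.sort(x)
theo_q = stats.norm.ppf(probs, loc=x.mean(), scale=x.std())

# points on a straight line indicate a good fit; systematic
# curvature suggests trying a different candidate distribution
r = np.corrcoef(sample_q, theo_q)[0, 1]
print(round(r, 3))
```

Plotting `sample_q` against `theo_q` gives the usual QQ plot; the correlation is just a crude one-number summary of its straightness.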

Several methods model expression levels using the negative binomial distribution; see, e.g., DESeq, edgeR, baySeq, ...

Thank you so much for your replies!
Actually, I am not looking for software. I am more interested in theoretical work that studied the properties of data generated by RNA-seq. I will use QQ plots on my samples, though, thanks! About DESeq: I looked at the paper. It is interesting and certainly in the direction I needed, but I wish there were more validation of the negative binomial assumption. Maybe more biological validation ...
Actually there is less to it than you may think. Let me add some remarks, using the notation from our paper:
If the gene of interest has concentration q in your prepared library, the expected count for this gene is proportional to q. Let's fix the units for the concentration such that the expectation is sq, where s is the size factor for this sample. From the mechanics of high-throughput sequencing, it can be seen by theoretical considerations that the distribution of the count K for the gene, given the concentration q, is Poisson (with parameter sq).

Now let's say that the distribution of the concentration Q over biological replicates follows some density with mean µ and variance v = aµ². The variance of K conditioned on Q = q is sq, but once you marginalize out the condition, you get, by the law of total variance, Var K = sµ + s²aµ².

Now we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate that it follows a gamma distribution, because then K will follow a negative binomial distribution, which is easy to handle. We have no evidence that the gamma is the right distribution. In microarray times, people tended to assume a log-normal instead. To my knowledge, there is no data set out there with enough samples to make a QQ plot to check which of these two might be more appropriate. However, simulation shows that misspecification of the distribution for Q does not cause the results to change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.) From microarrays, where we do have data sets with enough samples, we know that the shape of these distributions is not completely off, and which of them to use does not seem to matter too much.

Hence, everybody goes for the negative binomial. This is, by the way, not a new idea: in many other branches of statistics that deal with count data, the NB has been used for decades whenever overdispersion is present.
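If it helps, here is a small simulation of the hierarchy above (all parameter values are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative values (not from any real data set): mean concentration
# mu, dispersion a, and size factor s
mu, a, s = 100.0, 0.1, 1.0

# Q ~ gamma with mean mu and variance a*mu^2 (shape 1/a, scale a*mu)
q = rng.gamma(1 / a, a * mu, size=200_000)
# K | Q=q ~ Poisson(s*q)
k = rng.poisson(s * q)

# law of total variance: Var K = s*mu + s^2 * a * mu^2
print(k.mean())                   # close to s*mu = 100
print(k.var())                    # close to 100 + 0.1 * 100^2 = 1100
print(s * mu + s**2 * a * mu**2)  # 1100.0
```

The simulated variance of K sits far above the Poisson value s·µ = 100, which is exactly the overdispersion the NB captures.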
Thank you Simon for your reply!
I do see your point, and I liked your paper a lot, too.
Hi Simon, could I have a link to your paper?
Apart from that, do you have any program to recommend for quantifying gene expression levels from RNA-seq data? Thanks in advance.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology 2010, 11:R106.

Hi Simon,
Below is the background of my RNA-seq data: I have one set of human kidney samples sequenced by Illumina, 2x50 bp, paired-end. I used the RNA-seq short-read assembler Trinity to assemble the reads, and it generated around 20,000 contigs (transcripts). After obtaining the assembled transcripts, I would like to know their expression levels. Do you mean that I should use Cufflinks to obtain an accurate estimate of expression for the kidney tissue? And if I have assembled transcripts for both kidney and breast tissue, can I just use your DESeq to compare expression between kidney and breast tissue? Thanks.
I am trying to understand the proposed model. After reading your paper and your comments here, I have some questions. The real gene concentration is q, drawn from the distribution of Q, which is chosen to be gamma. You write: "the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq)", so K is assumed to be Poisson. My questions:

1. K models the number of reads mapped to the specific gene, right?

2. If the answer to question 1 is yes, how does K become negative binomial after conditioning on Q and choosing Q to be gamma? ("Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution.") I know that a Poisson with a gamma-distributed mean gives a negative binomial, but isn't it true that K is already assumed to be Poisson? So, finally, what is the distribution of the number of reads mapped to a gene?

3. If K is indeed negative binomial, what is the meaning of the usual negative binomial parameters here? I.e., how are "p" and especially "r" defined? What does it mean to repeat the experiment until the failures happen (the textbook definition of the negative binomial)? In mapping, each read succeeds or fails to map to a gene, but we don't keep trying until we get a certain number of failures or successes. That makes the reasoning for K being negative binomial when modeling the reads mapped to a gene confusing.

It would be highly appreciated if you could explain the meaning of the random variables and the concepts in the above questions.
Somehow, you managed to get everything the wrong way round:
Let's start again. There are two ways the NB comes about:

(a) The definition that you cite, namely the waiting time until r failures happen. This definition is not the general one, as it only makes sense for integer r, and r may be non-integer, too. Also, we do not repeat anything until failure here.

(b) As the marginal distribution of a gamma-Poisson hierarchical model. The latter is what we use. Let Q be a random variable with a gamma distribution with expectation q0 and variance aq0². We draw a random variate q of Q and then set up a random variable K which, conditional on Q = q, is Poisson distributed with mean q (and hence also variance q). If we marginalize out the conditioning of K on q, what do we get? Integrate the product of the densities of K|Q and Q, f_pois(k; q) and f_gamma(q; q0, a), over q from 0 to infinity, and you get the pmf of the NB with mean q0 and dispersion a (i.e., variance q0 + aq0²).

Note that this has nothing to do with waiting times in Bernoulli processes; only if r = 1/a happens to be an integer does it coincide with the distribution you get when waiting for r failures in a Bernoulli process with a success probability p such that q0 = pr/(1-p).
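For the doubtful, the marginalization can be checked numerically. The snippet below integrates the gamma-Poisson product and compares it to the NB pmf in scipy's (n, p) parameterization; the values of q0 and a are arbitrary:

```python
import numpy as np
from scipy import stats, integrate

q0, a = 5.0, 0.5                # illustrative mean and dispersion
shape, scale = 1 / a, a * q0    # gamma with mean q0, variance a*q0^2

def marginal_pmf(k):
    """P(K = k) after integrating the Poisson mean over the gamma."""
    f = lambda q: stats.poisson.pmf(k, q) * stats.gamma.pdf(q, shape, scale=scale)
    val, _ = integrate.quad(f, 0, np.inf)
    return val

# scipy's NB parameters in terms of mean q0 and dispersion a:
# n = 1/a, p = 1/(1 + a*q0), so that mean = q0, var = q0 + a*q0^2
n, p = 1 / a, 1 / (1 + a * q0)
for k in range(6):
    print(k, round(marginal_pmf(k), 6), round(stats.nbinom.pmf(k, n, p), 6))
```

The two columns agree to numerical precision, with no Bernoulli waiting times anywhere in sight.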
Thanks!
In simple words, K|Q ~ Poisson, meaning that the number of reads mapped to a gene, conditioned on the actual concentration of the gene, is Poisson. If we marginalize over concentrations, then K, the number of reads mapped to a gene, is negative binomially distributed. So in cases where we don't have the actual concentrations, we can assume that the number of reads mapped to a gene is negative binomial. Is that right?
Exactly. And of course we never have the actual concentrations because if we did we wouldn't need to measure them by sequencing.

Dear Simon (or anyone else who wishes to comment),
I have a (more general) question related to your excellent description above. I am wondering about the differences between the negative binomial and Poisson-lognormal models for analyzing count data. It appears that most NGS read-count software takes the negative binomial approach (DESeq, edgeR, baySeq). However, it is my understanding that a Poisson-lognormal model could also be suitable for analyzing this data. If I understand correctly, the only difference between the models is that Q (using the notation above) is lognormal rather than gamma distributed in the Poisson-lognormal model. So my question is: is there any particular reason that people have chosen the negative binomial model instead of the Poisson-lognormal for read-count data? Thanks!
As I see it from some previous posts, the choice of distribution for Q won't much change the results when K (the number of mapped reads) is modeled as negative binomial:

"However, simulation shows that misspecification of the distribution for Q does not cause the results to change much, i.e., it does not really matter whether we use lognormal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)"

I would appreciate it if you could let me know about any references you were referring to in that message. Thanks again for all your explanations :)
I would still be curious, though, to find out whether there is any particular reason everyone has chosen negative binomial, or whether it is a historical accident of sorts. If it truly doesn't really matter whether you use lognormal or gamma for the distribution of Q, then I would expect that at least someone, somewhere would have published a Poisson lognormal model. 
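As a quick sanity check of the claim that the mixing distribution barely matters, here is a small simulation; the parameter values are made up for illustration, and the moment matching of the lognormal to the gamma is my own, not something from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
q0, a, n = 50.0, 0.2, 500_000   # illustrative mean, dispersion, sample size

# Poisson-gamma (i.e., NB): gamma mixing with mean q0, variance a*q0^2
k_gamma = rng.poisson(rng.gamma(1 / a, a * q0, size=n))

# Poisson-lognormal, moment-matched to the same mean and variance:
# var/mean^2 = a  ->  sigma^2 = log(1 + a),  mu = log(q0) - sigma^2 / 2
sigma2 = np.log(1 + a)
k_logn = rng.poisson(rng.lognormal(np.log(q0) - sigma2 / 2,
                                   np.sqrt(sigma2), size=n))

# the two marginal count distributions agree in their low-order moments
print(k_gamma.mean(), k_logn.mean())   # both close to q0 = 50
print(k_gamma.var(), k_logn.var())     # both close to q0 + a*q0^2 = 550
```

Matching the first two moments does not make the tails identical, of course, but it shows how interchangeable the two mixing distributions are at the level most analyses care about.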
Firth D: Multiplicative Errors: Log-normal or Gamma? J. R. Statist. Soc. B 1988;50:266-268. (But I don't have access to it at the moment, so I can't double-check whether it really supports my claim. I simply hope so. ;) ) This one here might be relevant, too: Jun Lu, John K Tomfohr and Thomas B Kepler: BMC Bioinformatics 2005, 6:165, doi:10.1186/1471-2105-6-165. Simon
Thanks Simon! I'll check them out!

Hi Simon, I have some things confused in my mind and I would need some help.
You say: "Let Q be a random variable with gamma distribution with expectation q0 and variance a q0²". If I understand correctly, q is the shape and O the scale parameter of the gamma. How did you arrive at the conclusion that the distribution of Q comes with these parameters? Also, if I understand correctly, P(K|q) then follows a Poisson with mean = var = q. How can one arrive at that result? If my question is too naive to answer here, maybe you can suggest some reading (Bayesian statistics for dummies or something).
There is no letter 'O'. This is a zero. I wanted to write 'q0' with zero as subscript but did not manage. (Don't know if the forum software allows subscripts.) So, 'q0' is one value (and distinct from q, which I use further down for the realization of the random variable Q) and 'a' is the other one.
I specified the first two moments. Solving those expressions for the shape and scale parameters, you get aq0 for the scale and 1/a for the shape. By the way, this is all frequentist statistics; there is nothing Bayesian in here.
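For what it's worth, a two-line check of that moment matching (the values of q0 and a are made up):

```python
from scipy import stats

q0, a = 5.0, 0.5              # illustrative values
shape, scale = 1 / a, a * q0  # from mean = shape*scale = q0
                              # and  var = shape*scale^2 = a*q0^2

g = stats.gamma(shape, scale=scale)
print(g.mean())   # 5.0  (= q0)
print(g.var())    # 12.5 (= a*q0^2)
```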