NGHT 04-18-2011 12:02 PM

Gene expression levels – underlying distribution

Hi there,

I am trying to model the expression level of genes using RNA-Seq. I was wondering if there is any statistical analysis that identified the underlying distribution of the expressions for next generation sequencing. In the microarry days, it was assumed that they are Normally distributed, in other words: the expression levels can be synthetically made using a Gaussian distribution. Is there any study out there that specifies the same thing for next generation sequencing?

Appreciate any help, link to the related papers or hints.

Thanks!

 visivas 04-18-2011 02:24 PM

I am not aware of any software, but I thought you could start of with Q-Q plots with various common distributions. Then I would fit it to that distribution and observe the chi-square value to see if your expression values follow a pattern and at what level.

 d17 04-18-2011 04:01 PM

Several methods model expression levels using the negative binomial distribution - see e.g. DESeq, edgeR, baySeq ...

 NGHT 04-19-2011 01:34 PM

Thank you so much for your replies!

Actually, I am not looking for a software. I am more interested in a theoretical work which studied properties of data generated by RNA-Seq.

I will use the Q-Q plots for my samples though, thanks!

About DESeq, I looked at the paper. It is interesting and certainly in the direction that I needed; but I wish it was more validations for the assumption of negative binomial distribution. Maybe more biological validations ...

 Simon Anders 04-19-2011 11:11 PM

Actually there is less to it than you may think. Let me add some remarks, using the notation from our paper:

If the gene of interest has a concentration q in your prepared library, the expectation value for the counts from this gene is proportional to q. Let's fix the units for the concentration such that the expectation is sq, where s is the size factor for this sample.

From the mechanics of high-throughput sequencing, it can be seen by theoretical considerations that the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.)

Let's say that the distribution of concentrations Q over biological replicates follows some density with mean µ and variance v=˛.

Hence, the variance of K, conditioned on Q=q, is sq, but once you marginalize out the condition, you get, by the law of total variance

Var K = sq + s˛˛

Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution, which is easy to handle.

We have no evidence that the gamma is the right distribution. In microarray times, people tended to rather assume a log-normal.

To my knowledge, there is no data set out there with enough samples to make a QQ plot to check which of these two might be more appropriate.

However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)

From microarrays, where we have data sets with enough samples, we know that the shape of these distributions is not completely off, and which of them to use does not seem to matter too much. Hence, everybody goes for negative binomial. This is, by the way, not a new idea: in many other branches of statistics that deal with count data, the NB is used since decades whenever overdispersion is present.

 NGHT 04-27-2011 09:29 PM

I do see your point and I liked you paper a lot too.

 edge 04-27-2011 11:56 PM

Apart from that, do you have any program recommend in order to identify gene expression level of RNA-seq?
Thanks first.

 Simon Anders 04-28-2011 12:18 AM

Quote:
 Originally Posted by edge (Post 40425) Hi Simon, can I have the link of your paper?
Sure:

Anders S, Huber W.: Differential expression analysis for sequence count data.
Genome Biology, 2010, 11:R106.

Quote:
 Apart from that, do you have any program recommend in order to identify gene expression level of RNA-seq?
What do you mean by "identify"? If you want an accurate estimate of expression for one sample, use e.g. cufflinks. If you want to compare expression between two experimental conditions, use our tool, DESeq, (described in the paper) or edgeR or baySeq (which are also based on the NB distribution). For more on the distinction between these two aims, see e.g. this thread.

 edge 04-28-2011 01:00 AM

Hi simon,

Below is the background of my RNA-seq data:
I got one set of human kidney sample which is sequenced by Illumina, 2X50bp, pair-end data.
I use RNA-seq short read assembler (Trinity) to assemble the RNA-seq and it generates around 20,000 contig (transcript).

After obtain the above assembly transcript, I would like to know its expression level.
Do you mean that I should use cufflinks to obtain an accurate estimate of expression for kidney tissue?
If I got one set of assembler transcript of kidney and breast tissue, I only use your DESeq for compare expression between kidney and breast tissue, am I right?

Thanks.

 tldgID 06-24-2011 10:16 AM

Quote:
 Originally Posted by Simon Anders (Post 39949) Actually there is less to it than you may think. Let me add some remarks, using the notation from our paper: If the gene of interest has a concentration q in your prepared library, the expectation value for the counts from this gene is proportional to q. Let's fix the units for the concentration such that the expectation is sq, where s is the size factor for this sample. From the mechanics of high-throughput sequencing, it can be seen by theoretical considerations that the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.) Let's say that the distribution of concentrations Q over biological replicates follows some density with mean µ and variance v=aµ˛. Hence, the variance of K, conditioned on Q=q, is sq, but once you marginalize out the condition, you get, by the law of total variance Var K = sq + s˛aµ˛ Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution, which is easy to handle. We have no evidence that the gamma is the right distribution. In microarray times, people tended to rather assume a log-normal. To my knowledge, there is no data set out there with enough samples to make a QQ plot to check which of these two might be more appropriate. However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.) From microarrays, where we have data sets with enough samples, we know that the shape of these distributions is not completely off, and which of them to use does not seem to matter too much. Hence, everybody goes for negative binomial. This is, by the way, not a new idea: in many other branches of statistics that deal with count data, the NB is used since decades whenever overdispersion is present.
Hi Simon,

The real gene concentration is q, from distribution Q, which is selected to be gamma. Now, here:

“the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.)”

K is assumed to be a Poisson. Now my question 1 is:

1- K models number of reads mapped to the specific gene, right?

2- If the answer for question 1 is yes, then, how the K becomes negative binomial after conditioning on Q and selecting Q as gamma?

“Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution”

I know that the poisson with mean from gamma gives you negative binomial, but isn’t it true that K is already assumed to be a poisson? So finally what is the distribution of number of reads mapped to a gene?

3- If K is indeed negative binomial, what is the definition of the parameters of a usual negative binomial here? I mean how the “p” and especially “r” are defined? What does it mean to repeat the experiments until the failure happens (definition of negative binomial). This is important because in the mapping each reads will successes or fails to map to a gene, but we don’t have to try until we get certain number of fails or success. Therefore, reasoning for the K to be negative binomial and modeling the reads mapped to a gene, is confusing.

It would be highly appreciated if you can explain the meaning of the random variables and concepts in above questions.

 Simon Anders 06-24-2011 11:17 AM

Somehow, you managed to get everything the wrong way round:

Quote:
 how the K becomes negative binomial after conditioning on Q
It doesn't. K, conditioned on Q, is Poisson, and it becomes NB after marginalizing out the condition. You are confusing conditioning with marginalizing, which is the opposite.

Let's start again.

There are two ways how the NB comes about. (a) the definition that you cite, namely the waiting time until r failures happen. This definition is not the general one, as it makes sense only for integer r, and r may be non-integer, too. Also, we do not repeat anything until failure here. (b) as marginal distribution of a gamma-Poisson hierarchical model. The latter is what we use.

Let Q be a random variable with gamma distribution with expectation q0 and variance a q0˛. We draw a random variate q for Q and then set up a random variable K|q, which is Poisson distributed with mean q (and hence also variance q). If we marginalize out the conditioning of K|q on q, what do we get? Integrate the product of the pdf's for K|Q and Q, f_pois(k; q) and f_gamma( q; q0, a ), for q from 0 to infinity, and you get the pdf of the NB with mean q0 and dispersion a (i.e., variance q0 + a q0˛). Note that this has nothing to do with waiting times in Bernoulli processes -- only if r=1/a happens to be integer, it happens to be the same distribution as you get when waiting for r failures in a Bernoulli process with a success probability p such that p = pr/(1-p).

 tldgID 06-24-2011 12:16 PM

Thanks!

In simple words, K|Q ~ Poisson, meaning that the number of reads mapped to a gene, conditioned on the actual concentration for the gene, is Poisson. If we marginalize over concentrations, then the K: number of reads mapped to a gene, is distributed by negative binomial. In the cases that we don’t have all the actual concentrations, we can assume that the reads mapped to a gene is negative binomial then. Is that right?

 Simon Anders 06-24-2011 12:29 PM

Exactly. And of course we never have the actual concentrations because if we did we wouldn't need to measure them by sequencing.

 d17 06-24-2011 12:36 PM

Dear Simon (or anyone else who wishes to comment),

I have a (more general) question related to your excellent description above:
I am wondering about the differences between the negative binomial and Poisson lognormal models for analyzing count data. It appears that most NGS read count software takes the negative binomial approach (DESeq, edgeR, baySeq). However, it is my understanding that a Poisson lognormal model could also be suitable for analyzing this data. If I understand correctly the only difference between the models is that Q (using the notation above) is lognormal rather than gamma distributed (in the Poisson lognormal model).

So my question is: Is there any particular reason that people have chosen the negative binomial model instead of Poisson lognormal for read count data?

Thanks!

 tldgID 06-24-2011 12:39 PM

As I see it from some previous posts, the choice for distribution of Q won’t affect the distribution of K (the # of mapped reads) to be modeled as negative binomial:

"However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)"

I appreciate if you can let me know about any references that you were referring to in that message.

Thanks again for all your explanations :)

 d17 06-24-2011 12:47 PM

Quote:
 Originally Posted by tldgID (Post 44943) As I see it from some previous posts, the choice for distribution of Q won’t affect the distribution of K (the # of mapped reads) to be modeled as negative binomial: "However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)" I appreciate if you can let me know about any references that you were referring to in that message.
Thanks for pointing that out to me, tldgID!
I would still be curious, though, to find out whether there is any particular reason everyone has chosen negative binomial, or whether it is a historical accident of sorts. If it truly doesn't really matter whether you use log-normal or gamma for the distribution of Q, then I would expect that at least someone, somewhere would have published a Poisson log-normal model.

 Simon Anders 07-03-2011 05:12 AM

Quote:
 Originally Posted by tldgID (Post 44943) I appreciate if you can let me know about any references that you were referring to in that message.
I think I had this paper here in mind:

﻿Firth D:
Multiplicative Errors: Log-normal or Gamma?
J. R. Statist. Soc. B. 1988;50:266-268.

(But I don't have access to it at the moment, so I can't double-check whether it really supports my claim. I simply hope so. ;-) )

This one here might be relevant, too:

Jun Lu, John K Tomfohr and Thomas B Kepler:
BMC Bioinformatics 2005, 6:165
doi:10.1186/1471-2105-6-165

Simon

 tldgID 07-04-2011 09:54 AM

Thanks Simon! I'll check them out!

 anle 12-22-2011 04:30 AM

Hi Simon I have some things confused in mmy mind and I would need some help
You say:
"Let Q be a random variable with gamma distribution with expectation q0 and variance a q0˛"

If I understand well q is the shape and O the scale parameter of the Gamma. How did u end up to the conclusion that the distribution of Q is coming with these parameters? Also if I understand well then the P(K|q) follows poison with mean,var=q. How can one get again to this result??
If my question is to naive to explain maybe u can suggest some reading (bayesian statistics for dummies or sthng :rolleyes:)

 Simon Anders 12-22-2011 04:47 AM

There is no letter 'O'. This is a zero. I wanted to write 'q0' with zero as subscript but did not manage. (Don't know if the forum software allows subscripts.) So, 'q0' is one value (and distinct from q, which I use further down for the realization of the random variable Q) and 'a' is the other one.

I specified the first two moments. Solving the expressions for them for the shape and scale parameter, you get for the scale a q0 and for the shape 1/a.

By the way, this is all frequentist statistics. There is nothing Bayesian in here.

All times are GMT -8. The time now is 05:32 AM.