SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
raw data of paper "The evolution of gene expression levels in mammalian organs" rzhang Illumina/Solexa 5 02-27-2012 05:08 PM
Non-linear model for RNA-Seq (reads and expression levels) tldgID RNA Sequencing 1 08-22-2011 07:06 AM
ChIP-Seq: EpiChIP: gene-by-gene quantification of epigenetic modification levels. Newsbot! Literature Watch 0 12-07-2010 02:00 AM
Estimation of expression levels kenosaki RNA Sequencing 4 08-04-2010 01:06 AM
Compare Gene Expression levels pfranchini Bioinformatics 0 09-14-2009 05:48 AM

Reply
 
Thread Tools
Old 04-18-2011, 12:02 PM   #1
NGHT
Junior Member
 
Location: USA

Join Date: Oct 2010
Posts: 3
Default Gene expression levels – underlying distribution

Hi there,

I am trying to model the expression level of genes using RNA-Seq. I was wondering if there is any statistical analysis that identified the underlying distribution of the expressions for next generation sequencing. In the microarry days, it was assumed that they are Normally distributed, in other words: the expression levels can be synthetically made using a Gaussian distribution. Is there any study out there that specifies the same thing for next generation sequencing?

Appreciate any help, link to the related papers or hints.

Thanks!
NGHT is offline   Reply With Quote
Old 04-18-2011, 02:24 PM   #2
visivas
Junior Member
 
Location: Boston, USA

Join Date: May 2010
Posts: 7
Default

I am not aware of any software, but I thought you could start of with Q-Q plots with various common distributions. Then I would fit it to that distribution and observe the chi-square value to see if your expression values follow a pattern and at what level.
visivas is offline   Reply With Quote
Old 04-18-2011, 04:01 PM   #3
d17
Member
 
Location: United States

Join Date: Sep 2008
Posts: 27
Default

Several methods model expression levels using the negative binomial distribution - see e.g. DESeq, edgeR, baySeq ...
d17 is offline   Reply With Quote
Old 04-19-2011, 01:34 PM   #4
NGHT
Junior Member
 
Location: USA

Join Date: Oct 2010
Posts: 3
Post

Thank you so much for your replies!

Actually, I am not looking for a software. I am more interested in a theoretical work which studied properties of data generated by RNA-Seq.

I will use the Q-Q plots for my samples though, thanks!

About DESeq, I looked at the paper. It is interesting and certainly in the direction that I needed; but I wish it was more validations for the assumption of negative binomial distribution. Maybe more biological validations ...
NGHT is offline   Reply With Quote
Old 04-19-2011, 11:11 PM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Actually there is less to it than you may think. Let me add some remarks, using the notation from our paper:

If the gene of interest has a concentration q in your prepared library, the expectation value for the counts from this gene is proportional to q. Let's fix the units for the concentration such that the expectation is sq, where s is the size factor for this sample.

From the mechanics of high-throughput sequencing, it can be seen by theoretical considerations that the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.)

Let's say that the distribution of concentrations Q over biological replicates follows some density with mean µ and variance v=˛.

Hence, the variance of K, conditioned on Q=q, is sq, but once you marginalize out the condition, you get, by the law of total variance

Var K = sq + s˛˛

Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution, which is easy to handle.

We have no evidence that the gamma is the right distribution. In microarray times, people tended to rather assume a log-normal.

To my knowledge, there is no data set out there with enough samples to make a QQ plot to check which of these two might be more appropriate.

However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)

From microarrays, where we have data sets with enough samples, we know that the shape of these distributions is not completely off, and which of them to use does not seem to matter too much. Hence, everybody goes for negative binomial. This is, by the way, not a new idea: in many other branches of statistics that deal with count data, the NB is used since decades whenever overdispersion is present.
Simon Anders is offline   Reply With Quote
Old 04-27-2011, 09:29 PM   #6
NGHT
Junior Member
 
Location: USA

Join Date: Oct 2010
Posts: 3
Default

Thank you Simon for your reply!

I do see your point and I liked you paper a lot too.
NGHT is offline   Reply With Quote
Old 04-27-2011, 11:56 PM   #7
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi Simon, can I have the link of your paper?
Apart from that, do you have any program recommend in order to identify gene expression level of RNA-seq?
Thanks first.
edge is offline   Reply With Quote
Old 04-28-2011, 12:18 AM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by edge View Post
Hi Simon, can I have the link of your paper?
Sure:

Anders S, Huber W.: Differential expression analysis for sequence count data.
Genome Biology, 2010, 11:R106.

Quote:
Apart from that, do you have any program recommend in order to identify gene expression level of RNA-seq?
What do you mean by "identify"? If you want an accurate estimate of expression for one sample, use e.g. cufflinks. If you want to compare expression between two experimental conditions, use our tool, DESeq, (described in the paper) or edgeR or baySeq (which are also based on the NB distribution). For more on the distinction between these two aims, see e.g. this thread.
Simon Anders is offline   Reply With Quote
Old 04-28-2011, 01:00 AM   #9
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi simon,

Below is the background of my RNA-seq data:
I got one set of human kidney sample which is sequenced by Illumina, 2X50bp, pair-end data.
I use RNA-seq short read assembler (Trinity) to assemble the RNA-seq and it generates around 20,000 contig (transcript).

After obtain the above assembly transcript, I would like to know its expression level.
Do you mean that I should use cufflinks to obtain an accurate estimate of expression for kidney tissue?
If I got one set of assembler transcript of kidney and breast tissue, I only use your DESeq for compare expression between kidney and breast tissue, am I right?

Thanks.
edge is offline   Reply With Quote
Old 06-24-2011, 10:16 AM   #10
tldgID
Member
 
Location: USA

Join Date: May 2011
Posts: 18
Default

Quote:
Originally Posted by Simon Anders View Post
Actually there is less to it than you may think. Let me add some remarks, using the notation from our paper:

If the gene of interest has a concentration q in your prepared library, the expectation value for the counts from this gene is proportional to q. Let's fix the units for the concentration such that the expectation is sq, where s is the size factor for this sample.

From the mechanics of high-throughput sequencing, it can be seen by theoretical considerations that the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.)

Let's say that the distribution of concentrations Q over biological replicates follows some density with mean µ and variance v=˛.

Hence, the variance of K, conditioned on Q=q, is sq, but once you marginalize out the condition, you get, by the law of total variance

Var K = sq + s˛˛

Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution, which is easy to handle.

We have no evidence that the gamma is the right distribution. In microarray times, people tended to rather assume a log-normal.

To my knowledge, there is no data set out there with enough samples to make a QQ plot to check which of these two might be more appropriate.

However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)

From microarrays, where we have data sets with enough samples, we know that the shape of these distributions is not completely off, and which of them to use does not seem to matter too much. Hence, everybody goes for negative binomial. This is, by the way, not a new idea: in many other branches of statistics that deal with count data, the NB is used since decades whenever overdispersion is present.
Hi Simon,

I am trying to understand the proposed model. After reading your paper and your comments here, now I have some questions:

The real gene concentration is q, from distribution Q, which is selected to be gamma. Now, here:

“the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.)”

K is assumed to be a Poisson. Now my question 1 is:

1- K models number of reads mapped to the specific gene, right?

2- If the answer for question 1 is yes, then, how the K becomes negative binomial after conditioning on Q and selecting Q as gamma?

“Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution”

I know that the poisson with mean from gamma gives you negative binomial, but isn’t it true that K is already assumed to be a poisson? So finally what is the distribution of number of reads mapped to a gene?

3- If K is indeed negative binomial, what is the definition of the parameters of a usual negative binomial here? I mean how the “p” and especially “r” are defined? What does it mean to repeat the experiments until the failure happens (definition of negative binomial). This is important because in the mapping each reads will successes or fails to map to a gene, but we don’t have to try until we get certain number of fails or success. Therefore, reasoning for the K to be negative binomial and modeling the reads mapped to a gene, is confusing.

It would be highly appreciated if you can explain the meaning of the random variables and concepts in above questions.
tldgID is offline   Reply With Quote
Old 06-24-2011, 11:17 AM   #11
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Somehow, you managed to get everything the wrong way round:

Quote:
how the K becomes negative binomial after conditioning on Q
It doesn't. K, conditioned on Q, is Poisson, and it becomes NB after marginalizing out the condition. You are confusing conditioning with marginalizing, which is the opposite.

Let's start again.

There are two ways how the NB comes about. (a) the definition that you cite, namely the waiting time until r failures happen. This definition is not the general one, as it makes sense only for integer r, and r may be non-integer, too. Also, we do not repeat anything until failure here. (b) as marginal distribution of a gamma-Poisson hierarchical model. The latter is what we use.

Let Q be a random variable with gamma distribution with expectation q0 and variance a q0˛. We draw a random variate q for Q and then set up a random variable K|q, which is Poisson distributed with mean q (and hence also variance q). If we marginalize out the conditioning of K|q on q, what do we get? Integrate the product of the pdf's for K|Q and Q, f_pois(k; q) and f_gamma( q; q0, a ), for q from 0 to infinity, and you get the pdf of the NB with mean q0 and dispersion a (i.e., variance q0 + a q0˛). Note that this has nothing to do with waiting times in Bernoulli processes -- only if r=1/a happens to be integer, it happens to be the same distribution as you get when waiting for r failures in a Bernoulli process with a success probability p such that p = pr/(1-p).

Last edited by Simon Anders; 06-02-2012 at 03:33 AM. Reason: inserted missing "not"
Simon Anders is offline   Reply With Quote
Old 06-24-2011, 12:16 PM   #12
tldgID
Member
 
Location: USA

Join Date: May 2011
Posts: 18
Default

Thanks!

In simple words, K|Q ~ Poisson, meaning that the number of reads mapped to a gene, conditioned on the actual concentration for the gene, is Poisson. If we marginalize over concentrations, then the K: number of reads mapped to a gene, is distributed by negative binomial. In the cases that we don’t have all the actual concentrations, we can assume that the reads mapped to a gene is negative binomial then. Is that right?
tldgID is offline   Reply With Quote
Old 06-24-2011, 12:29 PM   #13
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Exactly. And of course we never have the actual concentrations because if we did we wouldn't need to measure them by sequencing.
Simon Anders is offline   Reply With Quote
Old 06-24-2011, 12:36 PM   #14
d17
Member
 
Location: United States

Join Date: Sep 2008
Posts: 27
Default

Dear Simon (or anyone else who wishes to comment),

I have a (more general) question related to your excellent description above:
I am wondering about the differences between the negative binomial and Poisson lognormal models for analyzing count data. It appears that most NGS read count software takes the negative binomial approach (DESeq, edgeR, baySeq). However, it is my understanding that a Poisson lognormal model could also be suitable for analyzing this data. If I understand correctly the only difference between the models is that Q (using the notation above) is lognormal rather than gamma distributed (in the Poisson lognormal model).

So my question is: Is there any particular reason that people have chosen the negative binomial model instead of Poisson lognormal for read count data?

Thanks!
d17 is offline   Reply With Quote
Old 06-24-2011, 12:39 PM   #15
tldgID
Member
 
Location: USA

Join Date: May 2011
Posts: 18
Default

As I see it from some previous posts, the choice for distribution of Q won’t affect the distribution of K (the # of mapped reads) to be modeled as negative binomial:

"However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)"

I appreciate if you can let me know about any references that you were referring to in that message.

Thanks again for all your explanations
tldgID is offline   Reply With Quote
Old 06-24-2011, 12:47 PM   #16
d17
Member
 
Location: United States

Join Date: Sep 2008
Posts: 27
Default

Quote:
Originally Posted by tldgID View Post
As I see it from some previous posts, the choice for distribution of Q won’t affect the distribution of K (the # of mapped reads) to be modeled as negative binomial:

"However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)"

I appreciate if you can let me know about any references that you were referring to in that message.
Thanks for pointing that out to me, tldgID!
I would still be curious, though, to find out whether there is any particular reason everyone has chosen negative binomial, or whether it is a historical accident of sorts. If it truly doesn't really matter whether you use log-normal or gamma for the distribution of Q, then I would expect that at least someone, somewhere would have published a Poisson log-normal model.
d17 is offline   Reply With Quote
Old 07-03-2011, 05:12 AM   #17
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by tldgID View Post
I appreciate if you can let me know about any references that you were referring to in that message.
I think I had this paper here in mind:

Firth D:
Multiplicative Errors: Log-normal or Gamma?
J. R. Statist. Soc. B. 1988;50:266-268.

(But I don't have access to it at the moment, so I can't double-check whether it really supports my claim. I simply hope so. ;-) )


This one here might be relevant, too:

Jun Lu, John K Tomfohr and Thomas B Kepler:
BMC Bioinformatics 2005, 6:165
doi:10.1186/1471-2105-6-165


Simon

Last edited by Simon Anders; 07-03-2011 at 05:25 AM.
Simon Anders is offline   Reply With Quote
Old 07-04-2011, 09:54 AM   #18
tldgID
Member
 
Location: USA

Join Date: May 2011
Posts: 18
Default

Thanks Simon! I'll check them out!
tldgID is offline   Reply With Quote
Old 12-22-2011, 04:30 AM   #19
anle
Junior Member
 
Location: Germany

Join Date: Mar 2011
Posts: 7
Default

Hi Simon I have some things confused in mmy mind and I would need some help
You say:
"Let Q be a random variable with gamma distribution with expectation q0 and variance a q0˛"

If I understand well q is the shape and O the scale parameter of the Gamma. How did u end up to the conclusion that the distribution of Q is coming with these parameters? Also if I understand well then the P(K|q) follows poison with mean,var=q. How can one get again to this result??
If my question is to naive to explain maybe u can suggest some reading (bayesian statistics for dummies or sthng )
anle is offline   Reply With Quote
Old 12-22-2011, 04:47 AM   #20
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

There is no letter 'O'. This is a zero. I wanted to write 'q0' with zero as subscript but did not manage. (Don't know if the forum software allows subscripts.) So, 'q0' is one value (and distinct from q, which I use further down for the realization of the random variable Q) and 'a' is the other one.

I specified the first two moments. Solving the expressions for them for the shape and scale parameter, you get for the scale a q0 and for the shape 1/a.

By the way, this is all frequentist statistics. There is nothing Bayesian in here.
Simon Anders is offline   Reply With Quote
Reply

Tags
distribution, expression level, gaussian, normal, statistical

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:24 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO