Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Gene expression levels – underlying distribution

    Hi there,

    I am trying to model the expression level of genes using RNA-Seq. I was wondering if there is any statistical analysis that identified the underlying distribution of the expressions for next generation sequencing. In the microarry days, it was assumed that they are Normally distributed, in other words: the expression levels can be synthetically made using a Gaussian distribution. Is there any study out there that specifies the same thing for next generation sequencing?

    Appreciate any help, link to the related papers or hints.

    Thanks!

  • #2
    I am not aware of any software, but I thought you could start of with Q-Q plots with various common distributions. Then I would fit it to that distribution and observe the chi-square value to see if your expression values follow a pattern and at what level.

    Comment


    • #3
      Several methods model expression levels using the negative binomial distribution - see e.g. DESeq, edgeR, baySeq ...

      Comment


      • #4
        Thank you so much for your replies!

        Actually, I am not looking for a software. I am more interested in a theoretical work which studied properties of data generated by RNA-Seq.

        I will use the Q-Q plots for my samples though, thanks!

        About DESeq, I looked at the paper. It is interesting and certainly in the direction that I needed; but I wish it was more validations for the assumption of negative binomial distribution. Maybe more biological validations ...

        Comment


        • #5
          Actually there is less to it than you may think. Let me add some remarks, using the notation from our paper:

          If the gene of interest has a concentration q in your prepared library, the expectation value for the counts from this gene is proportional to q. Let's fix the units for the concentration such that the expectation is sq, where s is the size factor for this sample.

          From the mechanics of high-throughput sequencing, it can be seen by theoretical considerations that the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.)

          Let's say that the distribution of concentrations Q over biological replicates follows some density with mean µ and variance v=².

          Hence, the variance of K, conditioned on Q=q, is sq, but once you marginalize out the condition, you get, by the law of total variance

          Var K = sq + s²²

          Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution, which is easy to handle.

          We have no evidence that the gamma is the right distribution. In microarray times, people tended to rather assume a log-normal.

          To my knowledge, there is no data set out there with enough samples to make a QQ plot to check which of these two might be more appropriate.

          However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)

          From microarrays, where we have data sets with enough samples, we know that the shape of these distributions is not completely off, and which of them to use does not seem to matter too much. Hence, everybody goes for negative binomial. This is, by the way, not a new idea: in many other branches of statistics that deal with count data, the NB is used since decades whenever overdispersion is present.

          Comment


          • #6
            Thank you Simon for your reply!

            I do see your point and I liked you paper a lot too.

            Comment


            • #7
              Hi Simon, can I have the link of your paper?
              Apart from that, do you have any program recommend in order to identify gene expression level of RNA-seq?
              Thanks first.

              Comment


              • #8
                Originally posted by edge View Post
                Hi Simon, can I have the link of your paper?
                Sure:

                Anders S, Huber W.: Differential expression analysis for sequence count data.
                Genome Biology, 2010, 11:R106.

                Apart from that, do you have any program recommend in order to identify gene expression level of RNA-seq?
                What do you mean by "identify"? If you want an accurate estimate of expression for one sample, use e.g. cufflinks. If you want to compare expression between two experimental conditions, use our tool, DESeq, (described in the paper) or edgeR or baySeq (which are also based on the NB distribution). For more on the distinction between these two aims, see e.g. this thread.

                Comment


                • #9
                  Hi simon,

                  Below is the background of my RNA-seq data:
                  I got one set of human kidney sample which is sequenced by Illumina, 2X50bp, pair-end data.
                  I use RNA-seq short read assembler (Trinity) to assemble the RNA-seq and it generates around 20,000 contig (transcript).

                  After obtain the above assembly transcript, I would like to know its expression level.
                  Do you mean that I should use cufflinks to obtain an accurate estimate of expression for kidney tissue?
                  If I got one set of assembler transcript of kidney and breast tissue, I only use your DESeq for compare expression between kidney and breast tissue, am I right?

                  Thanks.

                  Comment


                  • #10
                    Originally posted by Simon Anders View Post
                    Actually there is less to it than you may think. Let me add some remarks, using the notation from our paper:

                    If the gene of interest has a concentration q in your prepared library, the expectation value for the counts from this gene is proportional to q. Let's fix the units for the concentration such that the expectation is sq, where s is the size factor for this sample.

                    From the mechanics of high-throughput sequencing, it can be seen by theoretical considerations that the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.)

                    Let's say that the distribution of concentrations Q over biological replicates follows some density with mean µ and variance v=².

                    Hence, the variance of K, conditioned on Q=q, is sq, but once you marginalize out the condition, you get, by the law of total variance

                    Var K = sq + s²²

                    Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution, which is easy to handle.

                    We have no evidence that the gamma is the right distribution. In microarray times, people tended to rather assume a log-normal.

                    To my knowledge, there is no data set out there with enough samples to make a QQ plot to check which of these two might be more appropriate.

                    However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)

                    From microarrays, where we have data sets with enough samples, we know that the shape of these distributions is not completely off, and which of them to use does not seem to matter too much. Hence, everybody goes for negative binomial. This is, by the way, not a new idea: in many other branches of statistics that deal with count data, the NB is used since decades whenever overdispersion is present.
                    Hi Simon,

                    I am trying to understand the proposed model. After reading your paper and your comments here, now I have some questions:

                    The real gene concentration is q, from distribution Q, which is selected to be gamma. Now, here:

                    “the distribution of counts K for the gene, given the concentration q, is Poisson distributed (with parameter sq.)”

                    K is assumed to be a Poisson. Now my question 1 is:

                    1- K models number of reads mapped to the specific gene, right?

                    2- If the answer for question 1 is yes, then, how the K becomes negative binomial after conditioning on Q and selecting Q as gamma?

                    “Now, we have to choose a distribution for Q. Purely out of mathematical convenience, we postulate it to follow a gamma distribution, because then, K will follow a negative binomial distribution”

                    I know that the poisson with mean from gamma gives you negative binomial, but isn’t it true that K is already assumed to be a poisson? So finally what is the distribution of number of reads mapped to a gene?

                    3- If K is indeed negative binomial, what is the definition of the parameters of a usual negative binomial here? I mean how the “p” and especially “r” are defined? What does it mean to repeat the experiments until the failure happens (definition of negative binomial). This is important because in the mapping each reads will successes or fails to map to a gene, but we don’t have to try until we get certain number of fails or success. Therefore, reasoning for the K to be negative binomial and modeling the reads mapped to a gene, is confusing.

                    It would be highly appreciated if you can explain the meaning of the random variables and concepts in above questions.

                    Comment


                    • #11
                      Somehow, you managed to get everything the wrong way round:

                      how the K becomes negative binomial after conditioning on Q
                      It doesn't. K, conditioned on Q, is Poisson, and it becomes NB after marginalizing out the condition. You are confusing conditioning with marginalizing, which is the opposite.

                      Let's start again.

                      There are two ways how the NB comes about. (a) the definition that you cite, namely the waiting time until r failures happen. This definition is not the general one, as it makes sense only for integer r, and r may be non-integer, too. Also, we do not repeat anything until failure here. (b) as marginal distribution of a gamma-Poisson hierarchical model. The latter is what we use.

                      Let Q be a random variable with gamma distribution with expectation q0 and variance a q0². We draw a random variate q for Q and then set up a random variable K|q, which is Poisson distributed with mean q (and hence also variance q). If we marginalize out the conditioning of K|q on q, what do we get? Integrate the product of the pdf's for K|Q and Q, f_pois(k; q) and f_gamma( q; q0, a ), for q from 0 to infinity, and you get the pdf of the NB with mean q0 and dispersion a (i.e., variance q0 + a q0²). Note that this has nothing to do with waiting times in Bernoulli processes -- only if r=1/a happens to be integer, it happens to be the same distribution as you get when waiting for r failures in a Bernoulli process with a success probability p such that p = pr/(1-p).
                      Last edited by Simon Anders; 06-02-2012, 03:33 AM. Reason: inserted missing "not"

                      Comment


                      • #12
                        Thanks!

                        In simple words, K|Q ~ Poisson, meaning that the number of reads mapped to a gene, conditioned on the actual concentration for the gene, is Poisson. If we marginalize over concentrations, then the K: number of reads mapped to a gene, is distributed by negative binomial. In the cases that we don’t have all the actual concentrations, we can assume that the reads mapped to a gene is negative binomial then. Is that right?

                        Comment


                        • #13
                          Exactly. And of course we never have the actual concentrations because if we did we wouldn't need to measure them by sequencing.

                          Comment


                          • #14
                            Dear Simon (or anyone else who wishes to comment),

                            I have a (more general) question related to your excellent description above:
                            I am wondering about the differences between the negative binomial and Poisson lognormal models for analyzing count data. It appears that most NGS read count software takes the negative binomial approach (DESeq, edgeR, baySeq). However, it is my understanding that a Poisson lognormal model could also be suitable for analyzing this data. If I understand correctly the only difference between the models is that Q (using the notation above) is lognormal rather than gamma distributed (in the Poisson lognormal model).

                            So my question is: Is there any particular reason that people have chosen the negative binomial model instead of Poisson lognormal for read count data?

                            Thanks!

                            Comment


                            • #15
                              As I see it from some previous posts, the choice for distribution of Q won’t affect the distribution of K (the # of mapped reads) to be modeled as negative binomial:

                              "However, simulation shows that misspecification of the distribution for Q does not cause the results too change much, i.e., it does not really matter whether we use log-normal or gamma or something else of roughly this shape. (Sorry, can't find the references for this now; ask again if you need them.)"

                              I appreciate if you can let me know about any references that you were referring to in that message.

                              Thanks again for all your explanations

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              50 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X