  • RNA-Seq, Differential Expression: a theoretical question of modeling methodology

    Could you please help me understand where I went wrong on the issue described below?

    Suppose I am interested in detecting differential expression (DE) of a fixed transcript X between Tumor and Normal conditions with only one replicate (library) per condition.

    Tumor library has 1 million reads, of which 50,000 map to transcript X.

    Normal library has 2 million reads, of which 200,000 map to transcript X.

    There are two major modeling frameworks.

    1) The most conventional one (implemented in edgeR, EBSeq, Cuffdiff, and many others) is to

    a) adjust the “raw” count for the library size; b) assume that the adjusted count comes from a certain distribution with unbounded support (Poisson, Negative Binomial, etc.); and c) fit a regression model where the covariates correspond to the conditions and the adjusted counts correspond to the response.

    However, in this example there will be only one adjusted count per condition, and all such models will have zero degrees of freedom for the error. No p-values will be produced. In particular, edgeR is pitched as a method for low-replication scenarios, but it still requires at least one condition that has two or more replicates.
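
A minimal sketch of the library-size adjustment and the degrees-of-freedom count described above. Counts-per-million scaling is an assumption used here for illustration; the tools named in the post each use their own normalization.

```python
# Counts from the example in the post.
libraries = {
    "Tumor": {"mapped_to_X": 50_000, "library_size": 1_000_000},
    "Normal": {"mapped_to_X": 200_000, "library_size": 2_000_000},
}


def counts_per_million(count, library_size):
    """Scale a raw count to a common library size of one million reads.

    Assumption: CPM stands in for whatever adjustment a given tool applies.
    """
    return count * 1_000_000 / library_size


cpm = {
    name: counts_per_million(lib["mapped_to_X"], lib["library_size"])
    for name, lib in libraries.items()
}

n_observations = len(cpm)  # one adjusted count per condition
n_parameters = 2           # one mean per condition
residual_df = n_observations - n_parameters

print(cpm)
print("residual degrees of freedom:", residual_df)
```

With two observations and two fitted means, nothing is left to estimate the error variance, which is why no p-value comes out of such a fit.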

    2) Assume a Binomial trials scheme for each library. E.g., for the Tumor library there are 1 million trials and 50,000 successes. The null hypothesis says that the probability of success is the same in both libraries. This framework is equivalent to fitting a logistic regression with a factor that has two levels.

    Most importantly, this model is well replicated: it has as many observations as the total number of reads in the dataset, i.e. 3 million. Each observation equals 1 if the corresponding read maps to transcript X and 0 otherwise. In Method 1), X has only one observation under the Tumor condition. In Method 2), it has 1 million observations under Tumor.
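
One way to sketch the Method 2 test on the counts above is a pooled two-proportion z-test, which is the large-sample counterpart of the two-level logistic regression. The function name is hypothetical and only the standard library is used.

```python
import math


def two_proportion_z_test(k1, n1, k2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)  # common success probability under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Tumor: 50,000 of 1 million reads; Normal: 200,000 of 2 million reads.
z, p_value = two_proportion_z_test(50_000, 1_000_000, 200_000, 2_000_000)
print(f"z = {z:.1f}, p-value = {p_value:.3g}")
```

The z statistic is roughly -148, so the p-value underflows to zero in double precision. The test treats every read as an independent trial, so it only measures sampling noise within these two particular libraries.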

    When there are a few factors in the model, it should work the same way. Because of low replication, Method 1) will often boil down to n-way ANOVA with one replicate per cell. If we switch to Method 2), each point that was considered a single observation in 1) will expand to as many replicates as there are reads in the corresponding library.

    Therefore, I fail to understand why framework 2) has not been used all over the place to avoid the replication problem that is so common in RNA-Seq studies. Apparently, there should be a good reason. If you have an idea, please let me know.

    Regards,
    Nik

  • #2
    One can also say that Model 2) contains more information about the data: it is always possible to convert 2) into 1), but not the other way round. Why do they lose information on purpose?

    • #3
      Dear Nik,

      thanks. This is a good question that has probably already been asked by everyone working in this field. Here's what has motivated us to follow the approach used by DESeq.

      Biologists are usually not just interested in rejecting the null hypothesis of no differential expression overall, but want to pinpoint the particular genes affected. Your test for model 2 is, as far as I can see, more susceptible to rejections for one gene when in fact other genes are differentially expressed (especially if the latter take up a lot of reads).

      Also, in model 1 it is straightforward to add a layer that accounts for overdispersion (i.e. biological variation in the rates underlying the counting/sampling), which is crucial for applications. I am sure that can also be done for model 2, but I am less aware of where it has been done.
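
To give a rough sense of why overdispersion matters, the sketch below compares the standard deviation of a count under pure Poisson sampling with that under a negative binomial whose variance is mean + dispersion * mean^2. The dispersion value of 0.1 is a hypothetical choice for illustration, not an estimate from any data in this thread.

```python
import math


def nb_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean**2.

    A dispersion of 0 reduces this to the Poisson variance (equal to the mean).
    """
    return mean + dispersion * mean ** 2


mean_count = 50_000  # Tumor count for transcript X in the example
dispersion = 0.1     # hypothetical value, chosen only for illustration

poisson_sd = math.sqrt(nb_variance(mean_count, 0.0))
nb_sd = math.sqrt(nb_variance(mean_count, dispersion))

print(f"Poisson sd:           {poisson_sd:.0f}")
print(f"Negative binomial sd: {nb_sd:.0f}")
print(f"ratio:                {nb_sd / poisson_sd:.0f}")
```

At this read depth and under the assumed dispersion, the biological component is dozens of times larger than the sampling component, so a model that contains only sampling noise understates the uncertainty by a wide margin.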

      Best wishes
      Wolfgang
      Wolfgang Huber
      EMBL

      • #4
        Dear Nik,

        my opinion on your example is (regardless of how it was measured: with millions of photons collected by the camera in the case of microarrays, or thousands of reads sequenced from the libraries; in the first case the photons are treated as one analog value and in the second case as digital count data, which has a tremendous impact on the statistical methods used...):

        finally, the unit of transcriptomics is the gene/transcript, not the read:
        T: 1 (0.05 <- 5e4/1e6)
        N: 2 (0.1 <- 2e5/2e6)

        what a biologist wants to know is:
        is the difference in expression of a gene/transcript statistically significant and relevant? the latter is a matter of real biological experiments (or at least a matter of biological interpretation of omics data); the former is a matter of biological replicates.

        to see if the 1:2 difference is significant you have to know the biological variation in both groups! and this can only be estimated with biological replicates, not with technical replicates, not by modelling the variance with information borrowed from other genes, and not by using reads as the units of interest...

        this is all the more true for cancer and normal samples, because here the variance for a gene can be completely different in both groups as well as compared to other genes...
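
The point about biological variation can be illustrated with a small simulation: two individuals are drawn from the same population (so there is no true difference), and a read-level two-proportion z-test is applied to their libraries. The log-normal spread of 0.3 between individuals and the normal approximation to binomial sampling are assumptions made only for this sketch.

```python
import math
import random


def z_statistic(k1, n1, k2, n2):
    """Pooled two-proportion z statistic, treating each read as a trial."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (k1 / n1 - k2 / n2) / se


def simulate_count(rng, library_size, mean_fraction, log_sd):
    """Draw one library's count for transcript X."""
    # Biological variation: the true fraction differs between individuals.
    fraction = min(mean_fraction * math.exp(rng.gauss(0.0, log_sd)), 0.99)
    # Sampling noise: normal approximation to Binomial(library_size, fraction).
    sd = math.sqrt(library_size * fraction * (1 - fraction))
    return max(0, round(rng.gauss(library_size * fraction, sd)))


def false_positive_rate(n_sim=2000, log_sd=0.3, seed=1):
    """Fraction of null comparisons rejected at the nominal 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        k1 = simulate_count(rng, 1_000_000, 0.05, log_sd)
        k2 = simulate_count(rng, 2_000_000, 0.05, log_sd)
        if abs(z_statistic(k1, 1_000_000, k2, 2_000_000)) > 1.96:
            rejections += 1
    return rejections / n_sim


rate_with_variation = false_positive_rate(log_sd=0.3)
rate_without_variation = false_positive_rate(log_sd=0.0)
print("false positive rate with biological variation:   ", rate_with_variation)
print("false positive rate without biological variation:", rate_without_variation)
```

Under these assumptions the test rejects in nearly every null comparison once individuals differ, while it stays near the nominal 5% when the only noise is read sampling.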

        dietmar
        Last edited by dietmar13; 04-13-2013, 09:18 AM. Reason: typo

        • #5
          Your test for model 2 is, as far as I can see, more susceptible to rejections for one gene when in fact other genes are differentially expressed (especially if the latter take up a lot of reads).

          -------------

          In Model 2 it is obvious, but Model 1 has the very same problem. If, for a given library, transcript X attracts more reads, it means that fewer of them are left for the rest of the transcripts. One can use some normalization tricks to mitigate that, but it's unlikely to resolve the issue completely.
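
The competition for reads can be shown with a toy calculation. The three transcripts and their abundances below are hypothetical; only X changes between conditions.

```python
def read_fractions(abundances):
    """Expected share of reads per transcript, given absolute abundances."""
    total = sum(abundances.values())
    return {gene: amount / total for gene, amount in abundances.items()}


# Hypothetical absolute abundances: only X differs between conditions.
normal = {"X": 100, "Y": 100, "Z": 100}
tumor = {"X": 400, "Y": 100, "Z": 100}

normal_frac = read_fractions(normal)
tumor_frac = read_fractions(tumor)

# Apparent fold change in read share (Tumor relative to Normal).
fold_change = {gene: tumor_frac[gene] / normal_frac[gene] for gene in normal}
for gene, ratio in fold_change.items():
    print(gene, round(ratio, 3))
```

Y and Z appear to drop to half their share even though their abundance is unchanged, and the fourfold increase in X shows up as only a twofold increase in share. Any method that works from read proportions inherits this effect unless the normalization corrects for it.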

          • #6
            this is all the more true for cancer and normal samples, because here the variance for a gene can be completely different in both groups as well as compared to other genes...

            -------------

            How do you know what can or cannot be different? If that knowledge comes from having lots of biological replicates, then the issue is moot to begin with.

            • #7
              Actually, I made a mistake. If Poisson is used in Model 1, then the p-value can be obtained even without replication. The Poisson setup approximately corresponds to the raw counts being Poisson(lambda = n * p), where n is the library size and p is the probability of success from Model 2. If we normalize the "raw" count by dividing it by the library size, Model 1 will be about the same as Model 2.
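
A sketch of how the Poisson setup yields a test without replication. If the two raw counts are independent Poisson variables with means n1 * p and n2 * p, then conditional on their total, the Tumor count is Binomial(total, n1 / (n1 + n2)). The normal approximation and the function name are choices made for this illustration.

```python
import math


def conditional_poisson_z(k1, n1, k2, n2):
    """z statistic for equal Poisson rates per read, conditioning on k1 + k2.

    Under the null, k1 given the total is Binomial(k1 + k2, n1 / (n1 + n2)).
    """
    total = k1 + k2
    share = n1 / (n1 + n2)  # expected share of counts from library 1
    expected = total * share
    sd = math.sqrt(total * share * (1 - share))
    return (k1 - expected) / sd


# Tumor: 50,000 of 1 million reads; Normal: 200,000 of 2 million reads.
z_poisson = conditional_poisson_z(50_000, 1_000_000, 200_000, 2_000_000)
print(f"z = {z_poisson:.1f}")
```

The statistic comes out near -141, close to the value of about -148 that a pooled two-proportion z-test gives on the same counts, which is consistent with the two models being about the same here.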

              It's somewhat confusing that, even though we are interested in the proportion p, the normalized proxy for that proportion is a count that is itself assumed to come from a Poisson distribution.
