  • choosing & validating RNA-Seq time course data normalization method(s)

    Dear all,

    I seek your help with choosing & validating RNA-Seq time course data normalization method(s) for my work.

    My data set is 4 reps per time point, and 9 time points.
    I want to extract co-expressed genes based on their shared expression profiles over time. So I am NOT asking you how to perform pair-wise DE gene identification.

    I know there have been multiple posts on the topic of RNA-Seq data normalization. This is my 1st post here, so at the risk of being repetitive with some of my questions, and irking some or all of you, here I go:

    1. For my purposes, I am assuming that the raw mapped count data first needs to be normalized, right?

    2. Should I test different methods of normalizing my raw, mapped counts, like TMM, quantile, etc.?

    3. Strictly speaking, should the choice of normalization method be justified through some measure or test, or is it the norm to try out different methods?

    4. Do both edgeR and DESeq offer different built-in methods of data normalization applicable for time-course data (NOT pair-wise comparisons)?

    5. Will normalization have to be performed with respect to a reference data point, let's say time point zero (which makes intuitive and biological sense to me),
    OR
    are there variants of normalization that can normalize the data across time without explicitly choosing a reference? (Such a method, if it exists, does not make intuitive or biological sense to me.)

    6. What is the best place for someone like myself, new to bio-statistics and the R environment, to quickly learn tricks of the trade?

    Lots of questions, I know; hoping this forum can help out a poor, starving grad student.

    Thanks a ton.
    Wishing you all happy holidays and a fantastic 2012!

    AksR
    -----------------
    CTTATTGTTGAACTTOAATGGTGCTAATGATCCTCGTOTCTCCTGAACGT
    (translate THAT!)

  • #2
    Let me start by answering three of your questions:

    4. Do both edgeR and DESeq offer different built-in methods of data normalization applicable for time-course data (NOT pair-wise comparisons)?
    Normalization is independent of the experimental design. The built-in normalisations of DESeq and edgeR simply determine for each sample a scaling factor (or: size factor), such that all samples' counts, when multiplied with their factor, are on a common scale that allows for comparisons. What you want to compare with what is unimportant for this step.

    5. Will normalization have to be performed with respect to a reference data point, let's say time point zero (which makes intuitive and biological sense to me)
    OR
    are there variants of normalization that can normalize data across time, but without explicitly choosing a reference (such a method, if it exists, does not make intuitive or biological sense to me)
    DESeq chooses the size factors such that their product is one, in order to put the common scale somewhere in the middle of all the library sizes. If you multiplied all the factors by a constant, the analysis result would not change. Hence, one could just as well declare an arbitrary sample as the reference and choose the factors such that this sample is assigned a factor of one.
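    To make that rescaling-invariance point concrete, here is a small numeric sketch in Python with made-up counts (a toy median-of-ratios calculation, not DESeq's actual implementation):

    ```python
    import numpy as np

    # Toy counts: rows = genes, columns = samples (hypothetical numbers).
    counts = np.array([[ 10,  25,  40],
                       [100, 260, 410],
                       [ 55, 130, 190]], dtype=float)

    def size_factors(mat):
        """Median-of-ratios size factors, rescaled so their product is one."""
        log_geo_means = np.mean(np.log(mat), axis=1)     # per-gene log geometric mean
        log_ratios = np.log(mat) - log_geo_means[:, None]
        sf = np.exp(np.median(log_ratios, axis=0))       # per-sample median ratio
        return sf / np.exp(np.mean(np.log(sf)))          # force the product to be one

    sf = size_factors(counts)
    norm1 = counts / sf          # normalized with product-one factors
    norm2 = counts / (sf * 3.0)  # same factors, all multiplied by a constant

    # The ratios between samples are identical either way, so the
    # downstream analysis does not change:
    assert np.allclose(norm1[:, 0] / norm1[:, 1], norm2[:, 0] / norm2[:, 1])
    ```

    The constant only shifts every normalized count by the same factor, which is why an arbitrary sample could equally well serve as the reference.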

    3. Strictly speaking, should the choice of normalization method be justified through some measure or test, or is it the norm to try out different methods?
    If the normalization does not work well, replicates will appear less similar than they are. This drives up the variance estimate and reduces the number of hits. Hence, in theory, a bad normalization should only reduce power, i.e., it is conservative. I'm not sure, though, whether it would be a good idea to use the number of hits in the downstream test for differential expression as a figure of merit for the quality of the normalization; one might easily fall for outliers that way.

    • #3
      Looking for co-expressed genes throughout time points? I haven't seen much of this in NGS papers yet. What about a clustering approach? Maybe this thread could help.

      • #4
        Originally posted by Simon Anders View Post
        DESeq chooses the size factors such that their product is one, in order to put the common scale somewhere in the middle of all the library sizes. If you multiplied all the factors by a constant, the analysis result would not change. Hence, one could just as well declare an arbitrary sample as the reference and choose the factors such that this sample is assigned a factor of one.
        I have some questions regarding the calculation of the geometric mean to normalize individual libraries as implemented by estimateSizeFactors in DESeq.
        I checked out the DESeq package documentation for estimateSizeFactorsForMatrix

        Description:
        Given a matrix or data frame of count data, this function
        estimates the size factors as follows: Each column is divided by
        the geometric means of the rows. The median (or, if requested,
        another location estimator) of these ratios (skipping the genes
        with a geometric mean of zero) is used as the size factor for this
        column.
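        That description can be sketched as follows (a toy re-implementation in Python for illustration, not the actual DESeq code):

        ```python
        import numpy as np

        def estimate_size_factors(counts):
            """Toy median-of-ratios estimator following the description above:
            divide each column by the per-gene geometric means, then take the
            median of these ratios per column, skipping genes whose geometric
            mean is zero (i.e. any gene with a zero count in some sample)."""
            with np.errstate(divide="ignore"):           # log(0) = -inf is expected here
                log_counts = np.log(counts.astype(float))
            log_geo_means = log_counts.mean(axis=1)      # per-gene log geometric mean
            usable = np.isfinite(log_geo_means)          # skip zero-geometric-mean genes
            log_ratios = log_counts[usable] - log_geo_means[usable, None]
            return np.exp(np.median(log_ratios, axis=0))  # one size factor per column

        # Gene 0 is zero in the second sample: it is skipped when computing
        # the size factors, but it is not removed from the count matrix itself.
        counts = np.array([[ 50,   0, 120],
                           [ 10,  22,  38],
                           [200, 410, 790],
                           [ 30,  61, 115]])
        sf = estimate_size_factors(counts)
        ```

        Note that the zero-containing gene only drops out of this one calculation; nothing is deleted from `counts` itself.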


        My question to the forum / Simon is very specifically about "skipping the genes with a geometric mean of zero"

        Skipping genes with a geometric mean of zero seems to me like it might miss quite a few genes, especially in my time-course study: across so many time points there is probably a higher chance, than in a pairwise comparison of just two time points, that even a gene highly expressed at time t1 has zero expression at time t2. Such a gene would have a geometric mean of 0 and would consequently be discarded. I would not want to discard such a gene from my analysis; quite the contrary, actually.

        So, for the purpose of not missing genes, I am trying two things:
        a. pseudo-replace: substitute every raw count of 0 with a raw count of 1, then perform the analysis,
        OR
        b. pseudo-add: add 1 to all raw counts, then perform the analysis.
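        For concreteness, here are the two options applied to a toy count matrix (this only illustrates what each transform does, not whether it is statistically sound):

        ```python
        import numpy as np

        counts = np.array([[50,  0, 120],
                           [10, 22,  38]])

        # Option a ("pseudo-replace"): only the zero cells become ones.
        pseudo_replace = np.where(counts == 0, 1, counts)

        # Option b ("pseudo-add"): every cell is shifted up by one.
        pseudo_add = counts + 1

        # After either transform, no gene has a geometric mean of zero,
        # so no gene would be skipped in a size-factor calculation.
        ```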

        Does my option a. or option b. violate the negative-binomial model or suffer from any intrinsic error that precludes correct conclusions?

        I intend to use my slightly modified data from options a and b to:
        1. normalize using RLE (nomenclature from edgeR),
        2. perform VST if library variances are heteroskedastic, and
        3. finally perform fuzzy-K clustering to obtain dominant temporal patterns of expression.

        Looking forward to your opinions / comments / criticisms

        • #5
          Don't worry. The genes with zero counts are just not used in the calculation of the size factors. They are, of course, not discarded and not excluded from the test for differential expression.

          • #6
            Originally posted by Simon Anders View Post
            Don't worry. The genes with zero counts are just not used in the calculation of the size factors. They are, of course, not discarded and not excluded from the test for differential expression.
            Thanks Simon!

            • #7
              For my time-series-based clustering problem of finding co-expressed genes with identical temporal expression profiles (which is NOT the same as DE gene identification), I assume there is still the problem of over-dispersion across the multiple biological replicates we have. So, will DESeq help perform the variance-stabilizing transformation, after which I can use the transformed data for time-series clustering?
