  • slny
    Member
    • Mar 2011
    • 54

    RNA seq data normalization question

    Hi,

    Currently I'm working on mRNA Seq and have a question about data normalization.

    If the data is already normalized with RPKM, should I normalize it further, for example with TMM?

    Thanks,

    slny
  • bioinfosm
    Senior Member
    • Jan 2008
    • 483

    #2
    I am not sure what you mean by TMM.
    --
    bioinfosm

    • Kennels
      Senior Member
      • Feb 2011
      • 149

      #3
      Hello,

      I think if you computed RPKM first, the values would already carry any RNA library composition bias that TMM aims to compensate for. So if you want to take composition bias into account, perhaps first use the scaling factor produced by TMM to adjust the library read counts, and then compute RPKM? Or just use the edgeR package in its entirety.

      Ken

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        Better tell us what you want to do afterwards with your normalized data. This may influence how you want to normalize.

        • slny
          Member
          • Mar 2011
          • 54

          #5
          Thanks a lot for all the responses.

          Currently I have mRNA-seq data for two groups and would like to find differentially expressed genes. I use the countOverlaps function to count the reads for each gene and then use edgeR or DESeq for data normalization and differential analysis.

          Because the expression level should be the count of reads for each gene divided by the gene length, I wonder whether I should normalize the data with RPKM first and then further normalize the data with TMM in edgeR.

          To answer bioinfosm's question: TMM is a normalization method used by the edgeR package. I believe it is a kind of global normalization, but I am not sure.

          • mgogol
            Senior Member
            • Mar 2008
            • 197

            #6
            TMM is the trimmed mean of M-values and is computed on the counts, not on the RPKM values. It's a way to control for samples with different populations of RNA by computing a sort of "global fold change" between samples, using a trimmed mean as the scaling factor. If your samples are fairly similar to each other, you might not need it, but if you're worried about different populations of RNAs, TMM normalization might help. Then you would use the TMM-normalized read counts to compute differential expression.
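
The trimmed-mean idea described above can be sketched in a few lines of Python. This is a deliberately simplified version, not edgeR's implementation: it trims on M-values only and takes an unweighted mean, whereas edgeR's TMM also trims on average expression (A-values) and weights genes by their estimated precision. The function name `tmm_factor` and the toy counts are made up for illustration.

```python
import math

def tmm_factor(sample, reference, trim=0.3):
    """Simplified TMM scaling factor of `sample` relative to `reference`.

    Assumption: unweighted mean, trimmed on M-values only.
    """
    n_s, n_r = sum(sample), sum(reference)
    # M-value: log2 ratio of library-size-scaled counts,
    # computed only for genes observed in both samples.
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    if not m_values:
        raise ValueError("no gene has non-zero counts in both samples")
    k = int(len(m_values) * trim)            # genes dropped at each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Toy data: both libraries hold 1,000 reads, but in `sample` one gene
# soaks up more than half of them.
reference = [100] * 10
sample = [50] * 9 + [550]
print(tmm_factor(sample, reference))
```

Here the dominant gene halves the apparent share of every other gene. The trimmed mean ignores that outlier, so the factor comes out at about 0.5. Multiplying the sample's library size by it gives an effective library size of about 500, and 50/500 then matches 100/1000 in the reference.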

            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              The normalization methods in DESeq and edgeR are meant to be fed with raw, integer counts. Please do not divide by transcript length before the DE analysis; it will screw up the whole method. For visualization purposes, you may want to divide the normalized counts by transcript length afterwards. (In DESeq, you get normalized counts by dividing the raw counts by the appropriate size factor.) However, think carefully about what to use as transcript length. The original idea of using the sum of all exon lengths was not that good (see, e.g., the cufflinks paper).
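
For the size factors mentioned above, here is a rough sketch of the median-of-ratios idea that DESeq uses: each sample's size factor is the median, over genes, of the ratio between that sample's count and the gene's geometric mean across samples. The function names are illustrative and the counts are toy data; for real work you would use the package's own functions rather than this.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    `counts` is a list of samples; each sample is a list of raw integer
    counts, with genes in the same order in every sample.
    """
    n_genes = len(counts[0])
    # Log geometric mean per gene; genes with any zero count are skipped.
    log_geo_means = []
    for g in range(n_genes):
        gene = [sample[g] for sample in counts]
        if all(c > 0 for c in gene):
            log_geo_means.append(sum(math.log(c) for c in gene) / len(gene))
        else:
            log_geo_means.append(None)
    factors = []
    for sample in counts:
        log_ratios = [
            math.log(sample[g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors

def normalized_counts(counts):
    """Raw counts divided by the size factor of their sample."""
    factors = size_factors(counts)
    return [[c / f for c in sample] for sample, f in zip(counts, factors)]

# Toy data: sample 2 is sequenced twice as deeply, except for the last gene.
raw = [[10, 20, 30, 400],
       [20, 40, 60, 100]]
print(size_factors(raw))
print(normalized_counts(raw))
```

After normalization the first three genes agree between the two samples, and only the last gene still differs. Note that the input is the raw counts, untouched by any length correction.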

              • slny
                Member
                • Mar 2011
                • 54

                #8
                Does TMM consider gene length? If not, how could I adjust the gene expression from the read count for each gene?

                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  Originally posted by slny:
                  Does TMM consider gene length? If not, how could I adjust the gene expression from the read count for each gene?
                  No, it doesn't, because it doesn't need to.

                  This is why I asked what you want to do with your data.

                  If you want to test for differential expression, you want to compare the expression of the same gene in different samples. As the gene has the same length in all your samples, there is no point in dividing by the gene length. You only mask the information on how precise your measurement is.

                  If you want to compare a gene with another gene, then you may want to divide by gene length, but you should be aware that such a comparison opens a whole new can of worms.
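
Both points can be checked with a little arithmetic. The sketch below uses the usual RPKM definition (reads per kilobase per million mapped reads); the gene lengths, counts and library size are invented for illustration, and the noise estimate assumes plain Poisson counting error.

```python
import math

def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of transcript per million mapped reads."""
    return count / (gene_length_bp / 1e3) / (library_size / 1e6)

LIBRARY_SIZE = 10_000_000  # hypothetical: same sequencing depth everywhere

# Same gene (2,000 bp) in two samples: the length cancels in the ratio.
fold_counts = 60 / 30
fold_rpkm = rpkm(60, 2_000, LIBRARY_SIZE) / rpkm(30, 2_000, LIBRARY_SIZE)
print(fold_counts, fold_rpkm)

# Two different genes with identical RPKM but very different read support.
short_gene = {"count": 10, "length_bp": 1_000}
long_gene = {"count": 1_000, "length_bp": 100_000}
for gene in (short_gene, long_gene):
    value = rpkm(gene["count"], gene["length_bp"], LIBRARY_SIZE)
    # Under a Poisson assumption, relative noise is roughly 1 / sqrt(count).
    rel_noise = 1 / math.sqrt(gene["count"])
    print(f"RPKM = {value:.2f}, relative noise = {rel_noise:.3f}")
```

The fold change is the same with or without the length division. The two genes in the second part end up with the same RPKM, yet the one backed by 10 reads is about ten times noisier than the one backed by 1,000 reads, and that difference is no longer visible once only the RPKM is kept.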

                  • slny
                    Member
                    • Mar 2011
                    • 54

                    #10
                    Perfect explanation. Thanks a lot!

                    One more question. Should I log transform the count of reads before I normalize the data?

                    • Simon Anders
                      Senior Member
                      • Feb 2010
                      • 995

                      #11
                      No.

                      By "normalize", do you mean using DESeq's and edgeR's normalisation methods? They expect raw, integer counts, see above.

                      Or do you mean dividing by transcript length? This does not make sense on the log scale, for obvious reasons.

                      • slny
                        Member
                        • Mar 2011
                        • 54

                        #12
                        If we use a Poisson or negative binomial distribution for the differential analysis, then we should not log-transform the counts, because these are discrete probability distributions.

                        Why do we use these discrete probability distributions in sequencing analysis, but the normal distribution in microarray data analysis? Could we log-transform the mRNA-seq data and normalize it with quantile normalization? If so, we could still use a t-test to select differentially expressed genes.

                        • steven
                          Senior Member
                          • Aug 2009
                          • 269

                          #13
                          +1: Do not log-transform count data.

                          • A Oshlack
                            Member
                            • Jun 2010
                            • 17

                            #14
                            Originally posted by slny:
                            Why do we use these discrete probability distributions in sequencing analysis, but normal distribution in microarray data analysis? Could we log transform the mRNA seq data and normalize the data with quantile normalization? If so, we can still use t test to select differentially expressed genes.
                            Cloonan et al., Nature Methods, did exactly what you suggest. However, microarray data are fundamentally different: expression is measured indirectly through the fluorescence of probes and seems to behave normally on the log scale. For sequencing data this is not the case, i.e. when you take the log of Poisson-distributed counts, the result is not normally distributed. We actually tested the Cloonan method in our simulation for the TMM paper and it performed significantly worse than count-based methods, but I don't think that result made it into the paper.

                            One comment on RPKM. In my opinion, you would want to divide by gene length when you are looking at the absolute expression of a gene, i.e. comparing between genes rather than comparing between samples. However, to do a proper comparison between genes you really need to take into account other biases, such as sequence composition.
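
The remark about taking the log of Poisson counts can be explored numerically. The sketch below computes the exact variance and skewness of log(1 + X) for Poisson-distributed X at a few mean counts. The +1 offset (to cope with zero counts) and the chosen means are assumptions made for this illustration, not something taken from the Cloonan or TMM papers.

```python
import math

def log_count_moments(mu):
    """Exact variance and skewness of log(1 + X) for X ~ Poisson(mu)."""
    upper = int(mu + 20 * math.sqrt(mu) + 50)   # tail beyond this is negligible
    ks = range(upper + 1)
    # Poisson probabilities, computed on the log scale for stability.
    probs = [math.exp(k * math.log(mu) - mu - math.lgamma(k + 1)) for k in ks]
    ys = [math.log1p(k) for k in ks]
    mean = sum(p * y for p, y in zip(probs, ys))
    var = sum(p * (y - mean) ** 2 for p, y in zip(probs, ys))
    skew = sum(p * (y - mean) ** 3 for p, y in zip(probs, ys)) / var ** 1.5
    return var, skew

for mu in (2, 20, 200):
    var, skew = log_count_moments(mu)
    print(f"mean count {mu:>3}: variance = {var:.4f}, skewness = {skew:+.2f}")
```

A normal model with one common variance would want these variances to be about the same at every expression level. Instead, the variance of the logged counts shrinks sharply as the mean count grows, so lowly and highly expressed genes do not behave alike after the transform.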

                            • pbseq
                              Member
                              • Feb 2010
                              • 16

                              #15
                              Maybe slightly off topic, but it is related to RNA-seq counting:
                              I always hear about RPKM, but to me, measuring gene expression by covered bases (and not by number of reads) looks more precise. Counting bases instead of reads is very easy (e.g. with the SeqMonk software), but it is so rarely mentioned that I'm wondering whether it's OK for downstream applications.

                              BTW, for differential expression purposes, I use SeqMonk to collect the raw data as follows: I select the probes of interest (e.g. genes, mRNAs or intergenic regions), quantify them by bases (I do not correct for the total number of reads or for gene length, and I don't log-transform), and then feed the raw data to DESeq or edgeR. So far it looks fine to me (at least in my limited experience). Any warnings?
                              Thanks for any comments!
