Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq: estimateSizeFactors, library size estimation

    I am working with high-throuput sequencing data and use DEseq to search for diff. exp. genes.
    The protocol is somewhat similar to RNA-Seq, but not exactly RNA-Seq (we are doing a custom paired end protocol with tags more likely to originate from the start site of transcripts).

    Anyway I use clustering of the tags to define genes, and then map the tags back onto the clusters ("genes") to get the counts of the clusters in different conditions, then feed this as input into DEseq.

    From what I understand, DESeq will estimate the library sizes with "estimateSizeFactors". However, this can only take into account tags that have contributed to my cluster definitions! But there are also tags which mapped somewhere, but did not contribute to a cluster definition.

    If I were using RNA-Seq, and would use RefSeq as the probes over which I determine the raw counts, I think the problem would be the same:

    Shouldn't I rather normalize by the total number of tags aligned? I.e. all tags mapped, not only the ones that contributed to clusters?

    Because imagine this situation: there are two libraries with very different sequencing depth. The different sequencing depths might mainly be caused by sporadic tags outside of your probes (genes, clusters). Inside the probes, the total amount of tags might be more or less similar between libraries. Then, if normalizing library size just by tags mapping within clusters, don't you get wrong results?

  • #2
    Just bumping my own thread - would be happy if someone could comment or share their thoughts on this issue. Thanks!

    Comment


    • #3
      I like to refer people to this publication, though it is a little old. It's a good starting point for someone to start to understand the normalization of RNA-Seq alignments across samples prior to differential expression computations.

      Background High-throughput sequencing technologies, such as the Illumina Genome Analyzer, are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key for drawing meaningful and accurate conclusions from the massive and complex datasets generated by the sequencers. We provide a detailed evaluation of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing (mRNA-Seq) data. Results We compare statistical methods for detecting genes that are significantly DE between two types of biological samples and find that there are substantial differences in how the test statistics handle low-count genes. We evaluate how DE results are affected by features of the sequencing platform, such as, varying gene lengths, base-calling calibration method (with and without phi X control lane), and flow-cell/library preparation effects. We investigate the impact of the read count normalization method on DE results and show that the standard approach of scaling by total lane counts (e.g., RPKM) can bias estimates of DE. We propose more general quantile-based normalization procedures and demonstrate an improvement in DE detection. Conclusions Our results have significant practical and methodological implications for the design and analysis of mRNA-Seq experiments. They highlight the importance of appropriate statistical methods for normalization and DE inference, to account for features of the sequencing platform that could impact the accuracy of results. They also reveal the need for further research in the development of statistical and computational methods for mRNA-Seq.


      Another good explanation of normalization is written in the edgeR user's guide. See section 6: http://www.bioconductor.org/packages...UsersGuide.pdf

      edgeR is another R based differential expression tool. I believe it uses the exact same statistical test as DESeq but different variance models. I use it sometimes to give me a second opinion.

      as it turns out it's not necessarily safe to normalize by total tags aligned. you can kind of check if a normalization is "right" by plotting samples against each other. we would assume that most genes have not been differentially expressed between samples and especially between biological replicates. So if your normalization is good then when you plot those samples against each other in a scatter plot you should find the main body of points centered on a 1:1 line (y = x) showing that, in fact ,most genes are not differentially expressed. If the normalization is bad you'll see the main body of points is off to the side of that line indicating that there's some significant up or down regulation of most genes between the two samples.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • #4
        Thanks sdriscoll for these references, especially the Bullard paper

        Comment


        • #5
          As the question has been asked so often, we have added to our new paper on DEXSeq a supplemetary note explaining the rationale behind our normalization. See Supplementary Note S.1 here: http://precedings.nature.com/documents/6837

          Your own phrasing of the problem already indicates the answer anyway: Imagine that none of your clusters are differentially expressed, but due to some biological or technical artifact, the reads that do not map to clusters are more abundant in one than in the other sample. A normalization based on all counts will skew the clusters in order to account for the non-clusters.

          Comment


          • #6
            Genes with ZERO reads in DESeq normalization

            Originally posted by Simon Anders View Post
            As the question has been asked so often, we have added to our new paper on DEXSeq a supplemetary note explaining the rationale behind our normalization. See Supplementary Note S.1 here: http://precedings.nature.com/documents/6837

            Your own phrasing of the problem already indicates the answer anyway: Imagine that none of your clusters are differentially expressed, but due to some biological or technical artifact, the reads that do not map to clusters are more abundant in one than in the other sample. A normalization based on all counts will skew the clusters in order to account for the non-clusters.
            I am not quite clear how are those genes with 0 counted reads handled in estimateSizeFactors? Discarded?

            I asked this question because the sizeFactors are the same after I excluded those gene with ZERO counts at all conditions. Since the normalization depend on the mediean, and that's why I am confused.

            Comment


            • #7
              Yes, genes with _all_ zero counts are completely ignored.

              Comment


              • #8
                Originally posted by Simon Anders View Post
                Yes, genes with _all_ zero counts are completely ignored.
                Thank you. Simon. How about tags with ZERO count at some conditions (not all conditions)? Below is my R code, and the size factors for deseq/deseq1/deseq2 are the SAME. Again why? I understand there should be no difference between deseq and deseq1 according to your response.

                Code:
                counts = read.csv( counts.file, header=TRUE, row.names=1 )
                counts.i <- round(data.matrix( counts ))
                keep1 <- rowSums(counts.i[, ]) >= 1   #exclude those with ZERO at all conditions
                keep2 <- rowSums(counts.i[, ] >= 1 ) >= length(conditions) #exclude those with ZERO at any condition
                
                #> nrow(counts)
                #[1] 36004
                #> sum(keep1)
                #[1] 28867
                #> sum(keep2)
                #[1] 18330
                
                deseq = newCountDataSet( counts.i, conditions )
                deseq = estimateSizeFactors( deseq )
                
                deseq1 = newCountDataSet( counts.i[keep1,], conditions )
                deseq1 = estimateSizeFactors( deseq1 )
                
                deseq2 = newCountDataSet( counts.i[keep2,], conditions )
                deseq2 = estimateSizeFactors( deseq2 )
                > sizeFactors(deseq)
                JNJ1 JNJ2 JNJ3 JNJ4 JNJ5 JNJ6 JNJ7 JNJ8 JNJ9 JNJ10 JNJ11 JNJ12
                1.246 1.074 1.008 0.827 0.877 0.882 0.887 1.034 1.141 1.188 1.214 1.124
                > sizeFactors(deseq2)
                JNJ1 JNJ2 JNJ3 JNJ4 JNJ5 JNJ6 JNJ7 JNJ8 JNJ9 JNJ10 JNJ11 JNJ12
                1.246 1.074 1.008 0.827 0.877 0.882 0.887 1.034 1.141 1.188 1.214 1.124
                > sizeFactors(deseq1)
                JNJ1 JNJ2 JNJ3 JNJ4 JNJ5 JNJ6 JNJ7 JNJ8 JNJ9 JNJ10 JNJ11 JNJ12
                1.246 1.074 1.008 0.827 0.877 0.882 0.887 1.034 1.141 1.188 1.214 1.124

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                31 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                32 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X