  • How to get normalized count from DESeq?

    Hi all,

I have several NGS libraries and want to use DESeq to normalize them, so that the counts are comparable across libraries. How can I get the normalized counts (as integers rather than floating-point values)?

In the DESeq manual, there is code like this:
cds <- newCountDataSet( countsTable, conds )   # build the count data set
cds <- estimateSizeFactors( cds )              # estimate library size factors
cds <- estimateVarianceFunctions( cds )        # fit the variance functions
res <- nbinomTest( cds, "T", "N" )             # test condition "T" vs "N"

And about the normalization: I have samples 1, 2 and 3, each with only one replicate. If I want to compare samples 1 and 2, should I normalize between 1 and 2 only, and then, to compare samples 2 and 3, normalize again between 2 and 3? Or can I normalize samples 1, 2 and 3 together and then compare 1 vs 2 or 2 vs 3?


    Thanks!

  • #2
    You can divide your counts by the size factors:

    normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )

    Now, the samples are on a common scale.

    Of course, these are not integer values. However, I cannot quite see why you would want to have integer values.

    Simon



    • #3
Thanks Simon, actually float is OK too, or I can round to integer.
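If integer values are needed downstream, the normalized counts from Simon's snippet can simply be rounded; a minimal sketch, assuming cds is the CountDataSet from above:

```r
## Normalize as above, then round to the nearest integer
normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )
roundedCounts   <- round( normalizedCounts )
```

Note that rounded counts are fine for inspection, but the statistical testing in DESeq should always be run on the raw counts.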



      • #4
Normalized counts on a common scale

        Hi, Simon:

I have a question related to your last post about normalized counts on a common scale. My RNA-seq data consist of seven conditions, each with 2 replicates; it is a time-series experiment on seed embryo development. I was wondering whether I could get normalized counts for all the developmental stages on a common scale, to compare expression across the whole time course.
I tried your code, and the result showed a similar scale only within the two replicates of each stage; between stages they are still quite different. Here are the normalized count sums for each library. Note that the same letter indicates the same stage, and 1, 2 indicate the replicates.
        Code:
        Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
25581381.8, 26292834.7, 14023619.9, 12161472.3, 18581478.1, 18749724.2, 10861736.7, 11585558.2, 9996853.0, 10318368.5, 14130401.4, 16239989.6, 30725888.0, 27787233.7
And the original read sums for each library are:
        Code:
        Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
        21012147, 19924212, 26002900, 9660245, 17139388, 7649319, 16430105, 20101956, 12920266, 6306742, 44241095, 20094409, 15166090, 23203758
I assume the normalized sums for each stage should also be close to each other, as they are for replicates within a stage, but I am not sure what function DESeq uses to achieve this. edgeR provides common.lib.size for this purpose, and I assume edgeR and DESeq share the same idea. Am I right?
Actually, I tried edgeR's common.lib.size, but I never got similar numbers across the different stages, which is another reason I am posting this problem here.

Did I miss anything, e.g. are only two conditions allowed?
Since most experiments involve two conditions, I tried a subset of my data with only two conditions, and the result looked the same. Is there anything wrong with my data? The raw counts look fine, and other analyses indicated the data are fine.

I tried different normalization methods (TMM, quantile and RLE in edgeR), and also used the sum of each condition without replicates, but I still could not get onto a common scale with comparable count totals for each stage. I have posted this question to the Bioconductor list but have not yet received a reply from the edgeR group. I would appreciate it if you could address this point.
        Thank you!

        Yifang




        • #5
If you suspect that the normalization might be off, the best check is to compare two samples with an MA plot:

          Code:
plot( 
   ( counts(cds)[,1] + counts(cds)[,2] )/2,   # x: mean count of the two samples
   counts(cds)[,2] / counts(cds)[,1],         # y: ratio of the counts
   log="xy", pch="." )
abline( h = sizeFactors(cds)[2] / sizeFactors(cds)[1] )   # expected ratio
If the size factors are fine, the horizontal line should bisect the bulk of the points down the middle, especially at the high-expression end (right-hand side) of the plot.



          • #6
            Re: Normalized counts between treatments.

            Thanks Simon!

My suspicion concerns the normalization between treatments rather than within a treatment. As you suggested, I did the plotting for a cds built from two replicates of the same treatment (within treatment; left image of the attachment) and for one built from one replicate each of two different treatments (between treatments; right image). The within-treatment plot looks fine, but the between-treatment one does not. Is there anything wrong? This has bugged me for a couple of days!

            Thanks again!

            Yifang
            Attached Files



            • #7
It's really hard to see how many points are plotted on top of each other in your plots. Maybe try adding col="#00000040" to the plot command to add some transparency, or use the smoothScatter function. Apart from that, it looks fine to me: the horizontal line cuts through the middle of the big blob. Of course, you have more genes at the far bottom (outside the blob) than at the far top, but probably your treatment simply causes more downregulation than upregulation.
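For instance, the MA plot from post #5 could be redrawn with smoothScatter roughly like this (a sketch assuming the same cds object; the axis labels and filtering are illustrative):

```r
## Density-based MA plot: colour intensity shows how many points overlap
A <- ( counts(cds)[,1] + counts(cds)[,2] ) / 2      # mean count
M <- counts(cds)[,2] / counts(cds)[,1]              # count ratio
keep <- A > 0 & is.finite(M) & M > 0                # drop zero-count genes
smoothScatter( log10(A[keep]), log10(M[keep]),
               xlab = "log10 mean count", ylab = "log10 ratio" )
abline( h = log10( sizeFactors(cds)[2] / sizeFactors(cds)[1] ) )
```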



              • #8
                Re: Normalized counts between treatments.

                Thanks Simon! This is a big relief to me.
The dots are concentrated around the line once I add transparency to the plot. However, my colleague argues that the total counts for each library should come out close to each other; the edgeR user's guide states several times that "the total counts in each library of the pseudocounts agrees well with the common library size" (pages 27 & 44). But I can't see whether DESeq has a similar parameter behind the scenes.
When I check the sums of baseMeanA and baseMeanB as normalized counts for each library, the numbers are still quite different, which is the same as the result from edgeR. So:
1) Does my data violate an assumption of both packages?
2) Does it not matter if the normalized library sizes of the conditions are different?
3) Is the result still meaningful even if the normalized library sizes are different?
4) What could be the reason(s) for the normalized library sizes being so different?
5) Should I remove the low-count genes, as some other people do? After I removed the lower counts (<=40 in >=6 out of 14 samples), the normalized library sizes became very similar.
I can feel my lack of mathematics for these packages. Your explanation would be greatly appreciated. Thank you again!
                Yifang
                Attached Files



                • #9
                  Originally posted by yifangt View Post
However, my colleague argues that the total counts for each library should come out close to each other; the edgeR user's guide states several times that "the total counts in each library of the pseudocounts agrees well with the common library size" (pages 27 & 44).
If we expected the proper normalization factors to be close to the ratios of the total read counts, we could use the latter directly. The whole point of not using them, and instead estimating the factors from the per-gene counts, is that the two sometimes disagree, and then the per-gene estimate is the more reliable one. Robinson and Oshlack have written a whole paper arguing this point, so maybe you want to ask your sceptical colleagues to read it. ;-)
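The estimate DESeq uses is the median of per-gene ratios to a geometric-mean pseudo-reference; a minimal sketch of that computation on a toy matrix (the variable names are illustrative, not DESeq internals):

```r
## Toy matrix: 3 genes (rows) x 3 samples (columns);
## sample 2 is sequenced twice as deep as sample 1, sample 3 half as deep
counts <- matrix(c(100, 200,  50,
                   400, 800, 200,
                   300, 600, 150), nrow = 3, byrow = TRUE)

## Log geometric mean of each gene across samples (the pseudo-reference)
logGeoMeans <- rowMeans(log(counts))

## Size factor per sample: median of its log ratios to the reference,
## over genes where the reference is finite (i.e. no zero counts)
sizeFactors <- apply(counts, 2, function(cnts)
    exp(median((log(cnts) - logGeoMeans)[is.finite(logGeoMeans)])))

sizeFactors   # 1.0, 2.0, 0.5 for this toy matrix
```

Because the median ignores the genes that change most, a handful of very highly expressed, strongly regulated genes can make the total read count a poor proxy, while the median-of-ratios estimate stays stable.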



                  • #10
Can we get the normalized count for each exon using the same approach in the newly released DEXSeq package? In the visualization graphs generated by DEXSeq, normalized counts are drawn for each exon. Are those values derived by dividing the raw counts by the size factors?

                    thanks for your reply.




                    • #11
                      Hi xguo,

                      Yes, they are derived in the same way. In DEXSeq, you can get them using the parameter normalized=TRUE in the function 'counts':

                      counts(ecs, normalized=TRUE)

                      Cheers!



                      • #12
                        I have a question about normalization and downscaling:

                        I have 7 samples with the following number of aligned reads, based on samtools flagstat
Sample 1a: 50M, Sample 1b: 45M, Sample 2a: 95M, Sample 2b: 45M, Sample 2c: 55M, Sample 3a: 115M, Sample 3b: 30M

When I run cds <- estimateSizeFactors(cds), I get the following size factors:
                        Sample 1a: 0.8, Sample 1b: 0.7, Sample 2a: 1.55, Sample 2b: 0.75, Sample 2c: 0.90, Sample 3a: 1.92, Sample 3b: 0.9

The last sample of condition 3 (Sample 3b) has a higher size factor than I think it should, given its read count. I went to a workshop where they suggested downscaling some data so that the libraries all match in number of reads. Is that what I should do? Does this data look peculiar?



                        • #13
                          And what did they mean by "downscaling"?



                          • #14
Hi Simon, I wasn't sure exactly how they meant to do this, but the point was essentially to discard some reads so that all conditions have roughly the same number of reads before running DESeq. Have you not encountered this technique? Do you think it is a problem that one of my samples (Sample 3a) has 4x the depth of another (Sample 3b), but the size factors that DESeq outputs indicate only a twofold difference?
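A quick way to see how far the estimated size factors deviate from plain depth ratios is to put the two side by side; a sketch assuming the cds object from this thread (the geometric-mean scaling just makes the two columns comparable):

```r
## Compare DESeq's size factors with total-read-count ratios
depth <- colSums( counts(cds) )                     # reads per library
cbind( sizeFactor = sizeFactors(cds),
       depthRatio = depth / exp(mean(log(depth))) ) # depth / geometric mean
```

A large gap between the two columns for a sample is not necessarily an error; it often means a sizeable share of that library's reads comes from a few dominant genes, which is exactly the situation the median-of-ratios estimate is designed to be robust against.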



                            • #15
When I run DESeq and look at my dispersion estimates versus normalized expression, it seems that most genes have either very low or very high read counts, even after normalization. I am normalizing data consisting of three time points (50 samples per time point). This looks very different from the dispersion plot in the manual. Is this a case where filtering out genes with very low read counts would be a good idea? If not, would it be worth normalizing each time point separately? When I compare two samples with an MA plot, it looks fairly normal (though about 6000 data points are omitted).
                              dispersion:
                              sample compare: http://tinypic.com/r/aadgxz/8

Edit: I ran it using cdsFilt and it looks much more like the example in the manual.
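For reference, a cdsFilt object like the one mentioned above is typically built by dropping the weakest genes before testing; a sketch along the lines of the DESeq vignette (the 40% quantile threshold is a judgment call, not a fixed rule):

```r
## Keep only genes whose total count across all samples is above
## the 40% quantile of those totals
rs  <- rowSums( counts(cds) )
use <- rs > quantile( rs, probs = 0.4 )
cdsFilt <- cds[ use, ]
```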

