Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Hi peterawe,
    It should be applicable to miRNA experiments. The method is completely non-parametric and should be applicable to any sequencing experiment type. I had some email correspondence about a year ago with a researcher who was applying preseq to CAGE-seq data. He pointed out a bug in the implementation that we were able to fix. In his case three start positions made up ~90% of the library and he got good results. The preseq estimates accurately predicted the complexity for his deeper experiment. I don't see theoretically why your case would be any different.

    You can email me at [email protected] for specific questions and I can link you to the researcher's blog for his results.

    Comment


    • #32
      Someone was asking me about library size estimate. Google search directed me to this interesting thread. I have a question about your ZTNB model.

      I briefly skimmed through the supplementary note. On page S5 about ZTNB, what does "X" in "Pr(X=j)" stand for? Is Pr(X=j) the probability of seeing a molecule sequenced j times in BAM? If so, this is not a generalization of the Picard model.

      The Picard model is on the basis of this equation "C/X = 1 - exp(-N/X)", where "X = number of distinct molecules in library; N = number of read pairs; C = number of distinct fragments observed in read pairs". "N" is given in the BAM and "C" can be estimated with MarkDuplicates. Solving the equation numerically gives "X", the estimate we care about most. Is it the same as your ZTNB model when alpha takes zero?

      BTW, there are some discussions about Picard's model in this thread.

      Comment


      • #33
        I am aware of the model Picard uses, which is a simple zero-truncated Poisson. You can generalize the zero-truncated Poisson by letting the Poisson rates follow a Gamma distribution, which will result in a zero-truncated Negative Binomial model for the observed duplicate counts. So yes, this is a generalization of the model Picard uses and alpha = 0 will correspond to the zero-truncated Poisson.


        Originally posted by lh3 View Post
        what does "X" in "Pr(X=j)" stand for?
        X is a random variable representing the number of times an observed molecule is sequenced, conditional upon it being sequenced at least once. If Y_1, ...., Y_L are the number of times each molecule in the library is sequenced, then X_1, ....., X_D are the Y's that are greater than 0. Note that D < L and we don't see the Y's that are 0.

        There are major differences in how these things are implemented. The ZTP only requires the mean of the counts to fit the parameter, so Picard merely has to use distinct fraction to fit the parameter. The ZTNB requires the full vector of duplicate counts to fit the MLE. If you want to test out the ZTNB, we made an implementation available with our new R package, preseqR, available at http://cran.r-project.org/web/packag...eqR/index.html. This package is geared towards statisticians and ecologists and so the input is not a sequencing reads file but instead is the duplicate counts, which must be constructed separately.

        Comment


        • #34
          I see, it is conditioned. Then I understand why you say Picard's model is a special case of ZTNB.

          Another question. Why did you say in bw's plot preseq cannot be compared to picard? Isn't preseq's 'EXPECTED DISTINCT' for t->\infty also an estimator of the library size, which is supposed to be the same as picard's 'ESTIMATED_LIBRARY_SIZE'? If they are comparable, the instability of preseq's estimate is worrying. In practical, we mostly care about one number only: the number of distinct molecules in the library.

          In addition, do you consider optical duplicates (a constant number per run) and intrinsic mapping duplicates (ideally Poisson with typically smaller mean)? By mapping duplicates, I mean coordinate-based MarkDuplicates may falsely identify read pairs that happen to mapped with the same external coordinates; for human, compressed centromeres also contribute to false counts more or less. These other duplicates will all push the count distribution away from ZTP. Perhaps Picard does a better job, in terms of stability, because it excludes optical duplicates?
          Last edited by lh3; 10-09-2014, 12:59 PM.

          Comment


          • #35
            Originally posted by lh3 View Post
            Isn't preseq's 'EXPECTED DISTINCT' for t->\infty also an estimator of the library size, which is supposed to be the same as picard's 'ESTIMATED_LIBRARY_SIZE'?
            One would think so, and I initially did too, but the problems of estimating the library size and estimating the yield in distinct reads as a function of sequencing depth are quite different. One example: the latter has an unbiased estimator regardless of bias (i.e. the compounding distribution) and sequencing depth (i.e. t) but the former does not, as we briefly mentioned in the preseq paper. In fact, the library size is non-identifiable without severe restrictions on the compounding distribution (see http://www.jstor.org/stable/3695354, http://www.uni-marburg.de/fb12/stoch...olzmann/19.pdf, and https://www.uni-marburg.de/fb12/stoc...olzmann/14.pdf). Due to impossibility of unbiased estimators, most work is towards lower bounds (see http://cvcl.mit.edu/SUNSeminar/Bunge...trick_1993.pdf for an old review). We have developed a method for improved lower bounds and I can discuss it further outside this public forum.

            Originally posted by lh3 View Post
            In addition, do you consider optical duplicates (a constant number per run) and intrinsic mapping duplicates (ideally Poisson with typically smaller mean)?
            We don't consider either of these, since, as you say, they are constant per run. It is my opinion that these falls more under the purview of technological development. Intrinsic mapping duplicates is a problem for the mapping software, but shouldn't the same read give the same false mapping every time? If these are desired to be avoided, then shouldn't UMI's or SMT's be a part of the library preparation?

            Please email me at [email protected] and we can discuss the problem further. It's a problem that looks trivial from the outset, but gets incredibly deep and difficult as you peel the layers off.

            Comment


            • #36
              With the picard method, optical duplicates will lead to an underestimate of the library size. The lower the coverage, the higher the bias. To see this, suppose we have 0.5% optical duplicate rate independent of coverage. If we sequence to 0.1X coverage, 0.5% duplicate rate is pretty high, which would suggest a small library size. At higher coverage, PCR duplicates, which is loosely around 10% at 30X, dominate. The 0.5% optical duplicates won't matter much. The library size estimate will be larger. Will preseq be affected in a similar way? Removing optical duplicates helps to get a stable estimate.

              Nonetheless, if bw was randomly sampling from a BAM, the optical duplicate count distribution would vary with coverage. This would not explain why 'EXPECTED_DISTINCT' correlates with coverage. I am not sure the typical optical duplicate rate nowadays, either. It used to be visible.

              I can vaguely imagine that the library size is not identifiable unless we know more about the underlying distribution, but it would be still good to make additional assumptions (or based on data) about the distribution such that we can get a reasonable estimate of the library size.

              Thank you for the explanation anyway.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              10 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              9 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              49 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              67 views
              0 likes
              Last Post seqadmin  
              Working...
              X