Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Hi peterawe,
    It should be applicable to miRNA experiments. The method is completely non-parametric and should be applicable to any sequencing experiment type. I had some email correspondence about a year ago with a researcher who was applying preseq to CAGE-seq data. He pointed out a bug in the implementation that we were able to fix. In his case three start positions made up ~90% of the library and he got good results. The preseq estimates accurately predicted the complexity for his deeper experiment. I don't see theoretically why your case would be any different.

    You can email me at [email protected] for specific questions and I can link you to the researcher's blog for his results.

    Comment


    • #32
      Someone was asking me about library size estimate. Google search directed me to this interesting thread. I have a question about your ZTNB model.

      I briefly skimmed through the supplementary note. On page S5 about ZTNB, what does "X" in "Pr(X=j)" stand for? Is Pr(X=j) the probability of seeing a molecule sequenced j times in BAM? If so, this is not a generalization of the Picard model.

      The Picard model is on the basis of this equation "C/X = 1 - exp(-N/X)", where "X = number of distinct molecules in library; N = number of read pairs; C = number of distinct fragments observed in read pairs". "N" is given in the BAM and "C" can be estimated with MarkDuplicates. Solving the equation numerically gives "X", the estimate we care about most. Is it the same as your ZTNB model when alpha takes zero?

      BTW, there are some discussions about Picard's model in this thread.

      Comment


      • #33
        I am aware of the model Picard uses, which is a simple zero-truncated Poisson. You can generalize the zero-truncated Poisson by letting the Poisson rates follow a Gamma distribution, which will result in a zero-truncated Negative Binomial model for the observed duplicate counts. So yes, this is a generalization of the model Picard uses and alpha = 0 will correspond to the zero-truncated Poisson.


        Originally posted by lh3 View Post
        what does "X" in "Pr(X=j)" stand for?
        X is a random variable representing the number of times an observed molecule is sequenced, conditional upon it being sequenced at least once. If Y_1, ...., Y_L are the number of times each molecule in the library is sequenced, then X_1, ....., X_D are the Y's that are greater than 0. Note that D < L and we don't see the Y's that are 0.

        There are major differences in how these things are implemented. The ZTP only requires the mean of the counts to fit the parameter, so Picard merely has to use distinct fraction to fit the parameter. The ZTNB requires the full vector of duplicate counts to fit the MLE. If you want to test out the ZTNB, we made an implementation available with our new R package, preseqR, available at http://cran.r-project.org/web/packag...eqR/index.html. This package is geared towards statisticians and ecologists and so the input is not a sequencing reads file but instead is the duplicate counts, which must be constructed separately.

        Comment


        • #34
          I see, it is conditioned. Then I understand why you say Picard's model is a special case of ZTNB.

          Another question. Why did you say in bw's plot preseq cannot be compared to picard? Isn't preseq's 'EXPECTED DISTINCT' for t->\infty also an estimator of the library size, which is supposed to be the same as picard's 'ESTIMATED_LIBRARY_SIZE'? If they are comparable, the instability of preseq's estimate is worrying. In practical, we mostly care about one number only: the number of distinct molecules in the library.

          In addition, do you consider optical duplicates (a constant number per run) and intrinsic mapping duplicates (ideally Poisson with typically smaller mean)? By mapping duplicates, I mean coordinate-based MarkDuplicates may falsely identify read pairs that happen to mapped with the same external coordinates; for human, compressed centromeres also contribute to false counts more or less. These other duplicates will all push the count distribution away from ZTP. Perhaps Picard does a better job, in terms of stability, because it excludes optical duplicates?
          Last edited by lh3; 10-09-2014, 12:59 PM.

          Comment


          • #35
            Originally posted by lh3 View Post
            Isn't preseq's 'EXPECTED DISTINCT' for t->\infty also an estimator of the library size, which is supposed to be the same as picard's 'ESTIMATED_LIBRARY_SIZE'?
            One would think so, and I initially did too, but the problems of estimating the library size and estimating the yield in distinct reads as a function of sequencing depth are quite different. One example: the latter has an unbiased estimator regardless of bias (i.e. the compounding distribution) and sequencing depth (i.e. t) but the former does not, as we briefly mentioned in the preseq paper. In fact, the library size is non-identifiable without severe restrictions on the compounding distribution (see http://www.jstor.org/stable/3695354, http://www.uni-marburg.de/fb12/stoch...olzmann/19.pdf, and https://www.uni-marburg.de/fb12/stoc...olzmann/14.pdf). Due to impossibility of unbiased estimators, most work is towards lower bounds (see http://cvcl.mit.edu/SUNSeminar/Bunge...trick_1993.pdf for an old review). We have developed a method for improved lower bounds and I can discuss it further outside this public forum.

            Originally posted by lh3 View Post
            In addition, do you consider optical duplicates (a constant number per run) and intrinsic mapping duplicates (ideally Poisson with typically smaller mean)?
            We don't consider either of these, since, as you say, they are constant per run. It is my opinion that these falls more under the purview of technological development. Intrinsic mapping duplicates is a problem for the mapping software, but shouldn't the same read give the same false mapping every time? If these are desired to be avoided, then shouldn't UMI's or SMT's be a part of the library preparation?

            Please email me at [email protected] and we can discuss the problem further. It's a problem that looks trivial from the outset, but gets incredibly deep and difficult as you peel the layers off.

            Comment


            • #36
              With the picard method, optical duplicates will lead to an underestimate of the library size. The lower the coverage, the higher the bias. To see this, suppose we have 0.5% optical duplicate rate independent of coverage. If we sequence to 0.1X coverage, 0.5% duplicate rate is pretty high, which would suggest a small library size. At higher coverage, PCR duplicates, which is loosely around 10% at 30X, dominate. The 0.5% optical duplicates won't matter much. The library size estimate will be larger. Will preseq be affected in a similar way? Removing optical duplicates helps to get a stable estimate.

              Nonetheless, if bw was randomly sampling from a BAM, the optical duplicate count distribution would vary with coverage. This would not explain why 'EXPECTED_DISTINCT' correlates with coverage. I am not sure the typical optical duplicate rate nowadays, either. It used to be visible.

              I can vaguely imagine that the library size is not identifiable unless we know more about the underlying distribution, but it would be still good to make additional assumptions (or based on data) about the distribution such that we can get a reasonable estimate of the library size.

              Thank you for the explanation anyway.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Recent Innovations in Spatial Biology
                by seqadmin


                Spatial biology is an exciting field that encompasses a wide range of techniques and technologies aimed at mapping the organization and interactions of various biomolecules in their native environments. As this area of research progresses, new tools and methodologies are being introduced, accompanied by efforts to establish benchmarking standards and drive technological innovation.

                3D Genomics
                While spatial biology often involves studying proteins and RNAs in their...
                01-01-2025, 07:30 PM
              • seqadmin
                Advancing Precision Medicine for Rare Diseases in Children
                by seqadmin




                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                12-16-2024, 07:57 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 01-09-2025, 04:04 PM
              0 responses
              433 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 01-09-2025, 09:42 AM
              0 responses
              441 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 01-08-2025, 03:17 PM
              0 responses
              457 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 01-03-2025, 11:18 AM
              1 response
              50 views
              1 like
              Last Post Tonia
              by Tonia
               
              Working...
              X