
04-22-2013, 10:04 AM  #21 
Junior Member
Location: Princeton, NJ Join Date: May 2012
Posts: 8

lc_extrap question
Hi Tim,
Thank you for your quick reply, but I now have another question that I hope you can help me understand. When I run the lc_extrap module of your program on my sample, I get the following error: "ERROR: too many iterations, poor sample". I have tried adjusting the number of bootstraps, but that does not seem to help. I'm afraid I don't understand enough about your program to know what this error means for my sample. Are there any parameters I can adjust to get some estimate of approximately how much deeper I would need to sequence to saturate discovery of new (for this sample) exons? Any insight you could offer would again be much appreciated. Thank you. 
06-15-2013, 05:26 PM  #22  
Junior Member
Location: Mars Join Date: Jun 2013
Posts: 2

06-17-2013, 06:37 PM  #23 
Member
Location: Los Angeles Join Date: Jun 2010
Posts: 26

In the publication we binned reads by their starting position, but other binning schemes are equally applicable, and we are experimenting with alternatives for different applications of the method.
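As a toy illustration of binning reads by starting position, here is a hypothetical Python sketch (not the preseq implementation), assuming reads arrive as (chrom, strand, start, end) tuples:

```python
from collections import Counter

def duplicate_counts(reads):
    """Bin reads by (chrom, strand, start) and return the counts
    histogram: hist[j] = number of distinct positions seen j times."""
    per_position = Counter((chrom, strand, start)
                           for chrom, strand, start, _end in reads)
    return Counter(per_position.values())

# toy example: three reads share a start position, two are unique
reads = [("chr1", "+", 100, 150),
         ("chr1", "+", 100, 150),
         ("chr1", "+", 100, 152),
         ("chr1", "+", 200, 250),
         ("chr2", "-", 300, 350)]
hist = duplicate_counts(reads)
# hist[3] == 1 (one position seen 3 times), hist[1] == 2
```

This counts histogram (how many positions were seen once, twice, three times, ...) is the input the complexity estimators work from.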

06-19-2013, 10:00 AM  #24 
Junior Member
Location: Mars Join Date: Jun 2013
Posts: 2


06-19-2013, 10:14 AM  #25 
Member
Location: Los Angeles Join Date: Jun 2010
Posts: 26


09-05-2013, 06:17 PM  #26 
Member
Location: Los Angeles Join Date: Jun 2010
Posts: 26

Hello everyone,
We are announcing an update to the preseq software. Several new options have been added by request, and reported bugs have been fixed. Any other requests or ideas would be appreciated, as we are not fully aware of the myriad applications of our software. New options:
Bugs fixed:
Additionally, the manual has been significantly expanded to include some analysis examples. Again, input, advice, suggestions, issues, or problems would be appreciated. Thank you. 
10-29-2013, 10:24 AM  #27 
Member
Location: San Francisco, CA Join Date: Mar 2012
Posts: 21

Downsampling giving strange results
I ran a comparison to get a feel for preseq and Picard estimates using 5 real exome-seq samples (50bp paired-end) with 50 million reads each. For each sample, I downsampled to between 10% and 90% of the original read count, in 10% increments, by random sampling without replacement.
I then ran preseq (with -P) and Picard's EstimateLibraryComplexity on each of the 45 (5x9) resulting samples. The summary graphs are below. One issue I ran into was that preseq didn't work for 4 out of 5 samples with 5 million reads (10% downsampling), and for 1 out of 5 of the 15 million read samples. The error message was "too many iterations, poor sample", even though it worked for other downsampling percentages of these samples. I also tried running on 5 random exome-seq samples from 1000 Genomes, and 2 of those gave the same error (HG00101 (~120 million reads) and HG00105 (~60 million reads)). Unless I'm misunderstanding something, it looks like preseq gives ~2x higher estimates than Picard (as discussed earlier in this thread), but Picard produces estimates that are much more stable across different read depths. NOTE: the x-axis here counts left and right mates of a paired read separately, so the TOTAL_READS count is twice the actual number of paired-end reads. The following shows preseq error bars for one of the samples. 
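For reference, the downsampling step described above can be sketched as follows — a toy Python stand-in for sampling reads out of a BAM (in practice one would use a tool such as samtools with its subsampling option):

```python
import random

def downsample(reads, fraction, seed=0):
    """Randomly sample a fraction of reads without replacement,
    as done when titrating complexity estimates against depth."""
    k = int(len(reads) * fraction)
    rng = random.Random(seed)   # fixed seed for reproducibility
    return rng.sample(reads, k)

reads = list(range(50_000))     # stand-ins for 50k read pairs
for frac in (0.1, 0.5, 0.9):
    sub = downsample(reads, frac)
    assert len(sub) == int(50_000 * frac)
    assert len(set(sub)) == len(sub)  # without replacement: all distinct
```

Sampling without replacement matters here: sampling with replacement would inflate the apparent duplicate counts and bias every downstream estimate.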
10-29-2013, 12:16 PM  #28  
Member
Location: Los Angeles Join Date: Jun 2010
Posts: 26

Hi bw. I need to clear up some misconceptions first.
Picard's EstimateLibraryComplexity tries to compute the total number of distinct molecules in the library by the Lander-Waterman formula, which is equivalent to fitting a zero-truncated Poisson distribution to the observed duplicate counts (see ESTIMATED_LIBRARY_SIZE). By Jensen's inequality, this estimate will always be an extreme lower bound, regardless of the distribution of molecule abundances, though better and simpler lower bounds exist (see Chao1987.pdf).
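One such bound is the Chao (1987) estimator referenced above, which needs only the number of molecules seen exactly once (f1) and exactly twice (f2). A minimal Python sketch, illustrative only (not preseq or Picard code):

```python
def chao1(counts_hist):
    """Chao (1987) lower bound on the number of distinct molecules:
    observed distinct + f1^2 / (2 * f2), where f1 and f2 are the
    numbers of molecules seen exactly once and exactly twice."""
    distinct = sum(counts_hist.values())
    f1 = counts_hist.get(1, 0)
    f2 = counts_hist.get(2, 0)
    if f2 == 0:
        # bias-corrected form when no doubletons are observed
        return distinct + f1 * (f1 - 1) / 2.0
    return distinct + f1 * f1 / (2.0 * f2)

# e.g. 1000 singletons, 250 doubletons, 50 molecules seen three times:
hist = {1: 1000, 2: 250, 3: 50}
# chao1(hist) == 1300 + 1000**2 / 500 == 3300.0
```

The estimator says nothing about how far above the bound the true library size lies; it only guarantees the library contains at least this many distinct molecules.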
preseq's lc_extrap doesn't estimate the total number of distinct molecules, since in a non-parametric setting no unbiased estimator exists. Instead, we try to predict the complexity at a fixed hypothetical level of sequencing, to evaluate the benefit of additional sequencing and to decide how much sequencing to do by weighing cost against benefit. I'm unsure what you chose this point to be in the first figure, and what you are comparing it to.
Thank you for your analysis. I hope I did not cause more confusion with my rambling. This is a deep and fascinating topic, though I definitely approach it from a far more mathematical standpoint. There may be some confusion, since my language and view of the problem are fundamentally different from a biologist's. I hope my discussion with you will give me insight into your point of view, and vice versa. Thank you, bw. 

10-29-2013, 12:49 PM  #29 
Member
Location: Los Angeles Join Date: Jun 2010
Posts: 26

Oh, one thing I forgot to mention: preseq uses the start position to identify duplicates, and uses both start and end positions in paired-end mode. Picard, I believe, uses the start position and the first 5 bases of the reads. As we discuss in the paper, the method applies regardless of how duplicates are identified. The choice of identification method is a whole other question, discussed in depth by Kivioja et al. (http://www.nature.com/nmeth/journal/...meth.1778.html).
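The difference between the two keying schemes can be shown with a tiny hypothetical Python sketch (field names are illustrative, not preseq's actual interface):

```python
def dedup_key(chrom, strand, start, end=None, paired=False):
    """Grouping key for duplicate identification: start position only
    for single-end reads, start and end for paired-end reads."""
    if paired:
        return (chrom, strand, start, end)
    return (chrom, strand, start)

# two reads sharing a start position but with different end positions:
reads = [("chr1", "+", 100, 180), ("chr1", "+", 100, 240)]
single = {dedup_key(*r) for r in reads}
paired = {dedup_key(*r, paired=True) for r in reads}
assert len(single) == 1   # single-end keying calls them duplicates
assert len(paired) == 2   # paired-end keying keeps them distinct
```

Paired-end keying is strictly finer, so it never reports fewer distinct molecules than single-end keying on the same reads.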

06-04-2014, 01:51 AM  #30 
Member
Location: Oslo, Norway Join Date: Mar 2013
Posts: 11

First of all, thanks for a very nice paper with detailed and informative supplementary material. I have a question regarding the application of your method to small RNA/miRNA sequencing data sets. Naively, I would think that since you state it is applicable to ChIP-seq data, it should be useful for miRNA data sets as well, considering their similarly extreme distribution properties (the top 5 transcripts occupying more than 50% of the entire read space).
However, since you don't mention it explicitly in either the manual or the paper, I wonder if you have given this any consideration? More specifically, I want to compare two different library preparation protocols on the same samples, to see how they perform in terms of sensitivity (~complexity) at similar read depths. Sorry for the long question. 
06-04-2014, 06:09 AM  #31 
Member
Location: Los Angeles Join Date: Jun 2010
Posts: 26

Hi peterawe,
It should be applicable to miRNA experiments. The method is completely non-parametric and should apply to any type of sequencing experiment. I had some email correspondence about a year ago with a researcher who was applying preseq to CAGE-seq data. He pointed out a bug in the implementation that we were able to fix. In his case, three start positions made up ~90% of the library, and he got good results: the preseq estimates accurately predicted the complexity of his deeper experiment. I don't see any theoretical reason why your case would be different. You can email me at tdaley@usc.edu with specific questions, and I can link you to the researcher's blog for his results. 
10-09-2014, 10:03 AM  #32 
Senior Member
Location: Boston Join Date: Feb 2008
Posts: 693

Someone was asking me about library size estimation. A Google search directed me to this interesting thread. I have a question about your ZTNB model.
I briefly skimmed the supplementary note. On page S5, about the ZTNB, what does "X" in "Pr(X=j)" stand for? Is Pr(X=j) the probability of seeing a molecule sequenced j times in the BAM? If so, this is not a generalization of the Picard model. The Picard model is based on the equation C/X = 1 - exp(-N/X), where X = number of distinct molecules in the library, N = number of read pairs, and C = number of distinct fragments observed among the read pairs. N is given by the BAM, and C can be estimated with MarkDuplicates. Solving the equation numerically gives X, the estimate we care about most. Is this the same as your ZTNB model when alpha is zero? BTW, there is some discussion of Picard's model in this thread. 
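The Lander-Waterman equation C/X = 1 - exp(-N/X) can be solved numerically, for instance by bisection. A minimal Python sketch (not Picard's actual implementation), assuming C < N (some duplication observed):

```python
import math

def lander_waterman_size(total_reads, distinct_reads):
    """Numerically solve C/X = 1 - exp(-N/X) for X, with
    N = total read pairs and C = distinct pairs observed (C < N).
    f(X) = C/X - 1 + exp(-N/X) is positive at X = C and negative for
    large X, so bisection finds the unique root."""
    N, C = float(total_reads), float(distinct_reads)
    def f(X):
        return C / X - 1.0 + math.exp(-N / X)
    lo, hi = C, 2.0 * C
    while f(hi) > 0:      # expand the upper bracket until the sign flips
        hi *= 2.0
    for _ in range(200):  # bisect down to the root
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# sanity check against a known answer: a library of 1e6 molecules
# sequenced to N = 1e6 reads is expected to yield
# C = 1e6 * (1 - exp(-1)) distinct reads, so solving should recover ~1e6
```

Bisection is used here for robustness; any one-dimensional root finder would do, since f is monotone on the bracket.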
10-09-2014, 10:22 AM  #33 
Member
Location: Los Angeles Join Date: Jun 2010
Posts: 26

I am aware of the model Picard uses, which is a simple zero-truncated Poisson. You can generalize the zero-truncated Poisson by letting the Poisson rates follow a Gamma distribution, which results in a zero-truncated negative binomial (ZTNB) model for the observed duplicate counts. So yes, this is a generalization of the model Picard uses, and alpha = 0 corresponds to the zero-truncated Poisson.
X is a random variable representing the number of times an observed molecule is sequenced, conditional on it being sequenced at least once. If Y_1, ..., Y_L are the numbers of times each molecule in the library is sequenced, then X_1, ..., X_D are the Y's that are greater than 0. Note that D < L and we don't see the Y's that are 0. There are major differences in how these models are fit. The ZTP only requires the mean of the counts to fit its parameter, so Picard merely has to use the distinct fraction. The ZTNB requires the full vector of duplicate counts to fit the MLE. If you want to try out the ZTNB, we have made an implementation available in our new R package, preseqR, at http://cran.r-project.org/web/packag...eqR/index.html. The package is geared towards statisticians and ecologists, so the input is not a sequencing reads file but the duplicate counts, which must be constructed separately. 
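To make the ZTNB fit concrete, here is a hypothetical Python sketch of the zero-truncated negative binomial log-likelihood and a crude grid-search MLE. This is illustrative only — not the preseqR implementation, which fits the MLE properly:

```python
import math

def ztnb_loglik(r, mu, hist):
    """Log-likelihood of a zero-truncated negative binomial with
    dispersion r and mean mu, given hist[j] = number of molecules
    observed exactly j times (j >= 1)."""
    p = r / (r + mu)                           # NB success probability
    log_p0 = r * math.log(p)                   # log Pr(NB count = 0)
    log_trunc = math.log1p(-math.exp(log_p0))  # log Pr(count >= 1)
    ll = 0.0
    for j, n_j in hist.items():
        log_pj = (math.lgamma(j + r) - math.lgamma(r) - math.lgamma(j + 1)
                  + r * math.log(p) + j * math.log(1.0 - p))
        ll += n_j * (log_pj - log_trunc)       # condition on count >= 1
    return ll

def fit_ztnb(hist):
    """Crude MLE by grid search over (r, mu); returns (loglik, r, mu).
    A real implementation would use EM or Newton steps instead."""
    best = None
    for ri in range(1, 120):
        for mi in range(1, 120):
            r, mu = 0.1 * ri, 0.1 * mi
            ll = ztnb_loglik(r, mu, hist)
            if best is None or ll > best[0]:
                best = (ll, r, mu)
    return best

# e.g. a duplicate-count histogram: 800 singletons, 150 doubletons, ...
ll, r_hat, mu_hat = fit_ztnb({1: 800, 2: 150, 3: 40, 4: 10})
```

Note how the full histogram enters the likelihood, whereas the ZTP needs only the mean count — which is exactly the fitting difference described above.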
10-09-2014, 12:57 PM  #34 
Senior Member
Location: Boston Join Date: Feb 2008
Posts: 693

I see, it is conditioned. Then I understand why you say Picard's model is a special case of the ZTNB.
Another question: why did you say that in bw's plot preseq cannot be compared to Picard? Isn't preseq's EXPECTED_DISTINCT as t → ∞ also an estimator of the library size, which should be the same as Picard's ESTIMATED_LIBRARY_SIZE? If they are comparable, the instability of preseq's estimates is worrying. In practice, we mostly care about one number only: the number of distinct molecules in the library. In addition, do you account for optical duplicates (a roughly constant number per run) and intrinsic mapping duplicates (ideally Poisson, with a typically smaller mean)? By mapping duplicates, I mean that coordinate-based MarkDuplicates may falsely identify read pairs that merely happen to map with the same external coordinates; for human, collapsed centromeres also contribute some false counts. These other duplicates all push the count distribution away from the ZTP. Perhaps Picard does a better job, in terms of stability, because it excludes optical duplicates? 
10-09-2014, 06:33 PM  #35  
Member
Location: Los Angeles Join Date: Jun 2010
Posts: 26

Please email me at tdaley@usc.edu and we can discuss the problem further. It's a problem that looks trivial at the outset, but it gets incredibly deep and difficult as you peel back the layers. 

10-10-2014, 08:25 AM  #36 
Senior Member
Location: Boston Join Date: Feb 2008
Posts: 693

With the Picard method, optical duplicates will lead to an underestimate of the library size; the lower the coverage, the higher the bias. To see this, suppose we have a 0.5% optical duplicate rate independent of coverage. If we sequence to 0.1X coverage, a 0.5% duplicate rate is quite high, which would suggest a small library size. At higher coverage, PCR duplicates, which are loosely around 10% at 30X, dominate, so the 0.5% optical duplicates won't matter much and the library size estimate will be larger. Will preseq be affected in a similar way? Removing optical duplicates helps to get a stable estimate.
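This argument can be checked numerically: under the Lander-Waterman model, subtract a fixed 0.5% optical-duplicate rate from the expected distinct count and re-solve for the library size at low versus high coverage. A hypothetical Python sketch (the 10-million-molecule library and the two read depths are made-up numbers for illustration):

```python
import math

def lw_size(N, C):
    """Bisection solution of C/X = 1 - exp(-N/X) for the library
    size X (requires C < N, i.e. some duplication observed)."""
    f = lambda X: C / X - 1.0 + math.exp(-N / X)
    lo, hi = C, 2.0 * C
    while f(hi) > 0:          # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

L = 10_000_000                               # hypothetical true library size
estimates = {}
for N in (1_000_000, 30_000_000):            # low vs high coverage
    C_true = L * (1 - math.exp(-N / L))      # expected distinct reads
    C_obs = C_true - 0.005 * N               # 0.5% of reads are optical dups
    estimates[N] = lw_size(N, C_obs)

est_low, est_high = estimates[1_000_000], estimates[30_000_000]
# both estimates fall below L, but the relative underestimate
# is larger at low coverage, as argued above
```

Under these toy numbers, the low-coverage estimate is pulled several percent further below the true size than the high-coverage one, matching the intuition that a fixed optical-duplicate rate looms largest when true duplication is rare.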
Nonetheless, if bw was randomly sampling from a BAM, the optical duplicate count distribution would vary with coverage, so this would not explain why EXPECTED_DISTINCT correlates with coverage. I am not sure what the typical optical duplicate rate is nowadays, either; it used to be appreciable. I can vaguely imagine that the library size is not identifiable unless we know more about the underlying distribution, but it would still be good to make additional assumptions (or derive them from data) about the distribution, so that we can get a reasonable estimate of the library size. Thank you for the explanation anyway. 