Old 04-22-2013, 10:04 AM   #21
aprilw
Junior Member
 
Location: Princeton, NJ

Join Date: May 2012
Posts: 8
Question lc_extrap question

Hi Tim,

Thank you for your quick reply. But I now have another question that I hope you can help me understand. When I run the lc_extrap module of your program on my sample I get the following error:

ERROR: too many iterations, poor sample

I have tried manipulating the number of bootstraps, but that does not seem to help. I'm afraid I don't understand enough about your program to know what this error means for my sample.

Are there any parameters I can adjust to get an approximate estimate of how much deeper I would need to sequence to saturate the discovery of exons new to this sample?

Any insight you could offer would again be much appreciated. Thank you.
Old 06-15-2013, 05:26 PM   #22
peterch405
Junior Member
 
Location: Mars

Join Date: Jun 2013
Posts: 2
Default

Quote:
Originally Posted by timydaley View Post
Thanks kadircaner.

We did not include this functionality in the application of the method; it was an example of a broader application. We could also, for example, have applied the method to predicting the number of exons observed in an RNA-seq experiment, where reads are binned by the exon containing the read's starting position (with reads that don't map to exons thrown out). Or we could apply the method to predicting distinct alternative splicing events, with reads counted only if a splice is present, and with duplicates including not only duplicate reads but duplicate splicing events. All that matters is that each read is either identified with exactly one event (i.e. exon or bin; it cannot be identified with more than one event) or is unidentifiable. Analogously, in the species problem discussed in the paper, we could be counting copies of individuals, in which case we would be making inferences about the number of individuals, or counting copies of species, in which case we would be making inferences about the number of species.

If you want to apply this, message me and I can direct you to the software we used to bin the reads, which takes as input a mapped and sorted BED file. A command line argument will then get you the number of duplicates in each bin, and this can be input into the newest version of the software, which allows you to supply just a text file of duplicate counts. This may suffice in the meantime, but we will consider including this as an option in future versions. Thank you.
I have just started playing around with preseq, and I would really like to know the method you used to bin the reads. By the way, congratulations on the nice publication.
Old 06-17-2013, 06:37 PM   #23
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
Default

In the publication we binned reads by starting position, but other binning schemes are equally applicable, and we are experimenting with them for different applications of the method.
Old 06-19-2013, 10:00 AM   #24
peterch405
Junior Member
 
Location: Mars

Join Date: Jun 2013
Posts: 2
Default

Quote:
Originally Posted by timydaley View Post
In the publication we binned read by starting position, but other methods are equally applicable and we are playing around with other methods for different applications of the methods.
But what software were you referring to for the binning?
Old 06-19-2013, 10:14 AM   #25
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
Default

Quote:
Originally Posted by peterch405 View Post
But what software were you referring to for the binning?
We did the binning ourselves using just the starting position.
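For readers who want to reproduce this, binning by start position can be sketched in a few lines. This is only an illustration under assumptions of mine (whitespace-separated BED-like input, chromosome and start in the first two fields); it is not the authors' code.

```python
from collections import Counter

def duplicate_counts_from_bed(lines):
    """Count how many reads share each start position.

    Assumes whitespace-separated BED-like lines whose first two
    fields are chromosome and start coordinate; extend the key with
    strand or mate position for paired-end data as needed.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        counts[(fields[0], int(fields[1]))] += 1
    # One entry per distinct position; a column of duplicate counts
    # like this is the kind of text-file input described above.
    return sorted(counts.values(), reverse=True)

reads = [
    "chr1 100 150 +",
    "chr1 100 150 +",  # same start position: a duplicate
    "chr1 200 250 -",
]
print(duplicate_counts_from_bed(reads))  # [2, 1]
```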
Old 09-05-2013, 06:17 PM   #26
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
Default

Hello everyone,
We are announcing an update to the preseq software. Several new options have been added by request, and reported bugs have been fixed. Any other requests or ideas would be appreciated, as we are not fully aware of the myriad applications of our software.

New options:
  • -V option for input of a text file consisting of a single column of duplicate counts. This may, for example, be the number of hits for each exon in an RNA-seq experiment (with zero counts excluded) or the number of hits in non-overlapping bins. Several other examples of such input are given in the updated manual. Due to the vast number of applications of the preseq software, we leave the construction of the duplicate counts to the user.
  • -H option for input of a text file consisting of the duplicate count histogram.
  • -Q option for quick mode in lc_extrap. The library complexity is extrapolated using the observed duplicate counts directly; the duplicate count histogram is not bootstrapped, saving significant computational time, but confidence intervals are therefore not computed in quick mode.
  • Y50 statistic: if lc_extrap is run in verbose mode (-v option), the Y50 statistic is output to stderr. This is the number of reads needed to achieve a 50% duplication level (the level can be modified with the -d option). Y50 can be used to compare multiple libraries succinctly.
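To illustrate how the -V and -H inputs relate, here is a sketch that collapses a column of duplicate counts into the corresponding histogram. The exact file formats should be checked against the manual; this is my illustration, not preseq code.

```python
from collections import Counter

def counts_to_histogram(dup_counts):
    """Collapse per-class duplicate counts (-V style input: one count
    per exon, bin, or position) into a histogram (-H style input):
    for each multiplicity j, the number of classes seen exactly j times."""
    return sorted(Counter(dup_counts).items())

# e.g. exon hit counts, with zero counts already excluded
print(counts_to_histogram([2, 1, 1, 3, 1]))  # [(1, 3), (2, 1), (3, 1)]
```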

Bugs fixed:
  • lc_extrap now defaults to using the observed duplicate counts for extrapolation, rather than bootstrapping the histogram and taking the average curve. This has little effect on the estimates for most libraries but becomes a problem if one or a few counts are extremely large (e.g. > 1/4 of all reads corresponding to a single locus or class). This was brought to my attention by Dave Tang through his analysis of a CAGE-seq experiment (see his blog entry http://davetang.org/muse/2013/07/10/...d-we-sequence/ for further explanation).
  • For GCC 4.7+, a compilation error would arise because the inheritance of getpid() changed. This has been fixed.

Additionally, the manual has been significantly expanded to include some analysis examples. Again, any input, advice, suggestions, issues, or problems would be appreciated. Thank you.
Old 10-29-2013, 10:24 AM   #27
bw.
Member
 
Location: San Francisco, CA

Join Date: Mar 2012
Posts: 21
Default Downsampling giving strange results

I ran a comparison to get a feel for Preseq and Picard estimates, using 5 real exome-seq samples (50 bp paired-end) with 50 million reads each. For each sample, I downsampled to between 10% and 90% of the original read count, in 10% increments, by random sampling without replacement.
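That sampling step can be sketched as follows (illustrative only; not the exact procedure or tool used for this comparison):

```python
import random

def downsample(read_ids, fraction, seed=0):
    """Keep a fixed fraction of reads, chosen uniformly without
    replacement (random.sample never returns the same element twice)."""
    k = int(len(read_ids) * fraction)
    return random.Random(seed).sample(read_ids, k)

reads = list(range(50_000))   # stand-ins for read identifiers
subset = downsample(reads, 0.3)
print(len(subset))            # 15000 reads, all distinct
```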
I then ran Preseq (with -P) and Picard EstimateLibraryComplexity on each of the 45 (5x9) resulting samples. The summary graphs are below.

One issue I ran into was that Preseq didn't work for 4 out of 5 samples with 5 million reads (10% downsampling) and 1 out of 5 of the 15 mil. read samples. The error message was "too many iterations, poor sample" - even though it worked for other downsampling %s of these samples.

I also tried running on 5 random exome seq samples from 1000 Genomes, and 2 of those gave the same error (HG00101 (~ 120 mil reads) and HG00105 (~ 60 mil reads)).

Unless I'm misunderstanding something, it looks like Preseq gives ~ 2x higher estimates than Picard (as discussed earlier in this thread), but Picard produces estimates that are much more stable across different read depths.

NOTE: the x-axis here counts left and right mates of a paired read separately, so the TOTAL_READS count is twice the actual number of paired-end reads.


The following shows Preseq error bars for one of the samples.

Last edited by bw.; 10-29-2013 at 11:55 AM.
Old 10-29-2013, 12:16 PM   #28
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
Default

Hi bw. I need to clear up some misconceptions first.

Picard's EstimateLibraryComplexity tries to compute the total number of distinct molecules in the library via the Lander-Waterman formula, which is equivalent to fitting a zero-truncated Poisson distribution to the observed duplicate counts (see ESTIMATED_LIBRARY_SIZE). By Jensen's inequality, this estimate will always be an extreme lower bound, regardless of the distribution of molecule abundances, though better and simpler lower bounds exist (see Chao1987.pdf).
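For reference, the Chao (1987) lower bound mentioned here uses only the numbers of classes seen exactly once and exactly twice. A sketch (my implementation, not preseq's):

```python
def chao1_lower_bound(dup_counts):
    """Chao (1987) lower bound on the number of classes in the library:
    observed distinct classes + f1^2 / (2 * f2), where f1 and f2 are the
    numbers of classes observed exactly once and exactly twice."""
    f1 = sum(1 for c in dup_counts if c == 1)
    f2 = sum(1 for c in dup_counts if c == 2)
    if f2 == 0:
        # bias-corrected variant commonly used when no doubletons exist
        return len(dup_counts) + f1 * (f1 - 1) / 2.0
    return len(dup_counts) + f1 * f1 / (2.0 * f2)

print(chao1_lower_bound([1, 1, 1, 2, 2, 5]))  # 6 + 9/4 = 8.25
```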

Quote:
Originally Posted by bw. View Post
Picard produce estimates that are much more stable across different downsamplings of a given sample.
What's the point of being stable when it's inaccurate? A perfectly stable estimate is 3 billion distinct molecules, one for every base in the genome. Might not be accurate, but it's certainly stable.

preseq's lc_extrap doesn't estimate the total number of distinct molecules, since in a non-parametric setting no unbiased estimator exists. Instead we try to predict the complexity at a fixed hypothetical level of sequencing, to evaluate the benefit of additional sequencing and to decide how much sequencing to do by weighing cost against benefit. I'm unsure what you chose this point to be in the first figure and what you are comparing it to.

Quote:
Originally Posted by bw. View Post
One issue I ran into was that Preseq didn't work for 4 out of 5 samples with 5 million reads (10% downsampling) and 1 out of 5 of the 15 mil. read samples. The error message was "too many iterations, poor sample" - even though it worked for other downsampling %s of these samples.
preseq uses the observed duplicate count histogram to extrapolate the expected number of new distinct reads that will be gained with additional sequencing. This will not work if the duplicate count histogram is not sufficiently full, which may be what is happening in the 2.5M read samples, though the error message should be different in that case. In the larger samples, the bootstrapping is failing to find acceptable rational function approximations to the complexity curve, due to defects caused by the random nature of the input. We have made some recent progress on improving this and have included an option (-Q) to skip the bootstrapping and perform a single extrapolation on the input histogram; confidence intervals are then missing, but in our experience they tend to be overly large anyway. Can you try the failed samples with the newest version? (source-code)

Thank you for your analysis, and I hope I did not cause more confusion with my rambling. This is a deep and fascinating topic, though I approach it from a much more mathematical standpoint, and there may be some confusion because my language and view of the problem are fundamentally different from a biologist's. I hope my discussion with you will give me insight into your point of view, and vice versa. Thank you, bw.
Old 10-29-2013, 12:49 PM   #29
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
Default

Oh, one thing I forgot to mention: preseq uses the start position to identify duplicates, and the start and end positions in paired-end mode. Picard, I believe, uses the start position and the first 5 bases of the reads. As we discuss in the paper, the method applies regardless of how duplicates are identified. The choice of identification method is a whole other question, discussed in depth by Kivioja et al. (http://www.nature.com/nmeth/journal/...meth.1778.html).
Old 06-04-2014, 01:51 AM   #30
peterawe
Member
 
Location: Oslo, Norway

Join Date: Mar 2013
Posts: 11
Default

First of all, thanks for a very nice paper with detailed and informative supplementary material. I have a question regarding the application of your method to small RNA/miRNA sequencing data sets. Naively, since you state that it's applicable to ChIP-seq data, I would think it should be useful for miRNA data sets as well, considering their similarly extreme distribution properties (the top 5 transcripts occupying more than 50% of the entire read space).

However, since you don't mention it explicitly in either the manual or the paper, I wonder if you have given this any consideration? More specifically, I want to compare two different library preparation protocols on the same samples, to see how they perform in terms of sensitivity (~ complexity) at similar read depths. Sorry for the long question.
Old 06-04-2014, 06:09 AM   #31
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
Default

Hi peterawe,
It should be applicable to miRNA experiments. The method is completely non-parametric and should apply to any type of sequencing experiment. I had some email correspondence about a year ago with a researcher applying preseq to CAGE-seq data; he pointed out a bug in the implementation that we were able to fix. In his case three start positions made up ~90% of the library, and he got good results: the preseq estimates accurately predicted the complexity of his deeper experiment. I don't see theoretically why your case would be any different.

You can email me at tdaley@usc.edu for specific questions and I can link you to the researcher's blog for his results.
Old 10-09-2014, 10:03 AM   #32
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Someone was asking me about library size estimation, and a Google search directed me to this interesting thread. I have a question about your ZTNB model.

I briefly skimmed through the supplementary note. On page S5 about ZTNB, what does "X" in "Pr(X=j)" stand for? Is Pr(X=j) the probability of seeing a molecule sequenced j times in BAM? If so, this is not a generalization of the Picard model.

The Picard model is based on the equation C/X = 1 - exp(-N/X), where X = number of distinct molecules in the library, N = number of read pairs, and C = number of distinct fragments observed among the read pairs. N is given in the BAM and C can be estimated with MarkDuplicates. Solving the equation numerically gives X, the estimate we care about most. Is this the same as your ZTNB model when alpha is zero?
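The equation can be solved with a simple bisection. The sketch below is my reading of the Lander-Waterman estimate, not Picard's actual implementation:

```python
import math

def estimate_library_size(n_read_pairs, n_distinct):
    """Solve C/X = 1 - exp(-N/X) for X by bisection, where N is the
    number of read pairs and C the number of distinct fragments.
    Requires C < N, i.e. at least one duplicate was observed."""
    assert n_distinct < n_read_pairs

    def f(x):
        return n_distinct / x - (1.0 - math.exp(-n_read_pairs / x))

    lo, hi = float(n_distinct), float(n_distinct) * 1e9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:      # X too small: the root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# e.g. 10M read pairs of which 8M are distinct -> X is well above 8M
size = estimate_library_size(10_000_000, 8_000_000)
print(round(size))
```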

BTW, there are some discussions about Picard's model in this thread.
Old 10-09-2014, 10:22 AM   #33
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
Default

I am aware of the model Picard uses, which is a simple zero-truncated Poisson. You can generalize the zero-truncated Poisson by letting the Poisson rates follow a Gamma distribution, which will result in a zero-truncated Negative Binomial model for the observed duplicate counts. So yes, this is a generalization of the model Picard uses and alpha = 0 will correspond to the zero-truncated Poisson.
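Numerically, the relationship can be checked by comparing the ZTNB pmf at very small alpha with the zero-truncated Poisson pmf. The parameterization below (mean mu, dispersion alpha, size k = 1/alpha) is my own choice of notation, not necessarily preseq's:

```python
import math

def ztnb_pmf(j, mu, alpha):
    """Zero-truncated negative binomial pmf with mean mu and dispersion
    alpha (size k = 1/alpha); alpha -> 0 recovers the zero-truncated
    Poisson, mirroring the Gamma-mixture argument above."""
    k = 1.0 / alpha
    log_p0 = k * (math.log(k) - math.log(k + mu))          # NB P(0)
    log_pj = (math.lgamma(j + k) - math.lgamma(k) - math.lgamma(j + 1)
              + log_p0 + j * (math.log(mu) - math.log(k + mu)))
    return math.exp(log_pj) / (1.0 - math.exp(log_p0))

def ztp_pmf(j, lam):
    """Zero-truncated Poisson pmf."""
    return (math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
            / (1.0 - math.exp(-lam)))

print(ztnb_pmf(2, 1.5, 1e-6))  # ~ the ZTP value below
print(ztp_pmf(2, 1.5))
```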


Quote:
Originally Posted by lh3 View Post
what does "X" in "Pr(X=j)" stand for?
X is a random variable representing the number of times an observed molecule is sequenced, conditional on its being sequenced at least once. If Y_1, ..., Y_L are the numbers of times each molecule in the library is sequenced, then X_1, ..., X_D are the Y's that are greater than 0. Note that D < L and we don't see the Y's that are 0.

There are major differences in how these models are fitted. The ZTP requires only the mean of the counts to fit its parameter, so Picard merely has to use the distinct fraction. The ZTNB requires the full vector of duplicate counts to fit the MLE. If you want to test the ZTNB, we have made an implementation available in our new R package, preseqR, available at http://cran.r-project.org/web/packag...eqR/index.html. This package is geared towards statisticians and ecologists, so the input is not a sequencing reads file but the duplicate counts, which must be constructed separately.
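The first point, that the ZTP fit needs only the mean, can be illustrated as follows (function names are mine, not preseqR's API):

```python
import math

def ztp_lambda_from_mean(mean_count):
    """MLE of the zero-truncated Poisson rate. The ZTP mean is
    lambda / (1 - exp(-lambda)), a monotone function of lambda, so the
    mean duplicate count alone pins down lambda; solve by bisection."""
    assert mean_count > 1.0  # the ZTP mean always exceeds 1
    lo, hi = 1e-9, 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mid / (1.0 - math.exp(-mid)) < mean_count:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

counts = [1, 1, 1, 2, 3]   # duplicate counts, mean 1.6
lam = ztp_lambda_from_mean(sum(counts) / len(counts))
print(round(lam, 4))
```

Fitting the ZTNB, by contrast, requires maximizing a likelihood over the full counts vector with a numerical optimizer, which is what preseqR provides.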
Old 10-09-2014, 12:57 PM   #34
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

I see, it is conditioned. Then I understand why you say Picard's model is a special case of ZTNB.

Another question: why did you say that in bw's plot preseq cannot be compared to Picard? Isn't preseq's 'EXPECTED_DISTINCT' for t -> \infty also an estimator of the library size, which should be the same as Picard's 'ESTIMATED_LIBRARY_SIZE'? If they are comparable, the instability of preseq's estimate is worrying. In practice, we mostly care about one number only: the number of distinct molecules in the library.

In addition, do you consider optical duplicates (a roughly constant number per run) and intrinsic mapping duplicates (ideally Poisson, with a typically smaller mean)? By mapping duplicates I mean that coordinate-based MarkDuplicates may falsely flag read pairs that merely happen to map with the same external coordinates; for human, compressed centromeres also contribute false counts to some extent. These other duplicates all push the count distribution away from a ZTP. Perhaps Picard does a better job, in terms of stability, because it excludes optical duplicates?

Last edited by lh3; 10-09-2014 at 12:59 PM.
Old 10-09-2014, 06:33 PM   #35
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
Default

Quote:
Originally Posted by lh3 View Post
Isn't preseq's 'EXPECTED DISTINCT' for t->\infty also an estimator of the library size, which is supposed to be the same as picard's 'ESTIMATED_LIBRARY_SIZE'?
One would think so, and I initially did too, but the problems of estimating the library size and estimating the yield of distinct reads as a function of sequencing depth are quite different. One example: the latter has an unbiased estimator regardless of bias (i.e. the compounding distribution) and sequencing depth (i.e. t), but the former does not, as we briefly mentioned in the preseq paper. In fact, the library size is non-identifiable without severe restrictions on the compounding distribution (see http://www.jstor.org/stable/3695354, http://www.uni-marburg.de/fb12/stoch...olzmann/19.pdf, and https://www.uni-marburg.de/fb12/stoc...olzmann/14.pdf). Due to the impossibility of unbiased estimators, most work focuses on lower bounds (see http://cvcl.mit.edu/SUNSeminar/Bunge...trick_1993.pdf for an older review). We have developed a method for improved lower bounds, and I can discuss it further outside this public forum.

Quote:
Originally Posted by lh3 View Post
In addition, do you consider optical duplicates (a constant number per run) and intrinsic mapping duplicates (ideally Poisson with typically smaller mean)?
We don't consider either of these since, as you say, they are constant per run. In my opinion these fall more under the purview of technological development. Intrinsic mapping duplicates are a problem for the mapping software, but shouldn't the same read give the same false mapping every time? And if these are to be avoided, shouldn't UMIs or SMTs be part of the library preparation?

Please email me at tdaley@usc.edu and we can discuss the problem further. It's a problem that looks trivial from the outset, but gets incredibly deep and difficult as you peel the layers off.
Old 10-10-2014, 08:25 AM   #36
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

With the Picard method, optical duplicates will lead to an underestimate of the library size, and the lower the coverage, the greater the bias. To see this, suppose we have a 0.5% optical duplicate rate independent of coverage. If we sequence to 0.1X coverage, a 0.5% duplicate rate is pretty high and would suggest a small library size. At higher coverage, PCR duplicates, which are loosely around 10% at 30X, dominate; the 0.5% optical duplicates won't matter much, and the library size estimate will be larger. Will preseq be affected in a similar way? Removing optical duplicates helps to get a stable estimate.
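A back-of-the-envelope version of this effect, with all numbers hypothetical:

```python
import math

def poisson_dup_fraction(n_reads, lib_size):
    """Expected duplicate fraction under the Lander-Waterman/Poisson
    model: 1 - E[distinct]/N, with E[distinct] = L * (1 - exp(-N/L))."""
    distinct = lib_size * (1.0 - math.exp(-n_reads / lib_size))
    return 1.0 - distinct / n_reads

# A 100M-molecule library sequenced shallowly to 1M reads:
n, lib = 1e6, 1e8
pcr = poisson_dup_fraction(n, lib)   # ~0.5% PCR duplicates
apparent = pcr + 0.005               # plus a constant 0.5% optical rate
# At low coverage dup_frac ~ N/(2L), so the implied size is ~ N/(2*dup):
implied = n / (2.0 * apparent)
print(f"PCR dup fraction {pcr:.4f}; implied library size {implied:.3g}")
# The constant optical duplicates roughly halve the inferred size here.
```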

Nonetheless, if bw was randomly sampling from a BAM, the optical duplicate count distribution would vary with coverage, so this would not explain why 'EXPECTED_DISTINCT' correlates with coverage. I am also not sure what the typical optical duplicate rate is nowadays; it used to be appreciable.

I can vaguely see why the library size is not identifiable unless we know more about the underlying distribution, but it would still be good to make additional assumptions about the distribution (or derive them from data) so that we can obtain a reasonable estimate of the library size.

Thank you for the explanation anyway.