Seqanswers Leaderboard Ad

**timydaley** · 04-01-2013, 10:54 AM

Originally posted by Emo.Zhao View Post

So the overestimate of ZTNB must be the result of second parameter of negative binomial not being estimated correctly from data?

That is assuming the data is Negative Binomially distributed, which we know it is not. The ZTNB parameter estimation was naturally tested against ZTNB distributed data to ensure the EM algorithm worked correctly. We do know that the simple Poisson model will consistently underestimate the complexity due to the strict concavity of the objective function (1 - e^{- t * lambda}) and applying Jensen's inequality. The ZTNB will tend to underestimate, and this has been observed in the capture-recapture literature, but there is no guarantee since we cannot apply Jensen in this case.

**kadircaner** · 04-03-2013, 12:59 PM

Hi Timothy,

Congrats on your paper and thanks a lot for such a useful tool!!

I have a question about enriched genomic windows, in your paper you said windows could be a better read out compared to distinct reads for library saturation/complexity.

But you did not mention how one can use your tool to perform such analysis... Do you use another tool for testing enriched genomic windows or there is hidden parameter in the tool?

Thanks in advance...

**timydaley** · 04-03-2013, 01:53 PM

Thanks kadircaner.

We did not include this functionality of this application to the method. It was an example for a broader application of the method. We could have also, for example, applied the method to predicting the number of exons observed in a RNA-seq experiment, where reads are binned by the exon in which the starting position of the read is contained in (with reads that don't map to exons thrown out). Or we apply to the method to predicting distinct alternative splicing events, with reads counted only if a splice is present and duplicates include not only duplicate reads but duplicate splicing events. All that matters is that each read is either identified with an event (i.e. exon or bin, but cannot be identified with more than one event) or is unidentifiable. Analogously, we could be thinking about the species problem discussed in the paper and counting copies of individuals, in which case we would be making inferences on the number of individuals, or counting copies of species, in which case we would be making inferences on the number of species.

If you want to apply this, message me and I can direct you how to download the software we used to bin the reads which takes as input a mapped and sorted bed file. A command line argument will then get you the number duplicates in a bin and this can be inputed in the newest version of the software, which allows you to just input a text file of duplicate counts. This may suffice in the meantime but we will consider including this as an option in future versions, thank you.

**aprilw** · 04-17-2013, 10:25 AM

Loading count table

Hi! Great paper!

I have a question about how to use preseq with the table of counts.

I have generated a table from our RNA-Seq experiments that denotes how many reads map to each exon regardless of whether or not the reads are exactly the same.

When I try to load these counts as a text file with the -V option I get the following error:
INVALID INPUT 0
ERROR: ERROR IN INPUT

I think I'm misunderstanding what you mean by "input is a text file of counts in a single column" and "one count per line" in the manual. What I've been inputting is simply a single column list of counts, for example:

610
0
2554
109
4
6
...

Where each row is a separate exon, some exons are not observed in this experiment and/or at this level of coverage so they have 0 values.

Some clarification to what exactly I've misunderstood would be much appreciated.

Thanks in advance!
April

**timydaley** · 04-17-2013, 05:05 PM

April,
Thank you for using our software. It seems that you generated the counts correctly except for including the zero counts. Our software does not take 0 values as input. This is because you can't be sure if, in your case, you didn't see an exon because it wasn't included in any expressed transcript and isn't in the library, or if the exon is contained in the library and was missed due to low sequencing depth or random chance. If you just take the zero counts out, it should work fine.

**aprilw** · 04-22-2013, 10:04 AM

lc_extrap question

Hi Tim,

Thank you for your quick reply. But I now have another question that I hope you can help me understand. When I run the lc_extrap module of your program on my sample I get the following error:

ERROR: too many iterations, poor sample

I have tried manipulating the number of bootstraps, but that does not seem to help. I'm afraid I don't understand enough about your program to understand what this error means for my sample.

Are there any parameters I can adjust to get any sort of result on approximately how much deeper I would need to sequence to saturate for discovery of new (for sample) exons?

Any insight you could offer would again be much appreciated. Thank you.

**peterch405** · 06-15-2013, 05:26 PM

Originally posted by timydaley View Post

Thanks kadircaner.

We did not include this functionality of this application to the method. It was an example for a broader application of the method. We could have also, for example, applied the method to predicting the number of exons observed in a RNA-seq experiment, where reads are binned by the exon in which the starting position of the read is contained in (with reads that don't map to exons thrown out). Or we apply to the method to predicting distinct alternative splicing events, with reads counted only if a splice is present and duplicates include not only duplicate reads but duplicate splicing events. All that matters is that each read is either identified with an event (i.e. exon or bin, but cannot be identified with more than one event) or is unidentifiable. Analogously, we could be thinking about the species problem discussed in the paper and counting copies of individuals, in which case we would be making inferences on the number of individuals, or counting copies of species, in which case we would be making inferences on the number of species.

If you want to apply this, message me and I can direct you how to download the software we used to bin the reads which takes as input a mapped and sorted bed file. A command line argument will then get you the number duplicates in a bin and this can be inputed in the newest version of the software, which allows you to just input a text file of duplicate counts. This may suffice in the meantime but we will consider including this as an option in future versions, thank you.

I have just started playing around with preseq and I would really like to know the method you used to bin reads. By the way congratulations on the nice publication.

**timydaley** · 06-17-2013, 06:37 PM

In the publication we binned read by starting position, but other methods are equally applicable and we are playing around with other methods for different applications of the methods.

**peterch405** · 06-19-2013, 10:00 AM

Originally posted by timydaley View Post

In the publication we binned read by starting position, but other methods are equally applicable and we are playing around with other methods for different applications of the methods.

But what software were you referring to for the binning?

**timydaley** · 06-19-2013, 10:14 AM

Originally posted by peterch405 View Post

But what software were you referring to for the binning?

We did the binning ourselves using just the starting position.

**timydaley** · 09-05-2013, 06:17 PM

Hello everyone,
We are announcing an update to the preseq software. Several new options were added upon request and reported bugs were fixed. Any other requests or ideas would be appreciated as we are not fully aware of the myriad of applications for our software.

New options:

-V option for input of a text file consisting of a single column of duplicate counts. This may, for example be the number of hits for each exon in an RNA-seq experiment (with the zero counts excluded) or the number of hits in non-overlapping bins. Several other examples of using such input is given in the updated manual. Due to vast number of applications of the preseq software, we leave the construction of the duplicate counts to the user.
-H option for input of a text file consisting of the duplicate count histogram.
-Q option for quick mode in lc_extrap. The library complexity is extrapolated using the observed duplicate counts. The duplicate count histogram is not bootstrapped for confidence intervals, saving significant computational time. Unfortunately confidence intervals will not be computed in quick mode.
Y50 statistic: If lc_extrap is run in verbose mode (-v option), the Y50 statistic is outputed to stderr. This is the number of reads needed to achieve 50% duplication level. The duplication level can be modified with the option -d. Y50 can be used to compare multiple libraries succinctly.

Bugs fixed:

lc_extrap now defaults to using the observed duplicate counts for extrapolation, rather than bootstrapping the histogram and taking the average curve. This has little effect on the estimates for most libraries but becomes a problem if one or a few counts are extremely large (as in > 1/4 of all reads correspond to a single loci or class). This was brought to my attention by Dave Tang with his analysis of a CAGE-seq experiment (see his blog entry http://davetang.org/muse/2013/07/10/...d-we-sequence/ for further explanation).
For GCC 4.7+, an error would arise due to the inheritance of getpid() was changed, resulting in a compiling error. This has been fixed.

Additionally the manual has been significantly expanded to include some analysis examples. Again input, advice, suggestions, issues, or problems would be appreciated. Thank you.

**bw.** · 10-29-2013, 10:24 AM

Downsampling giving strange results

I ran a comparison to get a feel for Preseq and picard estimates using 5 real exome seq samples (50bp paired-end) with 50 million reads each. For each of the samples, I downsampled to between 10% and 90% of the original read counts in 10% increments by random sampling without replacement.
I then ran Preseq (with -P) and Picard EstimateLibraryComplexity on each of the 45 (5x9) resulting samples. The summary graphs are below.

One issue I ran into was that Preseq didn't work for 4 out of 5 samples with 5 million reads (10% downsampling) and 1 out of 5 of the 15 mil. read samples. The error message was "too many iterations, poor sample" - even though it worked for other downsampling %s of these samples.

I also tried running on 5 random exome seq samples from 1000 Genomes, and 2 of those gave the same error (HG00101 (~ 120 mil reads) and HG00105 (~ 60 mil reads)).

Unless I'm misunderstanding something, it looks like Preseq gives ~ 2x higher estimates than Picard (as discussed earlier in this thread), but Picard produces estimates that are much more stable across different read depths.

NOTE: the x-axis here counts left and right mates of a paired read separately, so the TOTAL_READS count is twice the actual number of paired-end reads.

The following shows Preseq error bars for one of the samples.

**timydaley** · 10-29-2013, 12:16 PM

Hi bw. I need to clear up some misconceptions first.

Picard's EstimateLibraryComplexity tries to compute the total number of distinct molecules in the library by the Lander-Waterman formula, which is equivalent to fitting a zero-truncated Poisson distribution to the observed duplicate counts (see ESTIMATED_LIBRARY_SIZE). This estimate will always be an extreme lower bound, regardless of the distribution of molecule abundances by Jensen's inequality, though better and simpler lower bounds exist (see Chao1987.pdf).

Originally posted by bw. View Post

Picard produce estimates that are much more stable across different downsamplings of a given sample.

What's the point of being stable when it's inaccurate? A perfectly stable estimate is 3 billion distinct molecules, one for every base in the genome. Might not be accurate, but it's certainly stable.

preseq's lc_extrap doesn't estimate the total number of distinct molecules, since in a non-parametric setting no unbiased estimator exists. Instead we try to predict the complexity for a fixed hypothetical level of sequencing to evaluate the benefit of additional sequencing and to decide how much sequencing to do by looking at the cost versus benefit. I'm unsure as to what you chose this point to be in the first figure and what you are comparing it too.

Originally posted by bw. View Post

One issue I ran into was that Preseq didn't work for 4 out of 5 samples with 5 million reads (10% downsampling) and 1 out of 5 of the 15 mil. read samples. The error message was "too many iterations, poor sample" - even though it worked for other downsampling %s of these samples.

preseq uses the observed duplicate count histogram to extrapolate the expected number of new distinct reads will be gained with additional sequencing. This will not work if the duplicate count histogram is not sufficiently full, which may be happening in the 2.5M read samples. Though the error message should be different. In the larger samples, the bootstrapping is failing to find acceptable rational function approximations to the complexity curve due to defects that are caused by the random nature of the input. We have made some recent progress on improving this and included an option (-Q) to skip the bootstrapping to perform a single extrapolation on the input histogram, though confidence intervals are then missing but tend to be overly large in our experience. Can you try the failed sample with the newest version? (source-code)

Thank you for your analysis. I hope I did not cause more confusion with my rambling. This is a deep and fascinating topic, though I definitely approach it from a way more mathematical standpoint. There may be some confusion since my language and view of the problem is fundamentally different than a biologist's. I hope my discussion with you will give me insight to your point of view and vice versa. Thank you bw..

**timydaley** · 10-29-2013, 12:49 PM

Oh, one thing I forgot to mention. preseq uses start position to identify duplicates and uses start and end position in paired end mode. Picard, I believe, uses the start position and first 5 bases of the reads () As we discuss in the paper, the method applies regardless of the method to identify duplicates. The choice of identification method is a whole other question, discussed in depth by Kivioja et al. (http://www.nature.com/nmeth/journal/...meth.1778.html).

**peterawe** · 06-04-2014, 01:51 AM

First of all, thanks for a very nice paper with detailed and informative supplementaries. I have a question regarding the application of your method to smallRNA/miRNA sequencing data sets. Naively, I would think that since you state that its applicable to ChIP-seq-data, it should also be useful for miRNA datasets as well, considering their similar extreme distribution properties (top 5 transcripts occupying more than 50% of the entire read-space).

However, since you don't mention it explicitly in either the manual of paper I wonder if you have given this any consideration? More specifically, I want to compare two different library preparation protocols on the same samples to compare how they perform in regards of sensitivity (~ complexity) for similar read depths. Sorry for the long question.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 27 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News