Old 02-25-2013, 10:46 AM   #1
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
preseq: predicting the complexity of genomic sequencing libraries

It is the pleasure of the Smith lab at USC to announce the publication of the preseq manuscript in Nature Methods, currently available as an Advance Online Publication (link).

The preseq software is a tool designed to predict the complexity of a genomic library, quantified as the number of distinct reads expected at a hypothetical sequencing depth. It takes as input a small initial sequencing experiment, mapped and sorted by mapping location, in either BED or BAM format. The idea is to use the number of times each read is observed to predict the number of new, currently unobserved reads that would be gained from additional sequencing. The underlying model is taken from capture-recapture statistics and is shown to be applicable to a broad range of sequencing experiments, including RNA-seq, ChIP-seq, BS-seq, and Exome-seq. Our lab and our collaborators currently use preseq as part of our bisulfite sequencing pipeline to prevent deep sequencing of low-complexity libraries and to optimise the depth of sequencing.
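
(A typical invocation might look like the sketch below; the file names are placeholders, and the exact options can differ between preseq versions, so please check the manual for your build. BAM input additionally requires compiling against BamTools, as discussed later in this thread.)

Code:
# complexity curve of the reads observed so far
./c_curve -o complexity_curve.txt input.sorted.bed

# expected yield of distinct reads at larger sequencing depths
./lc_extrap -o future_yield.txt input.sorted.bed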

For more information or to download the software, please visit the preseq website:

http://smithlab.usc.edu/software/librarycomplexity
Old 02-25-2013, 01:05 PM   #2
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Problem running with a BAM file

Hi timydaley,

I have a problem running your program with a BAM file (Scientific Linux 6.3 = RHEL 6.3).

GSL 1.15 and BamTools are installed, and your program builds as well:
Code:
make all BAMTOOLS_ROOT=/home/ws/SW_install/bamtools/bamtools
Code:
g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -c -o continued_fraction.o continued_fraction.cpp -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include
g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o lc_extrap lc_extrap.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o continued_fraction.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o c_curve c_curve.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
But running the executables yields the following error:

Code:
./c_curve: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
./lc_extrap: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
What am I missing?

Thank you!
Old 02-25-2013, 01:23 PM   #3
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Ah, yes. This is a common problem we encounter with BamTools, since we do not do static linking, with all the problems that entails. I know of two quick solutions: set your LD_LIBRARY_PATH to include BAMTOOLS_ROOT/lib/ (but then you may have problems with other programs), or copy all the library files from BAMTOOLS_ROOT/lib/ into a directory already on your LD_LIBRARY_PATH, typically ~/lib/.
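
For example (the paths below are just the ones from your build log; substitute your own):

Code:
# Option 1: point the dynamic linker at the BamTools libraries
export LD_LIBRARY_PATH=/home/ws/SW_install/bamtools/bamtools/lib:$LD_LIBRARY_PATH

# Option 2: copy the shared libraries to a directory already searched
cp /home/ws/SW_install/bamtools/bamtools/lib/libbamtools.so* ~/lib/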

Thank you so much for using our software and I hope it is of use in your projects.
Old 02-25-2013, 02:11 PM   #4
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319

This seems like a very interesting tool, thanks for sharing.
Old 03-29-2013, 04:03 AM   #5
kristofit
Junior Member
 
Location: france

Join Date: Apr 2012
Posts: 3

Dear Timothy

Thank you for developing the method and for implementing and sharing the tool. It is a great opportunity to assess library complexity and to have statistical support for knowing what to expect with deeper sequencing.

I've run your tool to produce c_curve and lc_extrap output (using a .bam from an RNA-seq experiment). It worked just fine with default parameters.

I have 2 comments/questions:
1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute this, but I could not find how in the manual.pdf.)

2- Could you help me interpret the c_curve output file (-verbose) fields "DISTINCT COUNTS / MAX COUNT / COUNTS OF 1 / MAX TERMS"?
(my example)
TOTAL READS = 41159280
DISTINCT READS = 16910777
DISTINCT COUNTS = 1258
MAX COUNT = 16177
COUNTS OF 1 = 1.1352e+07
MAX TERMS = 1000
OBSERVED COUNTS (16178)

Thank you for your time
Old 03-29-2013, 08:41 AM   #6
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Quote:
Originally Posted by kristofit View Post

1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute this, but I could not find how in the manual.pdf.)
To compute estimates in terms of distinct loci identified, you need to produce the counts of duplicate events in terms of the loci of interest. For example, you may be interested in counting distinct exons in an RNA-seq experiment. The UMI would then be the exon at which a read starts (or ends, or within which it is completely contained; either way, the UMI needs to be consistent across the initial and full experiments). Duplicate events would be reads that map to the same exon, regardless of whether they themselves are duplicates.

Or take the example from the paper, where we looked at fixed non-overlapping 1kb bins for a ChIP-seq experiment. There we used the mapped starting location of each read to identify the bin to which it belonged. Duplicate events include both distinct reads that map to the same bin and duplicate reads.

Once you have the duplicate counts, the newest version of preseq (available on our website smithlab.usc.edu/software/librarycomplexity) allows you to input them as a text file, one count per line, as detailed in the manual.
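
As an illustration, one possible recipe with standard tools (bedtools and awk; file names hypothetical, and here a read is identified with any exon it overlaps rather than strictly its start position):

Code:
# one line per exon with the number of reads hitting it,
# keeping only exons that were actually observed
bedtools intersect -a exons.bed -b reads.sorted.bed -c | awk '$NF > 0 {print $NF}' > exon_counts.txt

The resulting single column of positive counts is exactly the one-count-per-line text input described in the manual.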


Quote:
Originally Posted by kristofit View Post

2- Could you help me interpret the c_curve output file (-verbose) fields "DISTINCT COUNTS / MAX COUNT / COUNTS OF 1 / MAX TERMS"?
(my example)
TOTAL READS = 41159280
DISTINCT READS = 16910777
DISTINCT COUNTS = 1258
MAX COUNT = 16177
COUNTS OF 1 = 1.1352e+07
MAX TERMS = 1000
OBSERVED COUNTS (16178)
Well, you had ~41M mapped reads in the experiment, ~17M distinct reads, ~11M singleton reads, and the read with the largest number of duplicates had 16177 copies.

Thank you for using the software, and if you have any more questions, feel free to contact me.

Old 03-30-2013, 03:20 PM   #7
Emo.Zhao
Junior Member
 
Location: US

Join Date: Oct 2011
Posts: 3
Comparison with Picard EstimateLibraryComplexity?

Hi Tim,

Congratulations on a very nice paper; I thought you laid out the problem very nicely and provided enough detail about the method to be very convincing. I'm curious how much the estimates from your method differ from those provided by tools commonly used today, such as EstimateLibraryComplexity.jar in Picard. It uses a formula based only on the total number of reads in the sample and the total number of unique reads: http://picard.sourceforge.net/javado...e(long,%20long)

I'm guessing Picard's estimate is probably even more biased than the ZTNB model, but I don't know how far off it actually is. It'll be very interesting to see how it performs in your comparisons.

Thanks for the nice paper/software!
Old 03-30-2013, 03:29 PM   #8
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Thank you, Mr. Zhao. It appears that estimateLibrarySize assumes a simple Lander-Waterman model, which corresponds to a simple Poisson model. The ZTNB model is a much broader class that includes the simple Poisson model (taking alpha -> 0). Therefore, the estimates from such a model can only be more biased than the ZTNB estimates.
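
For concreteness: if estimateLibrarySize indeed implements the Lander-Waterman estimator (an assumption on my part; the linked javadoc suggests it takes only the total and unique read counts), the estimate amounts to solving

\[ U = X\left(1 - e^{-N/X}\right) \]

for the library size $X$, where $N$ is the total number of reads and $U$ the number of distinct reads observed. The ZTNB generalizes this single-rate Poisson model by mixing the rate over a Gamma distribution; taking the dispersion alpha -> 0 collapses the mixture back to the Poisson above.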

If you have any more questions or would like to discuss the model, feel free to contact me. I am happy to answer any questions that I am able to.
Old 03-30-2013, 03:55 PM   #9
Emo.Zhao
Junior Member
 
Location: US

Join Date: Oct 2011
Posts: 3

OK, so my question is a practical one: our pipeline uses Picard MarkDuplicates and EstimateLibraryComplexity. The estimate may not be very accurate, but it's basically free computationally. preseq is more accurate, but running it on BAMs with ~100 million paired-end reads takes a long time on my machine. I could downsample the BAM file or find some other workaround, but knowing exactly how far off Picard's complexity estimate is will help me decide how much effort to put into it. If it's off by 30-40% then it's probably not worth the trouble, but if it's off by 3-4x then I'll have to integrate preseq into the analysis pipeline.
Old 03-30-2013, 04:02 PM   #10
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

100M reads is quite a lot. If you downsample to 20M or so, you should get extremely accurate results extrapolating pretty much as far as you want to go, while cutting down the computational time significantly.

As far as the bias for the Poisson, it depends a lot on the library. All libraries will appear worse than they actually are under the Poisson model, with poorer complexity libraries appearing even worse than they actually are.
Old 03-30-2013, 04:53 PM   #11
Emo.Zhao
Junior Member
 
Location: US

Join Date: Oct 2011
Posts: 3

Thanks! Good to know the Poisson model always underestimates library complexity. From your paper, the ZTNB sometimes overestimates, Fig. 2B for example. I wasn't sure if that could also happen with the Poisson model.

So the overestimate of the ZTNB must be the result of the second parameter of the negative binomial not being estimated correctly from the data?
Old 04-01-2013, 06:15 AM   #12
kristofit
Junior Member
 
Location: france

Join Date: Apr 2012
Posts: 3

Dear Timothy

Thank you very much for your answer and help. I have used the newest version of preseq (maybe it could be named v1.1). It's great that paired-end data can now be used (option -P). It seems (in my hands) that c_curve from the newest preseq did not work properly with input.bam (compiled as 'make all BAMTOOLS_ROOT=$bamtools'). However, input.bed (make all) works fine.

Is the most accurate library complexity estimate the one returned on the last line of the c_curve output file?

As lc_extrap takes a very long time to run for 100M reads, you recommend downsampling to 20M reads from the BAM (or BED) file to get an accurate result. From a sorted .bam or .bed, using 20M out of 100M, not all chromosomes will be sampled; does that matter?
Many thanks in advance
Old 04-01-2013, 06:43 AM   #13
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Quote:
Originally Posted by kristofit View Post
From a sorted .bam or .bed, using 20M out of 100M, not all chromosomes will be sampled; does that matter?
By downsampling, I mean sampling 20M reads at random without replacement from the set of all reads. As you point out, subsampling by truncation (taking the first 20M reads) could result in a biased sample. I'm not entirely certain, but I would imagine that different chromosomes may have very different statistical properties.
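
For example, newer versions of samtools can draw such a random subsample directly (seed and fraction here are illustrative; the integer part of -s is the seed, the fractional part the proportion kept):

Code:
# keep a random ~20% of the reads (about 20M out of 100M)
samtools view -s 42.20 -b input.bam > input.sub20.bam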
Old 04-01-2013, 08:20 AM   #14
lchong
Junior Member
 
Location: Toronto, Canada

Join Date: Feb 2012
Posts: 2

Hi Timothy,

Thanks very much for this tool. I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I'm getting the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? Seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analysis.

Thanks in advance!
Old 04-01-2013, 09:15 AM   #15
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Quote:
Originally Posted by lchong View Post
Hi Timothy,
I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I'm getting the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? Seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analysis.
I have no idea. I am fairly confident that's a problem with the BAM, or possibly some interface problem with BamTools. How was the BAM file generated?
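
A couple of generic first checks (plain samtools, nothing preseq-specific): in the BAM format a reference ID of -1 normally marks an unmapped read, so it may be worth confirming the header is intact and, as an experiment, dropping unmapped reads before rerunning:

Code:
samtools view -H input.bam | head              # inspect the reference dictionary
samtools flagstat input.bam                    # basic integrity/mapping stats
samtools view -b -F 4 input.bam > mapped.bam   # drop unmapped reads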
Old 04-01-2013, 10:54 AM   #16
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Quote:
Originally Posted by Emo.Zhao View Post
So the overestimate of the ZTNB must be the result of the second parameter of the negative binomial not being estimated correctly from the data?
That is assuming the data is negative binomially distributed, which we know it is not. The ZTNB parameter estimation was naturally tested against ZTNB-distributed data to ensure the EM algorithm worked correctly. We do know that the simple Poisson model will consistently underestimate the complexity, due to the strict concavity of the objective function (1 - e^{-t * lambda}) and an application of Jensen's inequality. The ZTNB will also tend to underestimate, and this has been observed in the capture-recapture literature, but there is no guarantee, since we cannot apply Jensen in that case.
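
To spell out the Jensen step (a sketch of the argument above, not new material): if the per-read rates lambda vary across the library according to a non-degenerate distribution mu, strict concavity of lambda -> 1 - e^{-t lambda} gives

\[ \mathbb{E}_{\lambda \sim \mu}\!\left[ 1 - e^{-t\lambda} \right] \;<\; 1 - e^{-t\,\mathbb{E}_{\mu}[\lambda]}, \]

so a single-rate Poisson fit sits systematically on one side of the heterogeneous truth, which is the consistent underestimate described above. No analogous one-sided bound holds for the ZTNB, which is why its bias direction is a tendency rather than a guarantee.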
Old 04-03-2013, 12:59 PM   #17
kadircaner
Member
 
Location: Houston,TX

Join Date: Jun 2011
Posts: 13

Hi Timothy,

Congrats on your paper and thanks a lot for such a useful tool!!

I have a question about enriched genomic windows: in your paper you said windows could be a better readout than distinct reads for library saturation/complexity.

But you did not mention how one can use your tool to perform such an analysis... Do you use another tool for testing enriched genomic windows, or is there a hidden parameter in the tool?

Thanks in advance...
Old 04-03-2013, 01:53 PM   #18
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Thanks kadircaner.

We did not include this functionality in the software; it was an example of a broader application of the method. We could also, for example, apply the method to predicting the number of exons observed in an RNA-seq experiment, where reads are binned by the exon containing the starting position of the read (reads that don't map to exons are thrown out). Or we could apply the method to predicting distinct alternative splicing events, with reads counted only if a splice is present, and with duplicates including not only duplicate reads but duplicate splicing events. All that matters is that each read is either identified with a single event (i.e. an exon or bin, never more than one event) or is unidentifiable. Analogously, thinking of the species problem discussed in the paper, we could count copies of individuals, in which case we would be making inferences about the number of individuals, or count copies of species, in which case we would be making inferences about the number of species.

If you want to apply this, message me and I can point you to the software we used to bin the reads, which takes as input a mapped and sorted BED file. A command-line argument will then get you the number of duplicates in each bin, and this can be input into the newest version of preseq, which accepts a plain text file of duplicate counts. This should suffice in the meantime, but we will consider including it as an option in future versions. Thank you.
Old 04-17-2013, 10:25 AM   #19
aprilw
Junior Member
 
Location: Princeton, NJ

Join Date: May 2012
Posts: 8
Loading count table

Hi! Great paper!

I have a question about how to use preseq with the table of counts.

I have generated a table from our RNA-Seq experiments that denotes how many reads map to each exon regardless of whether or not the reads are exactly the same.

When I try to load these counts as a text file with the -V option I get the following error:
Code:
INVALID INPUT 0
ERROR: ERROR IN INPUT

I think I'm misunderstanding what you mean by "input is a text file of counts in a single column" and "one count per line" in the manual. What I've been inputting is simply a single column list of counts, for example:

610
0
2554
109
4
6
...

Each row is a separate exon; some exons are not observed in this experiment and/or at this level of coverage, so they have 0 values.

Some clarification as to what exactly I've misunderstood would be much appreciated.

Thanks in advance!
April
Old 04-17-2013, 05:05 PM   #20
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

April,
Thank you for using our software. It seems that you generated the counts correctly, except for including the zero counts. Our software does not take 0 values as input. This is because you can't be sure whether, in your case, you didn't see an exon because it isn't included in any expressed transcript and so isn't in the library, or whether the exon is in the library but was missed due to low sequencing depth or random chance. If you just take the zero counts out, it should work fine.
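
A one-line filter is enough for that (file names hypothetical):

Code:
awk '$1 > 0' exon_counts_with_zeros.txt > exon_counts.txt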