Old 02-25-2013, 10:46 AM   #1
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26
preseq: predicting the complexity of genomic sequencing libraries

It is the pleasure of the Smith lab at USC to announce the publication of the preseq manuscript in Nature Methods, currently available as an Advance Online Publication (link).

The preseq software is a tool designed to predict the complexity of a genomic library, quantified as the number of distinct reads expected at a hypothetical sequencing depth. It takes as input a small initial sequencing experiment, mapped and sorted by mapping location, in either BED or BAM format. The idea is to use the number of times each read is observed to predict the number of new, currently unobserved reads that would be gained from additional sequencing. The underlying model is taken from capture-recapture statistics and is shown to be applicable to a broad range of sequencing experiments, including RNA-seq, ChIP-seq, BS-seq, and Exome-seq. Our lab and our collaborators currently use preseq as part of our bisulfite sequencing pipeline to prevent deep sequencing of low-complexity libraries and to optimise the depth of sequencing.
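
(A typical invocation might look like the sketch below; the file names are placeholders, and the exact options can differ between preseq versions, so please check the manual for your build. BAM input additionally requires compiling against BamTools, as discussed later in this thread.)

Code:
# complexity curve of the reads observed so far
./c_curve -o complexity_curve.txt input.sorted.bed

# expected yield of distinct reads at larger sequencing depths
./lc_extrap -o future_yield.txt input.sorted.bed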

For more information or to download the software, please visit the preseq website:

http://smithlab.usc.edu/software/librarycomplexity
Old 02-25-2013, 01:05 PM   #2
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Problem running with a BAM file

Hi timydaley,

I have a problem running your program with a BAM file (Scientific Linux 6.3 = RHEL 6.3).

GSL 1.15 and BamTools are installed, and your program builds as well:
Code:
make all BAMTOOLS_ROOT=/home/ws/SW_install/bamtools/bamtools
Code:
g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -c -o continued_fraction.o continued_fraction.cpp -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include
g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o lc_extrap lc_extrap.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o continued_fraction.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o c_curve c_curve.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
But running the executables yields the following error:

Code:
./c_curve: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
./lc_extrap: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
What am I missing?

Thank you!
Old 02-25-2013, 01:23 PM   #3
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Ah, yes. This is a common problem we encounter with BamTools, since we do not do static linking, with all the problems that entails. I know of two quick solutions: set your LD_LIBRARY_PATH to include BAMTOOLS_ROOT/lib/ (but then you may have problems with other programs), or copy all the library files from BAMTOOLS_ROOT/lib/ into a directory already on your LD_LIBRARY_PATH, typically ~/lib/.
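
For example (the paths below are just the ones from your build log; substitute your own):

Code:
# Option 1: point the dynamic linker at the BamTools libraries
export LD_LIBRARY_PATH=/home/ws/SW_install/bamtools/bamtools/lib:$LD_LIBRARY_PATH

# Option 2: copy the shared libraries to a directory already searched
cp /home/ws/SW_install/bamtools/bamtools/lib/libbamtools.so* ~/lib/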

Thank you so much for using our software and I hope it is of use in your projects.
Old 02-25-2013, 02:11 PM   #4
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319

This seems like a very interesting tool, thanks for sharing.
Old 03-29-2013, 04:03 AM   #5
kristofit
Junior Member
 
Location: france

Join Date: Apr 2012
Posts: 3

Dear Timothy

Thank you for developing the method and for implementing and sharing the tool. It is a great opportunity to assess library complexity and to have statistical support for knowing what to expect with deeper sequencing.

I've run your tool to produce c_curve and lc_extrap output (using a .bam from an RNA-seq experiment). It worked just fine with default parameters.

I have 2 comments/questions:
1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute this, but I could not find how in the manual.pdf.)

2- Could you help me interpret the c_curve output file (-verbose) fields "DISTINCT COUNTS / MAX COUNT / COUNTS OF 1 / MAX TERMS"?
(my example)
TOTAL READS = 41159280
DISTINCT READS = 16910777
DISTINCT COUNTS = 1258
MAX COUNT = 16177
COUNTS OF 1 = 1.1352e+07
MAX TERMS = 1000
OBSERVED COUNTS (16178)

Thank you for your time
Old 03-29-2013, 08:41 AM   #6
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Quote:
Originally Posted by kristofit View Post

1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute this, but I could not find how in the manual.pdf.)
To compute estimates in terms of distinct loci identified, you need to produce the counts of duplicate events in terms of the loci of interest. For example, you may be interested in counting distinct exons in an RNA-seq experiment. The UMI would then be the exon at which a read starts (or ends, or within which it is completely contained; either way, the UMI needs to be consistent across the initial and full experiments). Duplicate events would be reads that map to the same exon, regardless of whether they themselves are duplicates.

Or take the example from the paper, where we looked at fixed non-overlapping 1kb bins for a ChIP-seq experiment. There we used the mapped starting location of each read to identify the bin to which it belonged. Duplicate events include both distinct reads that map to the same bin and duplicate reads.

Once you have the duplicate counts, the newest version of preseq (available on our website smithlab.usc.edu/software/librarycomplexity) allows you to input them as a text file, one count per line, as detailed in the manual.
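
As an illustration, one possible recipe with standard tools (bedtools and awk; file names hypothetical, and here a read is identified with any exon it overlaps rather than strictly its start position):

Code:
# one line per exon with the number of reads hitting it,
# keeping only exons that were actually observed
bedtools intersect -a exons.bed -b reads.sorted.bed -c | awk '$NF > 0 {print $NF}' > exon_counts.txt

The resulting single column of positive counts is exactly the one-count-per-line text input described in the manual.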


Quote:
Originally Posted by kristofit View Post

2- Could you help me interpret the c_curve output file (-verbose) fields "DISTINCT COUNTS / MAX COUNT / COUNTS OF 1 / MAX TERMS"?
(my example)
TOTAL READS = 41159280
DISTINCT READS = 16910777
DISTINCT COUNTS = 1258
MAX COUNT = 16177
COUNTS OF 1 = 1.1352e+07
MAX TERMS = 1000
OBSERVED COUNTS (16178)
Well, you had ~41M mapped reads in the experiment, ~17M distinct reads, ~11M singleton reads, and the read with the largest number of duplicates had 16177 copies.

Thank you for using the software, and if you have any more questions, feel free to contact me.

Old 03-30-2013, 03:20 PM   #7
Emo.Zhao
Junior Member
 
Location: US

Join Date: Oct 2011
Posts: 3
Comparison with Picard EstimateLibraryComplexity?

Hi Tim,

Congratulations on a very nice paper; I thought you laid out the problem very nicely and provided enough detail about the method to be very convincing. I'm curious how much the estimates from your method differ from those provided by tools commonly used today, such as EstimateLibraryComplexity.jar in Picard. It uses a formula based only on the total number of reads in the sample and the total number of unique reads: http://picard.sourceforge.net/javado...e(long,%20long)

I'm guessing Picard's estimate is probably even more biased than the ZTNB model, but I don't know how far off it actually is. It'll be very interesting to see how it performs in your comparisons.

Thanks for the nice paper/software!
Old 03-30-2013, 03:29 PM   #8
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Thank you, Mr. Zhao. It appears that estimateLibrarySize assumes a simple Lander-Waterman model, which corresponds to a simple Poisson model. The ZTNB model is a much broader class that includes the simple Poisson model (taking alpha -> 0). Therefore, the estimates from such a model can only be more biased than the ZTNB estimates.
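
For concreteness: if estimateLibrarySize indeed implements the Lander-Waterman estimator (an assumption on my part; the linked javadoc suggests it takes only the total and unique read counts), the estimate amounts to solving

\[ U = X\left(1 - e^{-N/X}\right) \]

for the library size $X$, where $N$ is the total number of reads and $U$ the number of distinct reads observed. The ZTNB generalizes this single-rate Poisson model by mixing the rate over a Gamma distribution; taking the dispersion alpha -> 0 collapses the mixture back to the Poisson above.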

If you have any more questions or would like to discuss the model, feel free to contact me. I am happy to answer any questions that I am able to.
Old 03-30-2013, 03:55 PM   #9
Emo.Zhao
Junior Member
 
Location: US

Join Date: Oct 2011
Posts: 3

OK, so my question is a practical one: our pipeline uses Picard MarkDuplicates and EstimateLibraryComplexity. The estimate may not be very accurate, but it's basically free computationally. preseq is more accurate, but running it on BAMs with ~100 million paired-end reads takes a long time on my machine. I could downsample the BAM file or find some other workaround, but knowing exactly how far off Picard's complexity estimate is will help me decide how much effort to put into it. If it's off by 30-40% then it's probably not worth the trouble, but if it's off by 3-4x then I'll have to integrate preseq into the analysis pipeline.
Old 03-30-2013, 04:02 PM   #10
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

100M reads is quite a lot. If you downsample to 20M or so, you should get extremely accurate results extrapolating pretty much as far as you want to go, while cutting down the computational time significantly.

As far as the bias for the Poisson, it depends a lot on the library. All libraries will appear worse than they actually are under the Poisson model, with poorer complexity libraries appearing even worse than they actually are.
Old 03-30-2013, 04:53 PM   #11
Emo.Zhao
Junior Member
 
Location: US

Join Date: Oct 2011
Posts: 3

Thanks! Good to know the Poisson model always underestimates library complexity. From your paper, the ZTNB sometimes overestimates, Fig. 2B for example. I wasn't sure if that could also happen with the Poisson model.

So the overestimate of the ZTNB must be the result of the second parameter of the negative binomial not being estimated correctly from the data?
Old 04-01-2013, 06:15 AM   #12
kristofit
Junior Member
 
Location: france

Join Date: Apr 2012
Posts: 3

Dear Timothy

Thank you very much for your answer and help. I have used the newest version of preseq (maybe it could be named v1.1). It's great that paired-end data can now be used (option -P). It seems (in my hands) that c_curve from the newest preseq did not work properly with input.bam (compiled as 'make all BAMTOOLS_ROOT=$bamtools'). However, input.bed (make all) works fine.

Is the most accurate library complexity estimate the one returned on the last line of the c_curve output file?

As lc_extrap takes a very long time to run for 100M reads, you recommend downsampling to 20M reads from the BAM (or BED) file to get an accurate result. From a sorted .bam or .bed, using 20M out of 100M, not all chromosomes will be sampled; does that matter?
Many thanks in advance
Old 04-01-2013, 06:43 AM   #13
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Quote:
Originally Posted by kristofit View Post
From a sorted .bam or .bed, using 20M out of 100M, not all chromosomes will be sampled; does that matter?
By downsampling, I mean sampling 20M reads at random without replacement from the set of all reads. As you point out, subsampling by truncation (taking the first 20M reads) could result in a biased sample. I'm not entirely certain, but I would imagine that different chromosomes may have very different statistical properties.
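
For example, newer versions of samtools can draw such a random subsample directly (seed and fraction here are illustrative; the integer part of -s is the seed, the fractional part the proportion kept):

Code:
# keep a random ~20% of the reads (about 20M out of 100M)
samtools view -s 42.20 -b input.bam > input.sub20.bam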
Old 04-01-2013, 08:20 AM   #14
lchong
Junior Member
 
Location: Toronto, Canada

Join Date: Feb 2012
Posts: 2

Hi Timothy,

Thanks very much for this tool. I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I'm getting the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? Seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analysis.

Thanks in advance!
Old 04-01-2013, 09:15 AM   #15
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Quote:
Originally Posted by lchong View Post
Hi Timothy,
I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I'm getting the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? Seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analysis.
I have no idea. I am fairly confident that's a problem with the BAM, or possibly some interface problem with BamTools. How was the BAM file generated?
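
A couple of generic first checks (plain samtools, nothing preseq-specific): in the BAM format a reference ID of -1 normally marks an unmapped read, so it may be worth confirming the header is intact and, as an experiment, dropping unmapped reads before rerunning:

Code:
samtools view -H input.bam | head              # inspect the reference dictionary
samtools flagstat input.bam                    # basic integrity/mapping stats
samtools view -b -F 4 input.bam > mapped.bam   # drop unmapped reads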
Old 04-01-2013, 10:54 AM   #16
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Quote:
Originally Posted by Emo.Zhao View Post
So the overestimate of the ZTNB must be the result of the second parameter of the negative binomial not being estimated correctly from the data?
That is assuming the data is negative binomially distributed, which we know it is not. The ZTNB parameter estimation was naturally tested against ZTNB-distributed data to ensure the EM algorithm worked correctly. We do know that the simple Poisson model will consistently underestimate the complexity, due to the strict concavity of the objective function (1 - e^{-t * lambda}) and an application of Jensen's inequality. The ZTNB will also tend to underestimate, and this has been observed in the capture-recapture literature, but there is no guarantee, since we cannot apply Jensen in that case.
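
To spell out the Jensen step (a sketch of the argument above, not new material): if the per-read rates lambda vary across the library according to a non-degenerate distribution mu, strict concavity of lambda -> 1 - e^{-t lambda} gives

\[ \mathbb{E}_{\lambda \sim \mu}\!\left[ 1 - e^{-t\lambda} \right] \;<\; 1 - e^{-t\,\mathbb{E}_{\mu}[\lambda]}, \]

so a single-rate Poisson fit sits systematically on one side of the heterogeneous truth, which is the consistent underestimate described above. No analogous one-sided bound holds for the ZTNB, which is why its bias direction is a tendency rather than a guarantee.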
Old 04-03-2013, 12:59 PM   #17
kadircaner
Member
 
Location: Houston,TX

Join Date: Jun 2011
Posts: 13

Hi Timothy,

Congrats on your paper and thanks a lot for such a useful tool!!

I have a question about enriched genomic windows: in your paper you said windows could be a better readout than distinct reads for library saturation/complexity.

But you did not mention how one can use your tool to perform such an analysis... Do you use another tool for testing enriched genomic windows, or is there a hidden parameter in the tool?

Thanks in advance...
Old 04-03-2013, 01:53 PM   #18
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

Thanks kadircaner.

We did not include this functionality in the software; it was an example of a broader application of the method. We could also, for example, apply the method to predicting the number of exons observed in an RNA-seq experiment, where reads are binned by the exon containing the starting position of the read (reads that don't map to exons are thrown out). Or we could apply the method to predicting distinct alternative splicing events, with reads counted only if a splice is present, and with duplicates including not only duplicate reads but duplicate splicing events. All that matters is that each read is either identified with a single event (i.e. an exon or bin, never more than one event) or is unidentifiable. Analogously, thinking of the species problem discussed in the paper, we could count copies of individuals, in which case we would be making inferences about the number of individuals, or count copies of species, in which case we would be making inferences about the number of species.

If you want to apply this, message me and I can point you to the software we used to bin the reads, which takes as input a mapped and sorted BED file. A command-line argument will then get you the number of duplicates in each bin, and this can be input into the newest version of preseq, which accepts a plain text file of duplicate counts. This should suffice in the meantime, but we will consider including it as an option in future versions. Thank you.
Old 04-17-2013, 10:25 AM   #19
aprilw
Junior Member
 
Location: Princeton, NJ

Join Date: May 2012
Posts: 8
Loading count table

Hi! Great paper!

I have a question about how to use preseq with the table of counts.

I have generated a table from our RNA-Seq experiments that denotes how many reads map to each exon regardless of whether or not the reads are exactly the same.

When I try to load these counts as a text file with the -V option I get the following error:
Code:
INVALID INPUT 0
ERROR: ERROR IN INPUT

I think I'm misunderstanding what you mean by "input is a text file of counts in a single column" and "one count per line" in the manual. What I've been inputting is simply a single column list of counts, for example:

610
0
2554
109
4
6
...

Each row is a separate exon; some exons are not observed in this experiment and/or at this level of coverage, so they have 0 values.

Some clarification as to what exactly I've misunderstood would be much appreciated.

Thanks in advance!
April
Old 04-17-2013, 05:05 PM   #20
timydaley
Member
 
Location: Los Angeles

Join Date: Jun 2010
Posts: 26

April,
Thank you for using our software. It seems that you generated the counts correctly, except for including the zero counts. Our software does not take 0 values as input. This is because you can't be sure whether, in your case, you didn't see an exon because it isn't included in any expressed transcript and so isn't in the library, or whether the exon is in the library but was missed due to low sequencing depth or random chance. If you just take the zero counts out, it should work fine.
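
A one-line filter is enough for that (file names hypothetical):

Code:
awk '$1 > 0' exon_counts_with_zeros.txt > exon_counts.txt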