  • preseq: predicting the complexity of genomic sequencing libraries

    It is the pleasure of the Smith lab at USC to announce the publication of the preseq manuscript in Nature Methods, currently available as an Advance Online Publication ( link ).

    The preseq software is a tool designed to predict the complexity of a genomic library, quantified as the number of distinct reads obtained for a hypothetical sequencing depth. It takes as input a small mapped initial sequencing experiment in either BED or BAM format sorted by mapping location. The idea is to use the information gained from the number of times each read is observed to predict the number of new, currently unobserved, reads that will be gained from additional sequencing. The underlying model is taken from capture-recapture statistics and is shown to be applicable to a broad range of sequencing experiments, including RNA-seq, ChIP-seq, BS-seq, and Exome-seq. Our lab and our collaborators currently use preseq as part of our bisulfite sequencing pipeline to prevent deep sequencing of low complexity libraries and to optimise the depth of sequencing.

    For more information or to download the software, please visit the preseq website:

    http://smithlab.usc.edu/software/librarycomplexity
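    To illustrate the idea (a toy simulation of my own, not preseq's actual algorithm): when some molecules in a library are far more abundant than others, the yield of distinct reads grows sublinearly with sequencing depth, which is exactly what the complexity curve captures.

    ```python
    import random

    # Hypothetical library: 1,000 distinct molecules with uneven abundances,
    # so popular molecules are sequenced repeatedly while rare ones are missed.
    random.seed(1)
    library = [i for i in range(1000) for _ in range(random.randint(1, 20))]

    def distinct_at_depth(depth):
        """Sample `depth` reads with replacement; count distinct molecules seen."""
        return len(set(random.choices(library, k=depth)))

    # The distinct yield flattens as depth grows -- the complexity curve.
    curve = {d: distinct_at_depth(d) for d in (500, 2000, 8000)}
    ```

    In the same spirit, c_curve estimates the observed curve by subsampling the input, and lc_extrap extrapolates it beyond the observed depth.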

  • #2
    Problem running with a BAM file

    Hi timydaley,

    I have a problem running your program with a BAM file (Scientific Linux 6.3 = RHEL 6.3).

    GSL 1.15 and BamTools are installed, and your program builds cleanly:
    Code:
    make all BAMTOOLS_ROOT=/home/ws/SW_install/bamtools/bamtools
    Code:
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -c -o continued_fraction.o continued_fraction.cpp -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o lc_extrap lc_extrap.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o continued_fraction.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o c_curve c_curve.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
    but running the executables yields the following errors:

    Code:
    ./c_curve: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
    ./lc_extrap: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
    What am I missing?

    Thank you!


    • #3
      Ah, yes. This is a common problem we encounter with BamTools, since we do not do static linking, with all the problems that entails. I know of two quick solutions: set your LD_LIBRARY_PATH to include BAMTOOLS_ROOT/lib/ (though that can cause problems with other programs), or copy the library files from BAMTOOLS_ROOT/lib/ into a directory already on your LD_LIBRARY_PATH, typically ~/lib/.

      Thank you so much for using our software and I hope it is of use in your projects.



      • #4
        This seems like a very interesting tool, thanks for sharing.



        • #5
          Dear Timothy

          Thank you for developing the method and for implementing and sharing the tool. It is a great opportunity to assess library complexity and to have statistical support for what to expect from deeper sequencing.

          I've run your tool to produce c_curve and lc_extrap output (using a .bam from an RNA-seq experiment). It worked just fine with default parameters.

          I have 2 comments/questions:
          1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute, but I could not find how in the manual.pdf)

          2- Could you help interpret the c_curve output file (-verbose), specifically "DISTINCT COUNTS/MAX COUNT/COUNTS OF 1/MAX TERMS"?
          (my example)
          TOTAL READS = 41159280
          DISTINCT READS = 16910777
          DISTINCT COUNTS = 1258
          MAX COUNT = 16177
          COUNTS OF 1 = 1.1352e+07
          MAX TERMS = 1000
          OBSERVED COUNTS (16178)

          Thank you for your time



          • #6
            Originally posted by kristofit View Post

            1- How can we estimate the library complexity in term of distinct loci identified ? (It seems possible to compute it but I could not find how to do in the manual.pdf)
            To compute estimates in terms of distinct loci identified, you need to produce the counts of duplicate events in terms of the loci of interest. For example, you may be interested in counting distinct exons in an RNA-seq experiment. The UMI would then be the exon at which a read starts (or ends, or within which it is completely contained; either way, the UMI needs to be defined consistently across the initial and full experiment). Duplicate events would be reads that map to the same exon, regardless of whether the reads themselves are duplicates.

            Or take the example from the paper, where we looked at fixed non-overlapping 1kb bins for a ChIP-seq experiment. There we used the mapped starting location of a read to identify the bin to which it belonged. Duplicate events then include both duplicate reads and distinct reads that map to the same bin.

            Once you have the duplicate counts, the newest version of preseq (available on our website smithlab.usc.edu/software/librarycomplexity ) allows you to input them as a text file, one count per line, as detailed in the manual.
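            As a hypothetical illustration (the read positions and variable names here are my own, not from the manual): counting reads per fixed 1 kb bin and writing one count per line could look like this.

            ```python
            from collections import Counter

            # Hypothetical mapped read start positions (chrom, start); in practice
            # these would come from your sorted BED/BAM file.
            reads = [("chr1", 100), ("chr1", 250), ("chr1", 1500),
                     ("chr1", 1501), ("chr1", 1502), ("chr2", 10)]

            BIN = 1000  # fixed non-overlapping 1 kb bins, as in the ChIP-seq example
            per_bin = Counter((chrom, start // BIN) for chrom, start in reads)

            # One duplicate count per line -- the text-file format described above.
            counts = sorted(per_bin.values(), reverse=True)
            for c in counts:
                print(c)
            ```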


            Originally posted by kristofit View Post

            2- could you help to interpret the c_curve out file (-verbose) for "DISTINCT COUNTS/MAX COUNT/COUNTS OF 1/MAX TERMS" ?
            (my example)
            TOTAL READS = 41159280
            DISTINCT READS = 16910777
            DISTINCT COUNTS = 1258
            MAX COUNT = 16177
            COUNTS OF 1 = 1.1352e+07
            MAX TERMS = 1000
            OBSERVED COUNTS (16178)
            Well, you had ~41M mapped reads in the experiment, ~17M distinct reads, ~11M singleton reads, and the read with the largest number of duplicates had 16,177 copies.
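            For reference, most of those verbose fields are summaries of the duplicate-count histogram (how many reads were seen once, twice, and so on); a toy sketch of how they relate (MAX TERMS is an internal setting for the approximation, not derived from the histogram):

            ```python
            from collections import Counter

            # Toy mapped reads, identified by name; duplicates share an identifier.
            reads = ["a", "a", "a", "b", "b", "c", "d"]

            copies = Counter(reads)           # identifier -> number of copies
            hist = Counter(copies.values())   # multiplicity -> number of distinct reads

            total_reads = sum(k * n for k, n in hist.items())  # TOTAL READS
            distinct_reads = sum(hist.values())                # DISTINCT READS
            distinct_counts = len(hist)                        # DISTINCT COUNTS
            max_count = max(hist)                              # MAX COUNT
            counts_of_1 = hist[1]                              # COUNTS OF 1
            ```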

            Thank you for using the software, and if you have any more questions, feel free to contact me.
            Last edited by timydaley; 03-29-2013, 03:22 PM.



            • #7
              Comparison with Picard EstimateLibraryComplexity?

              Hi Tim,

              Congratulations on a very nice paper. I thought you laid out the problem very nicely and provided enough detail about the method to be very convincing. I'm curious how much the estimates from your method differ from those provided by tools commonly used today, such as EstimateLibraryComplexity.jar in Picard. It uses a formula based only on the total number of reads in the sample and the total number of unique reads: http://picard.sourceforge.net/javado...e(long,%20long)

              I'm guessing Picard's estimate is probably even more biased than the ZTNB model, but I don't know how far off it actually is. It'll be very interesting to see how it performs in your comparisons.

              Thanks for the nice paper/software!



              • #8
                Thank you Mr. Zhao. It appears that estimateLibrarySize assumes a simple Lander-Waterman model, which corresponds to a simple Poisson model. The ZTNB model is a much broader class that includes the simple Poisson model (taking alpha -> 0). Therefore, the estimates from such a model can only be more biased than the ZTNB estimates.
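                For what it's worth, under the simple Lander-Waterman/Poisson model the implied library size C satisfies distinct = C * (1 - exp(-total/C)); a sketch of solving that numerically (whether this matches Picard's exact formula is an assumption on my part):

                ```python
                import math

                def poisson_library_size(total_reads, distinct_reads):
                    """Solve distinct = C * (1 - exp(-total/C)) for C by bisection.
                    The left-hand side is increasing in C, so bisection is safe."""
                    f = lambda c: c * (1.0 - math.exp(-total_reads / c)) - distinct_reads
                    lo, hi = float(distinct_reads), float(distinct_reads) * 1e6
                    for _ in range(200):
                        mid = 0.5 * (lo + hi)
                        if f(mid) > 0:
                            hi = mid
                        else:
                            lo = mid
                    return 0.5 * (lo + hi)

                # The numbers from the c_curve output earlier in the thread.
                size = poisson_library_size(41159280, 16910777)
                ```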

                If you have any more questions or would like to discuss the model, feel free to contact me. I am happy to answer any questions that I am able to.



                • #9
                  Ok, so my question is a practical one: our pipeline uses Picard MarkDuplicates and EstimateLibraryComplexity. The estimate may not be very accurate, but it's basically free computationally. Preseq is more accurate, but running it on BAMs with ~100 million paired-end reads takes a long time on my machine. I could downsample the BAM file, or do some other workaround, but knowing exactly how far off Picard's complexity estimate is will help me decide how much effort to put into it. If it's off by 30-40% then it's probably not worth the trouble, but if it's off by 3-4x then I'll have to integrate preseq into the analysis pipeline.



                  • #10
                    100M reads is quite a lot. If you downsample and use 20M or so, you should get extremely accurate results extrapolating pretty much as far as you want to go, while cutting down the computational time significantly.

                    As for the bias of the Poisson, it depends a lot on the library. All libraries will appear worse than they actually are under the Poisson model, with lower-complexity libraries appearing disproportionately worse.



                    • #11
                      Thanks! Good to know the Poisson model always underestimates library complexity. From your paper, the ZTNB sometimes overestimates (Fig. 2B, for example). I wasn't sure if that could also happen with the Poisson model.

                      So the overestimate of the ZTNB must be the result of the second parameter of the negative binomial not being estimated correctly from the data?



                      • #12
                        Dear Timothy

                        Thank you very much for your answer and help. I have used the newest version of preseq (maybe it could be named v1.1). It's great that paired-end data can now be used (option -P). It seems (in my hands) that c_curve (from the newest preseq) using input.bam did not work properly (compiled as 'make all BAMTOOLS_ROOT=$bamtools'). However, input.bed (make all) works fine.

                        Is the most accurate library complexity estimate returned as the last line of the c_curve output file?

                        As lc_extrap takes a very long time to run on 100M reads, you recommend downsampling to 20M from the BAM (or BED) file to get an accurate result. If I take 20M out of 100M from a sorted .bam or .bed, not all chromosomes will be sampled; does that matter?
                        Many thanks in advance



                        • #13
                          Originally posted by kristofit View Post
                          From an ordered .bam or .bed, using 20M out of 100M, all chromosomes will not be sampled, does that matter ?
                          By downsampling, I mean sampling 20M reads at random without replacement from the set of all reads. As you point out, sampling by truncation (taking the first 20M reads) could result in a biased sample. I'm not entirely certain, but I would imagine that different chromosomes may have very different statistical properties.
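                          A minimal sketch of that kind of downsampling (toy data; in practice you would sample records from your BED/BAM and re-sort by mapping location, since preseq expects sorted input):

                          ```python
                          import random

                          def downsample(records, k, seed=0):
                              """Uniform random sample of k records without replacement, re-sorted."""
                              rng = random.Random(seed)
                              return sorted(rng.sample(records, k))

                          # Toy stand-in for 100 mapped reads; keep 20 (the 20M-of-100M idea).
                          reads = [("chr%d" % (i % 5 + 1), i * 37) for i in range(100)]
                          subset = downsample(reads, 20)
                          ```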



                          • #14
                            Hi Timothy,

                            Thanks very much for this tool. I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I get the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? It seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analyses.

                            Thanks in advance!



                            • #15
                              Originally posted by lchong View Post
                              Hi Timothy,
                              I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I'm getting the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? Seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analysis.
                              I have no idea. I am fairly confident that it's a problem with the BAM file, or possibly some interface problem with BamTools. How was the BAM file generated?

