  • samtools mpileup very slow

What are the typical runtimes for samtools mpileup? I have been running it for 16 hours and so far it has only gotten through chr1, chr2 and chr3. I have 3 BAM files averaging about 42.5 GB each.

    Is there a SNP caller with a faster runtime? I like samtools, but I can't wait a week for it to process these files.

  • #2
I always pass in a BED file of the intervals I want to call variants in, using the "-l" option. It speeds things up dramatically, unless you are actually analyzing the whole genome.
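For example, a minimal sketch with the classic samtools 0.1.x / bcftools pipeline (ref.fa, targets.bed and sample.bam are placeholder names):

    samtools mpileup -uf ref.fa -l targets.bed sample.bam | bcftools view -bvcg - > calls.bcf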



    • #3
If you are putting in 3 BAM files of ~42.5 GB each... it's going to take a LONG time, simply because there is so much data to go through. There's probably nothing wrong and nothing you can do. It's just that much data.

      How many reads is that exactly?
      Last edited by chadn737; 03-08-2012, 05:50 AM.



      • #4
        It's > 600 million reads.



        • #5
Most tools working with BAM are likely to be slow. You could give the SNP caller provided with Goby a try. It is much faster than samtools, and can perform local realignment around indels on the fly. Our internal benchmarks show reasonable agreement with other callers.

Since you already have BAM alignments, you will need to convert the BAM files to the Goby format (much smaller). See this tutorial for how to do this on the command line:
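In brief, the conversion uses the sam-to-compact mode (the full invocation appears in post #7 below; file names here are placeholders):

          java -jar goby.jar -m sam-to-compact -i input.bam -o input-sorted --sorted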


You could also get sorted alignments directly in Goby format with GobyWeb. You would need to run a local instance (in this case, the web app supports calling genotypes, or comparing genotypes across groups of samples, directly from the user interface).

          To call genotypes from the command line, see these two links:




          Goby will produce VCF files compatible with most tools (IGV, vcf-tools, bedtools).
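For instance, a sketch of typical downstream handling of such a VCF (assuming bgzip/tabix and bedtools are available; file names are placeholders):

          bgzip calls.vcf
          tabix -p vcf calls.vcf.gz
          bedtools intersect -a calls.vcf.gz -b targets.bed -header > on-target.vcf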

For questions or additional help, please post to the project mailing list/forum at https://groups.google.com/forum/?fro...goby-framework



          • #6
No, reading BAM is definitely not the bottleneck; reading 3 x 42 GB only takes about an hour. What is slow is the actual SNP calling (especially indel calling). You should launch multiple mpileup instances, each calling a non-overlapping region. You can specify the region with "-r".
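For example, a rough sketch of per-chromosome parallelization (assuming the classic samtools 0.1.x / bcftools pipeline and human chromosome names; ref.fa and the BAM names are placeholders):

            for CHR in chr{1..22} chrX chrY; do
                samtools mpileup -uf ref.fa -r $CHR s1.bam s2.bam s3.bam | \
                    bcftools view -bvcg - > calls.$CHR.bcf &
            done
            wait

The per-chromosome BCFs can then be concatenated, e.g. with bcftools cat; launching all chromosomes at once assumes the machine has the cores and I/O to spare.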

There is nothing wrong with samtools or BAM. A long running time is just the price of having a lot of data. There are faster ways, but they are generally less accurate. Learn to parallelize; you will need it all the time in the NGS era.

BTW: Goby is only "much" smaller when one wants to keep multiple hits, which is not the case for variant calling. And the way Goby keeps multiple hits is likely to incur more seeking, which is very slow over a network. I can imagine Goby being an improvement over BAM, since BAM is not really carefully designed, but I do not think the difference is significant.
            Last edited by lh3; 03-08-2012, 05:57 AM.



            • #7
There has been some confusion regarding Goby, the file sizes it produces, and the kind of data it keeps. I realize the confusion is likely due to problems in the conversion tool we provided with earlier versions of Goby (the Goby mode called sam-to-compact). I want to note here that we recently fixed a number of problems with this conversion tool; a much improved version is included in Goby 1.9.8.4.

This being said, I don't understand the part of Heng's comment that describes the behavior of Goby when keeping multiple hits. The Goby format can store multiple best hits if the user chooses to. In this case, alignments that match 1 and up to T locations in the genome are stored in the .entries file. The threshold T is usually determined by limitations of the aligner (e.g., some will not report any matches if a read matches more than 10 locations; others have a configurable threshold). Reads that match more than T locations are recorded in a Goby file with the .tmh extension (because they are not stored in the main alignment file). I am not sure how this behavior would have any effect on seek times.

To help clear up some of the confusion, we have prepared a small benchmark. We obtained simulated paired-end data from GASV (here: http://code.google.com/p/gasv/downlo...me=Example.bam). This file is 80 MB and was converted to Goby format as follows (using 1.9.8.4):

java -Dlog4j.configuration=goby_1.9.8.4/config/log4j-sample.properties -jar goby_1.9.8.4/goby.jar -m sam-to-compact -i Example.bam -o Example-sorted --sorted

              The resulting files are:

              33M Example-sorted.entries
              97B Example-sorted.header
              1.5K Example-sorted.index
              22B Example-sorted.tmh

This alignment can be loaded into IGV because it is sorted. One can verify the type of information preserved in the Goby format (sequence variations, quality scores at mutated positions, pairing, etc.).

              The input file(s) were:

              80M Example.bam
              207K Example.bam.bai

              A longer version of this tiny benchmark is also posted at http://campagnelab.org/bam-goby-tiny...ired-end-data/ with links to the alignment files (useful to load into IGV with open URL).



              • #8
Yes, I realized later that my claims were not all correct. My apologies. Goby achieves a small file size through lossy compression, like Cram.



                • #9
Heng, you are right that Goby is lossy compared to BAM, because we throw out sequence and quality scores in regions that match the reference exactly.

I think the approach is quite different from Cram, because Protocol Buffers (used in Goby) make it much easier to add data to the format. It takes about five minutes to add a new field: you recompile the sources, and both new and old software can still work with the new files (old software won't know about the new fields and will simply ignore that data).

This kind of flexibility is currently impossible to achieve with the approach Cram implements. I believe the jury is still out on an approach that will offer the best of both worlds: format flexibility/extensibility and very strong compression. I think it would be hard to live without format flexibility, given the pace of development in the next-gen field.



                  • #10
I have replied to your other comments on samtools-devel.

I admit that I am not familiar with the internal structure of Cram or Goby, but as a naive end user, Cram seems more mature and more flexible for the time being. In my opinion, a generic format ought to keep the exact CIGAR, the pairing information, all the tags, the read groups and, optionally, all the base qualities, without referring to another file. BAM, BioHDF, Cram and cSRA all achieve this. For variant calling alone we may not need every bit of information, but for other applications we will.

As to lossy compression of base qualities, I have said previously that I do not think throwing away the qualities of all reference-matching bases is right. I much prefer reduced resolution, which is implemented in cSRA and supposedly in Cram as well.
                    Last edited by lh3; 03-16-2012, 06:42 AM.
