  • how can i speed up bwa?

    hello,

    i do chip-seq on illumina and very much appreciate the information in this forum. this is my first post, and i would like to ask about possible ways to speed up bwa.

    currently, i use a workstation with two xeon e5420 cpus (4 cores each) and 24 gb of memory. it takes about 3 hours to align a single lane of GA reads to the human genome, during which cpu usage remains at 100% on all cores.

    i have seen an attempt to speed up bwa with cuda (called barracuda) by a cambridge group. interestingly, they show that 4 or 8 cpu cores do not make much difference in run time, and the performance of their cuda aligner was not much different from that of bwa with 4-8 cpu cores.

    how much difference do you predict it will make if i

    1. use fast ssd drives (sata3, raid arrays) instead of hdds, because access to the huge sequence data might become a bottleneck

    2. run barracuda on an nvidia tesla S2050, which has 1792 cuda cores, or even on a massively parallel supercomputer (with the necessary optimization of the algorithm), if sequence alignment tasks can be effectively broken up into thousands of parallel processes

    3. optimize the bwa algorithm to make use of 64GB or more of memory and more cpu power (multi-threading) up to 48 or more cores/threads. it is clear that multi-threading does not speed things up in the current bwa. however, i guess it must be possible to assign each of the 24 cpu cores/threads an individual chromosome or a long/short arm, for instance.

    any suggestions would be most appreciated,

  • #2
    Hi, I am not an expert in optimization, but here are my suggestions.

    Originally posted by yujiro:
    it takes about 3 hours to align a single lane of GA reads to the human genome, during which cpu usage remains at 100% on all cores.
    If you watch it run, you will notice that for some of the time it only uses one core, and only when it actually does the alignment does it use all the threads you specified. So there is a step bound to one processor, and having 10 processors will make only the actual alignment faster, not that "other" step.

    Originally posted by yujiro:
    1. use fast ssd drives (sata3, raid arrays) instead of hdds, because access to the huge sequence data might become a bottleneck
    I doubt this is the bottleneck. Simply reading a file that takes 3 hours to align will take a few minutes at most.

    No idea about CUDA

    Originally posted by yujiro:
    3. optimize the bwa algorithm to make use of 64GB or more of memory and more cpu power (multi-threading) up to 48 or more cores/threads. it is clear that multi-threading does not speed things up in the current bwa. however, i guess it must be possible to assign each of the 24 cpu cores/threads an individual chromosome or a long/short arm, for instance.
    More memory should not make a difference. bwa stores the whole indexed genome in about 3GB of RAM, and I doubt having more would help unless you redesign bwa.

    Rather than splitting the reference genome, I would split the input fastq file, align the chunks independently and then merge them back; each read is aligned independently anyway. This way you might gain some time on the step that uses only one processor.
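
    A minimal sketch of that approach, assuming bash, GNU split, single-end reads and classic samtools "view -bS" syntax (hg19.fa and reads.fastq are placeholder names):

    split -l 4000000 reads.fastq chunk_            # 4 lines per read; pick a size giving roughly one chunk per core
    for f in chunk_*; do
        ( bwa aln hg19.fa "$f" > "$f.sai"
          bwa samse hg19.fa "$f.sai" "$f" | samtools view -bS - > "$f.bam" ) &
    done
    wait                                           # wait for all chunks to finish
    samtools merge merged.bam chunk_*.bam          # merge the per-chunk alignments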

    If anybody knows more, please let me know. I am also interested in understanding how to get the most out of my CPUs



    • #3
      thanks

      hi stefanoberri,

      thanks a lot for your suggestions. i got your points about the ssd drives and the memory size. i particularly like your idea of splitting the read sequences into smaller files, since i can do that with a short script without touching bwa itself.

      to rephrase my question about the memory size, i wonder if things would get faster by using uncompressed genomic sequences. if i understand correctly, the burrows-wheeler transform serves two purposes here: one is to compress the data down to a few GB, the other is to build the suffix trie that is used for finding substrings that match the query. i do not know how the burrows-wheeler compression and decompression steps are integrated with the smith-waterman algorithm in bwa. if you have a lot of memory and do not need to compress the reference sequences, would you not save time by skipping the decompression step?

      thanks a lot



      • #4
        Hi. I don't think bwa compresses/decompresses data.
        The suffix trie is a way to keep the whole genome indexed in memory efficiently, so that it fits in 3GB of RAM. Maybe they made some decisions that trade size against efficiency, but I don't think there is any compression step (like there is in a bam file, for instance) that you could skip.



        • #5
          hi stefanoberri,

          thanks for your comments. if you could kindly have a look at Li and Durbin's original bwa paper (Bioinformatics, 2009), they mention in section 2.6 ("reducing memory") that the inverse compressed suffix array (CSA) is obtained from the occurrence array, and that the suffix array is then calculated from the inverse CSA.

          by doing so they reduce the memory requirement from n[log2 n] to 4n + n[log2 n]/8. compression of the suffix array might be an integral part of the burrows-wheeler index, but i wonder if these calculations can be skipped.

          thanks,



          • #6
            Anyone can feel free to correct me - I may not be totally correct in this:

            The thing with BWA is that it only runs on one processor at a time. So even though you have two processors, each with 4 cores (for 8 cores total), BWA will only run on one processor, and thus multithreading will be maxed out at 4 threads.

            You can specify 8 threads, but I'm guessing 4 of the 8 threads will be spawned in the master process and executed in a pseudo-threaded manner. Using 8 threads will then, in effect, be slower than 4 threads, as 4 of the threads aren't executing simultaneously and will just add contention around sequence distribution.

            So multithreading actually *does* speed up BWA. 8 threads _will_ speed up more than 4, as long as you are running BWA on an 8-core processor instead of a 4-core processor.
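
            For reference, the thread count is just the -t option to 'bwa aln' (hg19.fa and reads.fastq below are placeholder names); the samse/sampe step has no -t and runs single-threaded:

            bwa aln -t 8 hg19.fa reads.fastq > reads.sai          # 8 alignment threads
            bwa samse hg19.fa reads.sai reads.fastq > reads.sam   # single-threaded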



            • #7
              Okay, I just tested this theory and I think I'm correct. I performed 'aln' on 25 million Illumina reads on a 24-core machine with 12, 24, and 48 threads. Here are my results:

              12 Threads - 2:44
              24 Threads - 1:34
              48 Threads - 1:59
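
              A rough sketch of how such a comparison can be scripted in bash ('time' is the shell keyword; reference and read file names are placeholders):

              for t in 12 24 48; do
                  echo "== $t threads =="
                  time bwa aln -t "$t" hg19.fa reads.fastq > "reads_${t}.sai"
              done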



              • #8
                See Amdahl's Law: http://en.wikipedia.org/wiki/Amdahl%27s_law
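
                To make that concrete (the 90% figure below is purely illustrative, not measured from bwa): Amdahl's law caps the overall speed-up at 1 / ((1 - p) + p/N), where p is the parallelisable fraction of the run and N is the thread count. If 90% of a bwa run parallelises, 24 threads give at most 1 / (0.1 + 0.9/24), about 7.3x, and 48 threads only about 8.4x, even before the overhead of oversubscribing 24 physical cores.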



                • #9
                  If you have the source, edit the makefile to get rid of "-g" (turn debugging off) and bump -O2 up to -O3. Sometimes -Os alone (optimize for size) does the trick: it keeps the code small enough to stay in cache, and keeping as much as possible in the L1 or L2 cache is a big win on a modern CPU. This might get you a little boost. Also, if you have access to the Intel C compiler, you might want to use that.
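
                  A minimal sketch of that edit, assuming the stock bwa Makefile defines "CFLAGS= -g -Wall -O2" on a single line (check your copy first; GNU sed syntax):

                  sed -i 's/^CFLAGS=.*/CFLAGS= -Wall -O3/' Makefile   # drop -g, raise the optimisation level
                  make clean && make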

                  What works for me is to keep threads at one (1) but launch 4 bwa processes (or as many processes as you have cores on the machine) at once. Example: split the input fasta into 4 files and do this at the command line:

                  # job1..job4 are small scripts that each run bwa aln on one of the four chunks
                  ./job1 &
                  ./job2 &
                  ./job3 &
                  ./job4 &
                  wait
                  echo "did all 4, dude, now ... check results"



                  • #10
                    thanks

                    hi guys,

                    thanks a lot for the insightful comments. it is a bit puzzling why multithreading should not work across multiple cpus, but i will have a look at the source code. for the time being, splitting the input files by the number of cpus or threads will save me a lot of time.



                    • #11
                      Whether or not multithreading works across multiple nodes depends on how your system hardware works. Threads need to share RAM and global variables, so if your nodes each have their own RAM, then a single process's threads cannot be spread across nodes. I'm guessing that if your nodes had some sort of shared memory space then it would be possible to utilize all 8 cores, but I'm no expert on computer architecture, so I couldn't tell you how to go about looking into this.



                      • #12
                        I've just run a quick bwa aln test on a random fastq file, and found that, on my system at least, bwa benefits from multiple cores spread across two processors and from hyperthreading.

                        The workstation has 2 Intel x5690 processors, each with 6 cores, so a total of 12 cores. With hyperthreading enabled in the BIOS I get 24 'virtual cores'. Memory node interleaving is currently set to SMP mode - I haven't tested NUMA. I'm running a recent x64 Linux kernel. Time to complete job:

                        00:13:44 - 6 threads - HT disabled
                        00:07:45 - 12 threads - HT disabled
                        00:07:49 - 24 threads - HT disabled

                        00:14:05 - 6 threads - HT enabled
                        00:07:42 - 12 threads - HT enabled
                        00:05:33 - 24 threads - HT enabled

                        So on this system it's best to use as many threads as there are cores (or virtual cores with hyperthreading enabled) for the bwa aln step. Of course things will get much more complicated if I want to optimise for an entire pipeline with several single thread bottlenecks!



                        • #13
                          Makes sense - since you're able to specify a shared-memory mode, that breaks down the barriers that separate nodes would otherwise put in place.



                          • #14
                            If anything, what surprised me was the apparently significant extra benefit of enabling hyperthreading, which I'd been sceptical about. I'll have to see if this helps with GATK, etc.



                            • #15
                              Obviously, since your process is CPU-bound (it is using all cores), faster hard drives shouldn't help, and neither should more threads. Giving bwa more RAM could help, I guess, but in my experience bwa makes good use of RAM anyway.

                              Poor read quality slows down bwa because it has to work harder. One way to speed things up is to clean up your fastq file before feeding it to the aligner (removing Ns, low-quality sequence tails, adapter/primer reads, etc.).
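
                              As a minimal sketch of just one of those clean-up steps (dropping any read whose sequence contains an N; a real pipeline would use a dedicated trimmer, and the file names are placeholders):

                              # keep only 4-line fastq records whose sequence line contains no N
                              awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0} NR%4==0 && s !~ /N/ {print h "\n" s "\n" p "\n" $0}' reads.fastq > reads.clean.fastq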

                              And of course, you can just use more than 1 machine.
                              Last edited by earonesty; 04-15-2011, 12:41 PM.

