  • yujiro
    Junior Member
    • Jul 2010
    • 5

    how can i speed up bwa?

    hello,

    i do chip-seq on illumina and very much appreciate information in this forum. this is my first post and i would like to ask you possible ways to speed up bwa.

    currently, i use a workstation with xeon e5420 (4 core) x 2 and 24 gb memory. it takes about 3 hours to align single-lane reads from GA to the human genome, during which cpu usage remains 100% for all cores.

    i have seen attempts to speed up bwa by cuda (called barracuda) by a cambridge group. interestingly, they show that 4 or 8 cpu cores do not make much difference in run time. performance of their cuda version aligner was not much different from that of bwa with 4-8 cpu cores.

    how much difference do you predict it will make if i

    1. use fast ssd drives (sata3, raid arrays) instead of hdd, because access to huge sequence data might become a bottleneck

    2. run barracuda on nvidia tesla S2050 which has 1792 cuda cores or even on a massively parallel supercomputer (with necessary optimization of the algorithm), if sequence alignment tasks can be effectively broken up into thousands of parallel processes

    3. optimize bwa algorithm to make use of memories up to 64GB or more and cpu powers (multi-threading) up to 48 or more cores/threads. it is clear that multi-threading does not speed up things in the current bwa. however, i guess it must be possible to assign each of 24 cpu cores/threads an individual chromosome or long/short arm, for instance.

    any suggestions would be much appreciated,
  • stefanoberri
    Member
    • Jan 2010
    • 35

    #2
    Hi, I am not an expert in optimization, but here are my suggestions:

    Originally posted by yujiro
    it takes about 3 hours to align single-lane reads from GA to the human genome, during which cpu usage remains 100% for all cores.
    If you look, you will notice that for some of the time it uses only 1 core, and then, when it actually does the alignment, it uses all the cores you specified. So there is a step bound to one processor. Having 10 processors will speed up only the actual alignment, not the "other" step.

    Originally posted by yujiro
    1. use fast ssd drives (sata3, raid arrays) instead of hdd, because access to huge sequence data might become a bottleneck
    I doubt this is the bottleneck. Simply reading the file that takes 3 hours to align will take a few minutes at most.

    No idea about CUDA

    Originally posted by yujiro
    3. optimize bwa algorithm to make use of memories up to 64GB or more and cpu powers (multi-threading) up to 48 or more cores/threads. it is clear that multi-threading does not speed up things in the current bwa. however, i guess it must be possible to assign each of 24 cpu cores/threads an individual chromosome or long/short arm, for instance.
    More memory should not make a difference. It stores the whole indexed genome in 3 GB of RAM. I doubt having more would help unless you redesign bwa.

    Rather than splitting the reference genome, I would split the input fastq file and align them independently and then merge them back. Each sequence is aligned independently. This way you might gain some time when it is using only one processor
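
A minimal sketch of the splitting idea, assuming plain 4-line FASTQ records (no wrapped sequence lines, no trailing blank lines). The function name `split_fastq` and the `chunk_N.fastq` output names are made up for illustration; they are not part of bwa.

```python
import itertools
from pathlib import Path


def split_fastq(path, n_chunks, out_prefix="chunk"):
    """Distribute 4-line FASTQ records round-robin across n_chunks files."""
    out_paths = [Path(f"{out_prefix}_{i}.fastq") for i in range(n_chunks)]
    outs = [p.open("w") for p in out_paths]
    try:
        with open(path) as fh:
            for i in itertools.count():
                # Read the next record: header, sequence, '+', qualities.
                record = list(itertools.islice(fh, 4))
                if not record:
                    break
                if len(record) != 4:
                    raise ValueError("truncated FASTQ record")
                outs[i % n_chunks].writelines(record)
    finally:
        for out in outs:
            out.close()
    return out_paths
```

Calling `split_fastq("reads.fastq", 8)` would write `chunk_0.fastq` through `chunk_7.fastq`. Each chunk can then be aligned on its own and the results merged afterwards; round-robin distribution keeps the chunks the same size to within one read.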

    If anybody knows more, please let me know. I am also interested in understanding how to get the most out of my CPUs

    • yujiro
      Junior Member
      • Jul 2010
      • 5

      #3
      thanks

      hi stefanoberri,

      thanks a lot for your suggestions. i got your points about ssd drive and memory size. i particularly like your idea of splitting read sequences into smaller files since i can do it by writing a short script without touching bwa itself.

      to rephrase my question about the memory size, i wonder if things get faster by using uncompressed genomic sequences. burrows wheeler transform serves two purposes here, if i understand correctly, one is to compress data to a few GB, another is to generate suffix trie which is used for finding substrings that match the query. i do not know how burrows wheeler compression and decompression processes are integrated with smith waterman algorithm in bwa. if you have a lot of memory and do not have to compress reference sequences, will you not save time by skipping the decompression process?

      thanks a lot

      • stefanoberri
        Member
        • Jan 2010
        • 35

        #4
        Hi. I don't think bwa compress/decompress data.
        The suffix trie is a way to have the whole genome indexed in memory efficiently, so that it can fit in 3 GB of RAM. Maybe they made some decisions that trade off size against efficiency, but I don't think there is any compression involved (like there is in the bam file, for instance) that you could skip.

        • yujiro
          Junior Member
          • Jul 2010
          • 5

          #5
          hi stefanoberri,

          thanks for your comments. if you could kindly have a look at Li and Durbin's original bwa paper,



          they mention in section 2.6, "reducing memory", that the inverse compressed suffix array (CSA) is obtained from the occurrence array and that the suffix array is calculated from the inverse CSA.

          by so doing they reduce the memory requirement from n⌈log2 n⌉ to 4n + n⌈log2 n⌉/8 bits. compression of the suffix array might be an integral part of burrows wheeler, but i wonder if these calculations can be skipped.
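
To put numbers on the two expressions quoted from the paper, here is a small worked calculation. It assumes the square brackets mean the ceiling of log2 n, that both expressions count bits, and that n is about 3 billion bases for the human genome (a round figure chosen for illustration). The helper names are made up.

```python
def ceil_log2(n):
    """Ceiling of log2(n) for a positive integer, using exact integer math."""
    return (n - 1).bit_length()


def full_suffix_array_bits(n):
    """Bits needed to hold the whole suffix array: n * ceil(log2 n)."""
    return n * ceil_log2(n)


def reduced_index_bits(n):
    """Bits quoted for the reduced index: 4n + n * ceil(log2 n) / 8."""
    return 4 * n + n * ceil_log2(n) // 8


n = 3_000_000_000  # assumed genome size in bases

for label, bits in [
    ("full suffix array", full_suffix_array_bits(n)),
    ("reduced index", reduced_index_bits(n)),
]:
    print(f"{label}: {bits // 8 / 1e9:.1f} GB")
```

Under these assumptions the full suffix array alone comes to about 12 GB, while the reduced index comes to about 3 GB (one byte per base), which agrees with the 3 GB figure mentioned earlier in the thread.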

          thanks,

          • dp05yk
            Member
            • Dec 2010
            • 66

            #6
            Anyone can feel free to correct me - I may not be totally correct in this:

            The thing with BWA is that it only runs on one processor at a time. So even though you have two processors, each with 4 cores (for 8 cores total), BWA will only run on one processor, and thus multithreading will be maxed out at 4 threads.

            You can specify 8 threads, but I'm guessing 4 of the 8 threads will be spawned in the master process and will be executed in a pseudo-threaded manner. 8 threads will then, in effect, be slower than 4 threads, as 4 of the threads aren't being executed simultaneously and will increase competition and blockage over sequence distribution.

            So multithreading actually *does* speed up BWA. 8 threads _will_ give more speedup than 4, as long as you are running BWA on an 8-core processor instead of a 4-core processor.

            • dp05yk
              Member
              • Dec 2010
              • 66

              #7
              Okay, I just tested this theory and I think I'm correct. I performed 'aln' for 25 million Illumina reads on a 24-core processor with 12, 24, and 48 threads respectively. Here are my results:

              12 Threads - 2:44
              24 Threads - 1:34
              48 Threads - 1:59

              • nilshomer
                Nils Homer
                • Nov 2008
                • 1283

                #8
                See Amdahl's Law: http://en.wikipedia.org/wiki/Amdahl%27s_law
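
For reference, Amdahl's Law says that if a fraction p of the run time can be parallelised over n workers, the best possible speedup is 1 / ((1 - p) + p / n). A minimal sketch follows; the parallel fraction of 0.9 is a hypothetical value for illustration, not something measured for bwa.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on speedup when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Hypothetical example: 90% of the run time is parallelisable.
for n in (4, 8, 24, 48):
    print(f"{n:2d} workers: at most {amdahl_speedup(0.9, n):.2f}x")
```

With any serial portion at all, the speedup flattens out as workers are added: the bound can never exceed 1 / (1 - p), however many cores are available.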

                • Richard Finney
                  Senior Member
                  • Feb 2009
                  • 701

                  #9
                  If you have the source, edit the makefile to get rid of "-g" (turn "debug on" to "debug off") and bump -O2 up to -O3. Sometimes -Os alone (optimize for size) does the trick. The reason -Os can work is that smaller code is more likely to stay in cache, and keeping as much as possible in the L1 or L2 cache is a great improvement on a modern CPU. This might get you a little boost. Also, if you have access to the Intel C compiler, you might want to use that.

                  What works for me is to keep threads at one (1) but launch 4 bwa processes (or as many as you have cores on the machine) at once. Example: split the input fasta into 4 files and do this at the command line:

                  ./job1 &
                  ./job2 &
                  ./job3 &
                  ./job4 &
                  wait
                  echo "did all 4, dude, now ... check results"
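
The same launch-and-wait pattern can be sketched in Python for anyone who would rather script it than write separate job files. The file names are placeholders, and the helper names are made up; the `bwa aln` options used are `-t` (thread count) and `-f` (output file).

```python
import subprocess


def run_in_parallel(commands):
    """Start every command at once, then wait for all; return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


def bwa_aln_commands(reference, chunks):
    """Build one single-threaded 'bwa aln' command per input chunk.

    'reference' is the index prefix and 'chunks' the split read files;
    both are placeholders supplied by the caller.
    """
    return [
        ["bwa", "aln", "-t", "1", "-f", f"{chunk}.sai", reference, chunk]
        for chunk in chunks
    ]
```

For example, `run_in_parallel(bwa_aln_commands("hg.fa", ["chunk_0.fastq", "chunk_1.fastq"]))` would start two single-threaded alignments side by side and return their exit codes once both have finished. Note that each process loads its own copy of the index, so memory use grows with the number of processes.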

                  • yujiro
                    Junior Member
                    • Jul 2010
                    • 5

                    #10
                    thanks

                    hi guys,

                    thanks a lot for the insightful comments. it is a bit puzzling why multithreading should not work over multiple cpu's, but i will have a look at the source code. for the time being, splitting input files according to the number of cpu's or threads will save me a lot of time.

                    • dp05yk
                      Member
                      • Dec 2010
                      • 66

                      #11
                      Whether or not multithreading works over multiple nodes is dependent on how your system hardware works. Threads need to share RAM and global variables so if your nodes each have their own RAM then threads cannot co-exist on each node. I'm guessing that if your nodes had some sort of shared memory space then it would be possible to utilize all 8 cores, but I'm no expert on computer architecture so I couldn't tell you how to go about looking into this.

                      • RDW
                        Member
                        • Oct 2008
                        • 63

                        #12
                        I've just run a quick bwa aln test on a random fastq file, and found that, on my system at least, bwa benefits from multiple cores spread across two processors and from hyperthreading.

                        The workstation has 2 Intel x5690 processors, each with 6 cores, so a total of 12 cores. With hyperthreading enabled in the BIOS I get 24 'virtual cores'. Memory node interleaving is currently set to SMP mode - I haven't tested NUMA. I'm running a recent x64 Linux kernel. Time to complete job:

                        00:13:44 - 6 threads - HT disabled
                        00:07:45 - 12 threads - HT disabled
                        00:07:49 - 24 threads - HT disabled

                        00:14:05 - 6 threads - HT enabled
                        00:07:42 - 12 threads - HT enabled
                        00:05:33 - 24 threads - HT enabled

                        So on this system it's best to use as many threads as there are cores (or virtual cores with hyperthreading enabled) for the bwa aln step. Of course things will get much more complicated if I want to optimise for an entire pipeline with several single thread bottlenecks!
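
The timings above can be turned into speedups relative to the 6-thread run with a few lines of Python. The numbers are copied from the table; the helper names are made up.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (hyperthreading mode, thread count) -> wall-clock time from the table above
timings = {
    ("HT disabled", 6): "00:13:44",
    ("HT disabled", 12): "00:07:45",
    ("HT disabled", 24): "00:07:49",
    ("HT enabled", 6): "00:14:05",
    ("HT enabled", 12): "00:07:42",
    ("HT enabled", 24): "00:05:33",
}


def relative_speedup(timings, mode, base_threads=6):
    """Speedup of each run in 'mode' relative to the base_threads run."""
    base = to_seconds(timings[(mode, base_threads)])
    return {
        threads: base / to_seconds(t)
        for (m, threads), t in timings.items()
        if m == mode
    }


for mode in ("HT disabled", "HT enabled"):
    for threads, speedup in sorted(relative_speedup(timings, mode).items()):
        print(f"{mode}, {threads:2d} threads: {speedup:.2f}x")
```

Going from 6 to 12 threads gives somewhat less than a 2x gain in both modes, and only the hyperthreaded 24-thread run improves further, which fits the Amdahl's Law point made earlier in the thread.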

                        • dp05yk
                          Member
                          • Dec 2010
                          • 66

                          #13
                          Makes sense - since you're able to specify a shared memory mode, that will break down the barriers originally in place by separate nodes.

                          • RDW
                            Member
                            • Oct 2008
                            • 63

                            #14
                            If anything, what surprised me was the apparently significant extra benefit of enabling hyperthreading, which I'd been sceptical about. I'll have to see if this helps with GATK, etc.

                            • earonesty
                              Member
                              • Mar 2011
                              • 52

                              #15
                              Obviously since your process is CPU-bound (using all cores), using faster hard drives or multiple threads shouldn't help. Giving bwa more ram could help I guess, but my experience with bwa is that it will make good use of RAM anyway.

                              Poor alignment quality slows down bwa because it has to work harder. One way to speed things up is to clean up your fastq file before feeding it to the aligner. (Removing N's, low quality sequence tails, adapter/primer reads, etc.)
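
As a rough illustration of the clean-up step, here is a minimal sketch that trims N's and low-quality bases from the 3' end of a single read. The threshold of 20 and the function name are arbitrary choices for the example, and the quality offset is an assumption: 33 is the Sanger/modern encoding, while Illumina pipelines before version 1.8 used 64. Adapter removal is not covered.

```python
def trim_read(seq, qual, min_quality=20, offset=33):
    """Drop trailing bases that are N or whose quality is below min_quality.

    seq and qual are the sequence and quality strings of one FASTQ record;
    offset is the ASCII offset of the quality encoding (assumed, see above).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and (
        seq[end - 1] in "Nn" or ord(qual[end - 1]) - offset < min_quality
    ):
        end -= 1
    return seq[:end], qual[:end]
```

Only the tail is trimmed: a low-quality base in the middle of a read is left alone, and reads that end up empty or very short would need to be dropped in a separate step.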

                              And of course, you can just use more than 1 machine.
                              Last edited by earonesty; 04-15-2011, 12:41 PM.
