
**stefanoberri** · 04-13-2011, 03:02 AM

Hi, I am not an expert in optimization, but here are my suggestions.

Originally posted by yujiro:

> it takes about 3 hours to align single lane reads from GA to human genome during which cpu usage remains 100% for all cores.

If you look, you will notice that for some of the time it uses only 1 core, and then, when it actually does the alignment, it uses all the cores you specified. So there is a step bound to one processor. Having 10 processors will make only the actual alignment faster, not the "other" step.

Originally posted by yujiro:

> 1. use fast ssd drives (sata3, raid arrays) instead of hdd, because access to huge sequence data might become a bottleneck

I doubt this is the bottleneck. Simply reading a file that takes 3 hours to align will take a few minutes at most.
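A quick way to check this on your own machine is to time a plain read of the input. This is only a rough sketch: `read_seconds` is a made-up helper name, and the number is meaningful only if the file is not already sitting in the operating system's page cache.

```shell
# Report how many whole seconds it takes just to read a file end to end.
# read_seconds is a hypothetical helper, not part of bwa.
# Caveat: a file that was read recently may be served from the page cache,
# which makes the disk look faster than it is.
read_seconds() {
    start=$(date +%s)
    cat "$1" > /dev/null    # read every byte, discard the data
    end=$(date +%s)
    echo "$((end - start))"
}
```

If `read_seconds reads.fastq` (file name assumed) reports a few minutes against an alignment time of hours, the disk is unlikely to be what is holding things up.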

No idea about CUDA

Originally posted by yujiro:

> 3. optimize bwa algorithm to make use of memories up to 64GB or more and cpu powers (multi-threading) up to 48 or more cores/threads. it is clear that multi-threading does not speed up things in the current bwa. however, i guess it must be possible to assign each 24 core/thread of cpu with individual chromosomes or long/short arms for instance.

More memory should not make a difference. bwa stores the whole indexed genome in about 3 GB of RAM. I doubt having more would help unless you redesign bwa.

Rather than splitting the reference genome, I would split the input fastq file, align the pieces independently and then merge the results back. Each sequence is aligned independently of the others, so this is safe. This way you might also gain some time during the step that uses only one processor.
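The splitting step can be sketched with the standard `split` utility. This assumes the usual 4-lines-per-read FASTQ layout (no wrapped sequence lines); `split_fastq` and the file names are made up for illustration.

```shell
# Split a FASTQ file into chunks holding a fixed number of reads each.
# Assumes 4 lines per record; split_fastq is a hypothetical helper name.
split_fastq() {
    input=$1             # path to the FASTQ file
    reads_per_chunk=$2   # number of reads in each chunk
    prefix=$3            # chunks are written as <prefix>aa, <prefix>ab, ...
    # Cutting on a multiple of 4 lines never splits a record in half.
    split -l "$((reads_per_chunk * 4))" "$input" "$prefix"
}
```

For example, `split_fastq reads.fastq 1000000 chunk_` would write `chunk_aa`, `chunk_ab` and so on, each of which can then be aligned on its own before the per-chunk results are merged.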

If anybody knows more, please let me know. I am also interested in understanding how to get the most out of my CPUs

**yujiro** · 04-13-2011, 04:15 AM

thanks

hi stefanoberri,

thanks a lot for your suggestions. i got your points about ssd drive and memory size. i particularly like your idea of splitting read sequences into smaller files since i can do it by writing a short script without touching bwa itself.

to rephrase my question about the memory size, i wonder if things get faster by using uncompressed genomic sequences. burrows wheeler transform serves two purposes here, if i understand correctly, one is to compress data to a few GB, another is to generate suffix trie which is used for finding substrings that match the query. i do not know how burrows wheeler compression and decompression processes are integrated with smith waterman algorithm in bwa. if you have a lot of memory and do not have to compress reference sequences, will you not save time by skipping the decompression process?

thanks a lot

**stefanoberri** · 04-13-2011, 04:24 AM

Hi. I don't think bwa compresses or decompresses data.
The suffix trie is a way to keep the whole genome indexed in memory efficiently, so that it fits in 3 GB of RAM. Maybe they made some decisions that trade off size against efficiency, but I don't think there is any compression involved (like there is in a BAM file, for instance) that you could skip.

**yujiro** · 04-13-2011, 05:14 AM

hi stefanoberri,

thanks for your comments. if you could kindly have a look at Li and Durbin's original bwa paper,

http://bioinformatics.oxfordjournals.org/content/25/14/1754.full.pdf

they mention in section 2.6 (reducing memory) that the inverse compressed suffix array (CSA) is obtained from the occurrence array and that the suffix array is calculated from the inverse CSA.

by so doing they reduce the memory requirement from n⌈log2 n⌉ to 4n + n⌈log2 n⌉/8. compression of the suffix array might be an integral part of burrows wheeler, but i wonder if these calculations can be skipped.
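To put numbers on those two expressions, here is a small worked calculation. It assumes both figures are in bits and uses a round n of 3 billion bases for a human-sized genome; `sa_memory_gb` is a made-up helper name.

```shell
# Evaluate n*ceil(log2 n) and 4n + n*ceil(log2 n)/8 for a given genome length.
# Assumes both expressions are in bits; output is converted to gigabytes.
# sa_memory_gb is a hypothetical helper for this illustration only.
sa_memory_gb() {
    awk -v n="$1" 'BEGIN {
        bits = 1
        while (2 ^ bits < n) bits++                  # bits = ceil(log2 n)
        full = n * bits / 8 / 1e9                    # full suffix array, in GB
        reduced = (4 * n + n * bits / 8) / 8 / 1e9   # reduced structures, in GB
        printf "ceil(log2 n) = %d, full SA = %.1f GB, reduced = %.1f GB\n", bits, full, reduced
    }'
}

sa_memory_gb 3000000000
# prints: ceil(log2 n) = 32, full SA = 12.0 GB, reduced = 3.0 GB
```

Under those assumptions the reduced figure comes out at about one byte per base, which lines up with the roughly 3 GB mentioned earlier in the thread.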

thanks,

**dp05yk** · 04-13-2011, 10:50 AM

Anyone can feel free to correct me - I may not be totally correct in this:

The thing with BWA is that it only runs on one processor at a time. So even though you have two processors, each with 4 cores (for 8 cores total), BWA will only run on one processor, and thus multithreading will be maxed out at 4 threads.

You can specify 8 threads, but I'm guessing 4 of the 8 threads will be spawned in the master process and will be executed in a pseudo-threaded manner. 8 threads will then, in effect, be slower than 4 threads, as 4 of the threads aren't being executed simultaneously and will increase competition and blockage over sequence distribution.

So multithreading actually *does* speed up BWA. 8 threads _will_ be faster than 4, as long as you are running BWA on an 8-core processor instead of a 4-core processor.

**dp05yk** · 04-13-2011, 11:06 AM

Okay, I just tested this theory and I think I'm correct. I performed 'aln' for 25 million Illumina reads on a 24-core processor with 12, 24, and 48 threads respectively. Here are my results:

12 Threads - 2:44
24 Threads - 1:34
48 Threads - 1:59

**nilshomer** · 04-13-2011, 11:47 AM

See Amdahl's Law: http://en.wikipedia.org/wiki/Amdahl%27s_law
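In short, if a fraction p of the run time can be parallelised, the best possible speed-up on N threads is 1 / ((1 - p) + p / N). A minimal sketch to play with the numbers follows; `amdahl_speedup` is a made-up name and the 0.9 below is an assumed fraction, not something measured on bwa.

```shell
# Upper bound on speed-up from Amdahl's law.
# $1 = parallel fraction p (0..1), $2 = number of threads N.
# amdahl_speedup is a hypothetical helper; p = 0.9 is an assumed value.
amdahl_speedup() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

amdahl_speedup 0.9 24   # prints 7.27
amdahl_speedup 0.9 48   # prints 8.42
```

With 10% of the work stuck on one core, doubling from 24 to 48 threads gains very little even in the ideal case. The law only gives an upper bound, so it does not by itself explain a run getting slower with more threads.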

**Richard Finney** · 04-13-2011, 12:43 PM

If you have the source, edit the makefile to get rid of "-g" (turn "debug on" to "debug off") and bump up -O2 to -O3. Sometimes -Os alone (optimize for size) does the trick. The reason -Os can work is that smaller code stays in cache, and keeping as much as possible in the L1 or L2 cache is a great improvement on a modern CPU. This might get you a little boost. Also, if you have access to the Intel C compiler, you might want to use that.

What works for me is to keep threads at one (1) but launch 4 bwa processes (or as many as you have cores on the machine) at once. Example: split the input fasta into 4 files and do this at the command line:

```shell
./job1 &
./job2 &
./job3 &
./job4 &
wait
echo "did all 4, dude, now ... check results"
```
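The same pattern can be written as a loop so that it works for any number of chunks. This is a sketch only: `run_per_chunk` is a made-up helper, and the job passed to it can be any command or shell function that takes one chunk path as its argument.

```shell
# Start one background job per chunk file, then block until all have finished.
# run_per_chunk is a hypothetical helper; $1 is the command or shell function
# to run, and the remaining arguments are the chunk paths.
run_per_chunk() {
    job=$1
    shift
    for chunk in "$@"; do
        "$job" "$chunk" &    # each chunk gets its own process
    done
    wait                     # returns once every background job has exited
}
```

A wrapper such as `align_one() { bwa aln -t 1 ref.fa "$1" > "$1.sai"; }` (reference name assumed) could then be launched with `run_per_chunk align_one chunk_*`. Keep the number of chunks at or below the number of cores, otherwise the processes just compete with each other.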

**yujiro** · 04-14-2011, 02:49 AM

thanks

hi guys,

thanks a lot for insightful comments. it is a bit puzzling why multithreading should not work over multiple cpu's, but i will have a look at the source code. for the time being, splitting input files into the number of cpu's or threads will greatly save my time.

**dp05yk** · 04-14-2011, 05:55 AM

Whether or not multithreading works over multiple nodes is dependent on how your system hardware works. Threads need to share RAM and global variables so if your nodes each have their own RAM then threads cannot co-exist on each node. I'm guessing that if your nodes had some sort of shared memory space then it would be possible to utilize all 8 cores, but I'm no expert on computer architecture so I couldn't tell you how to go about looking into this.

**RDW** · 04-15-2011, 10:21 AM

I've just run a quick bwa aln test on a random fastq file, and found that, on my system at least, bwa benefits from multiple cores spread across two processors and from hyperthreading.

The workstation has 2 Intel x5690 processors, each with 6 cores, so a total of 12 cores. With hyperthreading enabled in the BIOS I get 24 'virtual cores'. Memory node interleaving is currently set to SMP mode - I haven't tested NUMA. I'm running a recent x64 Linux kernel. Time to complete job:

00:13:44 - 6 threads - HT disabled
00:07:45 - 12 threads - HT disabled
00:07:49 - 24 threads - HT disabled

00:14:05 - 6 threads - HT enabled
00:07:42 - 12 threads - HT enabled
00:05:33 - 24 threads - HT enabled

So on this system it's best to use as many threads as there are cores (or virtual cores with hyperthreading enabled) for the bwa aln step. Of course things will get much more complicated if I want to optimise for an entire pipeline with several single thread bottlenecks!
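For anyone who wants the timings above as speed-up factors, here is a small sketch that converts hh:mm:ss to seconds and divides by the first row. `speedups` is a made-up helper; the input lines are the HT-enabled figures from this post.

```shell
# Print run time in seconds and speed-up relative to the first input row.
# Input format: "hh:mm:ss threads", one run per line.
# speedups is a hypothetical helper name.
speedups() {
    awk '{
        split($1, t, ":")
        secs = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR == 1) base = secs          # first row is the reference run
        printf "%s threads: %d s, %.2fx\n", $2, secs, base / secs
    }'
}

printf '00:14:05 6\n00:07:42 12\n00:05:33 24\n' | speedups
# prints:
# 6 threads: 845 s, 1.00x
# 12 threads: 462 s, 1.83x
# 24 threads: 333 s, 2.54x
```

So going from 6 to 24 threads with hyperthreading is about 2.5 times faster, well short of 4 times.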

**dp05yk** · 04-15-2011, 10:26 AM

Makes sense - since you're able to specify a shared memory mode, that will break down the barriers originally put in place by separate nodes.

**RDW** · 04-15-2011, 10:38 AM

If anything, what surprised me was the apparently significant extra benefit of enabling hyperthreading, which I'd been sceptical about. I'll have to see if this helps with GATK, etc.

**earonesty** · 04-15-2011, 12:39 PM

Obviously, since your process is CPU-bound (using all cores), using faster hard drives or multiple threads shouldn't help. Giving bwa more RAM could help I guess, but my experience with bwa is that it will make good use of RAM anyway.

Poor alignment quality slows down bwa because it has to work harder. One way to speed things up is to clean up your fastq file before feeding it to the aligner. (Removing N's, low quality sequence tails, adapter/primer reads, etc.)
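As one simple example of that kind of cleanup, the sketch below drops every read whose sequence contains an N. It assumes 4-line FASTQ records, and `drop_n_reads` is a made-up helper; trimming low-quality tails or adapters is a separate job better left to a dedicated tool.

```shell
# Keep only FASTQ records whose sequence line contains no N.
# Assumes 4 lines per record; drop_n_reads is a hypothetical helper.
# Reads from stdin, writes the surviving records to stdout.
drop_n_reads() {
    awk '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (seq !~ /[Nn]/) print header "\n" seq "\n" plus "\n" $0 }
    '
}
```

Usage would look like `drop_n_reads < reads.fastq > reads.clean.fastq` (file names assumed). For paired-end data both files need to be filtered together so the mates stay in sync, which this sketch does not do.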

And of course, you can just use more than 1 machine.


how can i speed up bwa?
