SEQanswers

Old 04-13-2011, 02:03 AM   #1
yujiro
Junior Member
 
Location: tokyo

Join Date: Jul 2010
Posts: 5
How can I speed up BWA?

Hello,

I do ChIP-seq on Illumina and very much appreciate the information in this forum. This is my first post, and I would like to ask about possible ways to speed up BWA.

Currently I use a workstation with two Xeon E5420 CPUs (4 cores each) and 24 GB of memory. It takes about 3 hours to align a single lane of GA reads to the human genome, during which CPU usage remains at 100% on all cores.

I have seen an attempt by a Cambridge group to speed up BWA with CUDA (called BarraCUDA). Interestingly, they show that 4 versus 8 CPU cores does not make much difference in run time, and the performance of their CUDA aligner was not much different from that of BWA with 4-8 CPU cores.

How much difference do you predict it will make if I:

1. use fast SSD drives (SATA3, RAID arrays) instead of HDDs, since access to the huge sequence data might become a bottleneck;

2. run BarraCUDA on an NVIDIA Tesla S2050, which has 1792 CUDA cores, or even on a massively parallel supercomputer (with the necessary optimization of the algorithm), if sequence alignment can be effectively broken up into thousands of parallel processes;

3. optimize the BWA algorithm to make use of 64 GB or more of memory and 48 or more cores/threads (multi-threading). It seems that multi-threading does not speed things up in the current BWA; however, I guess it must be possible to assign individual chromosomes, or long/short arms, to each of 24 cores/threads, for instance.

Any suggestions would be most appreciated,
Old 04-13-2011, 03:02 AM   #2
stefanoberri
Member
 
Location: Cambridge area, UK

Join Date: Jan 2010
Posts: 35

Hi, I am not an expert in optimization, but here are my suggestions.

Quote:
Originally Posted by yujiro View Post
It takes about 3 hours to align a single lane of GA reads to the human genome, during which CPU usage remains 100% for all cores.
If you watch it run, you will notice that for some of the time it uses only 1 core, and then, when it actually does the alignment, it uses all the cores you specified. So there is a step bound to one processor, and having 10 processors will only make the actual alignment faster, not the "other" step.

Quote:
Originally Posted by yujiro View Post
1. use fast SSD drives (SATA3, RAID arrays) instead of HDDs, since access to the huge sequence data might become a bottleneck
I doubt this is the bottleneck. Simply reading a file that takes 3 hours to align would take a few minutes at most.

No idea about CUDA.

Quote:
Originally Posted by yujiro View Post
3. optimize the BWA algorithm to make use of 64 GB or more of memory and 48 or more cores/threads (multi-threading). It seems that multi-threading does not speed things up in the current BWA; however, I guess it must be possible to assign individual chromosomes, or long/short arms, to each of 24 cores/threads, for instance.
More memory should not make a difference: BWA stores the whole indexed genome in about 3 GB of RAM. I doubt having more would help unless you redesign BWA.

Rather than splitting the reference genome, I would split the input FASTQ file, align the pieces independently, and then merge the results back together; each sequence is aligned independently anyway. This way you might gain some time in the phase where only one processor is used.
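A minimal sketch of that split-and-merge idea in shell. Everything here is illustrative: the input is a synthetic FASTQ generated on the spot, and a placeholder `align` function stands in for the real per-chunk aligner call (something like `bwa aln ref.fa chunk > chunk.sai` followed by `bwa samse`), so the sketch runs anywhere.

```shell
# Synthetic input: 5 tiny FASTQ records (replace with your real file).
printf '@r%s\nACGT\n+\nIIII\n' 1 2 3 4 5 > reads.fastq

# Chunk size must be a multiple of 4 lines so no FASTQ record is cut in half.
nlines=$(wc -l < reads.fastq)
per_chunk=$(( ((nlines / 4 + 3) / 4) * 4 ))   # ceil(reads/4) records per chunk
split -l "$per_chunk" reads.fastq chunk_

# Placeholder for the real aligner, e.g.: bwa aln ref.fa "$1" > "$1.sai"
align() {
    wc -l < "$1" > "$1.out"
}

# One independent process per chunk, run in parallel.
for c in chunk_aa chunk_ab chunk_ac; do
    align "$c" &
done
wait

# Merge the per-chunk outputs. For real SAM output you would keep the
# @-header lines from the first chunk only and strip them from the rest.
cat chunk_*.out > merged.out
```

The point of the arithmetic is simply that `split -l` must receive a multiple of 4, since FASTQ records span exactly four lines.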

If anybody knows more, please let me know. I am also interested in understanding how to get the most out of my CPUs.
Old 04-13-2011, 04:15 AM   #3
yujiro
Junior Member
 
Location: tokyo

Join Date: Jul 2010
Posts: 5
Thanks

hi stefanoberri,

Thanks a lot for your suggestions. I take your points about the SSD drives and memory size. I particularly like your idea of splitting the read sequences into smaller files, since I can do that with a short script without touching BWA itself.

To rephrase my question about memory size: I wonder if things would get faster using uncompressed genomic sequences. If I understand correctly, the Burrows-Wheeler transform serves two purposes here: one is to compress the data down to a few GB, the other is to generate the suffix trie used for finding substrings that match the query. I do not know how the Burrows-Wheeler compression and decompression steps are integrated with the Smith-Waterman algorithm in BWA, but if you have a lot of memory and do not have to compress the reference sequences, would you not save time by skipping the decompression step?

thanks a lot
Old 04-13-2011, 04:24 AM   #4
stefanoberri
Member
 
Location: Cambridge area, UK

Join Date: Jan 2010
Posts: 35

Hi. I don't think bwa compresses or decompresses data.
The suffix trie is a way to index the whole genome in memory efficiently, so that it fits in 3 GB of RAM. Maybe they made some design decisions to trade off size against efficiency, but I don't think there is any compression involved (like there is in a BAM file, for instance) that you could skip.
Old 04-13-2011, 05:14 AM   #5
yujiro
Junior Member
 
Location: tokyo

Join Date: Jul 2010
Posts: 5

hi stefanoberri,

Thanks for your comments. If you could kindly have a look at Li and Durbin's original BWA paper,

http://bioinformatics.oxfordjournals.../1754.full.pdf

in section 2.6 ("reducing memory") they mention that the inverse compressed suffix array (CSA) is obtained from the occurrence array, and that the suffix array is calculated from the inverse CSA.

By doing so they reduce the memory requirement from n⌈log2 n⌉ bits to 4n + n⌈log2 n⌉/8 bits. Compression of the suffix array might be an integral part of the Burrows-Wheeler approach, but I wonder if these calculations can be skipped.

thanks,
Old 04-13-2011, 10:50 AM   #6
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Anyone should feel free to correct me; I may not be totally right about this:

The thing with BWA is that it only runs on one processor at a time. So even though you have two processors, each with 4 cores (8 cores total), BWA will only run on one processor, and thus multithreading will max out at 4 threads.

You can specify 8 threads, but I'm guessing 4 of the 8 will be spawned in the master process and executed in a pseudo-threaded manner. Eight threads will then, in effect, be slower than 4, since 4 of the threads are not executing simultaneously and will increase contention over sequence distribution.

So multithreading actually *does* speed up BWA: 8 threads _will_ be faster than 4, as long as you are running BWA on an 8-core processor instead of a 4-core processor.
Old 04-13-2011, 11:06 AM   #7
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Okay, I just tested this theory and I think I'm correct. I ran 'aln' on 25 million Illumina reads on a 24-core machine with 12, 24, and 48 threads respectively. Here are my results:

12 Threads - 2:44
24 Threads - 1:34
48 Threads - 1:59
Old 04-13-2011, 11:47 AM   #8
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

See Amdahl's Law: http://en.wikipedia.org/wiki/Amdahl%27s_law
Old 04-13-2011, 12:43 PM   #9
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

If you have the source, edit the Makefile to get rid of "-g" (turn debugging off) and bump -O2 up to -O3. Sometimes -Os alone (optimize for size) does the trick; -Os can work because it keeps more of the code in cache, and keeping as much as possible in the L1 or L2 cache is a great improvement on a modern CPU. This might get you a little boost. Also, if you have access to the Intel C compiler, you might want to try that.
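For what it's worth, a sketch of that edit against a typical bwa Makefile (the exact variable names and default flags may differ between bwa versions):

```make
# before:  CFLAGS = -g -Wall -O2
CFLAGS = -Wall -O3        # drop -g, raise the optimization level
# or, to favour staying resident in the L1/L2 cache:
# CFLAGS = -Wall -Os
```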

What works for me is to keep threads at one (1) but launch 4 bwa processes (or as many as you have cores on the machine) at once. Example: split the input into 4 files and do this at the command line:

./job1 &
./job2 &
./job3 &
./job4 &
wait
echo "did all 4, dude, now ... check results"
Old 04-14-2011, 02:49 AM   #10
yujiro
Junior Member
 
Location: tokyo

Join Date: Jul 2010
Posts: 5
Thanks

hi guys,

Thanks a lot for the insightful comments. It is a bit puzzling why multithreading should not work across multiple CPUs, but I will have a look at the source code. For the time being, splitting the input files by the number of CPUs or threads will greatly save time.
Old 04-14-2011, 05:55 AM   #11
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Whether or not multithreading works across multiple nodes depends on your system hardware. Threads need to share RAM and global variables, so if your nodes each have their own RAM, then threads cannot be spread across nodes. I'm guessing that if your nodes had some sort of shared memory space it would be possible to utilize all 8 cores, but I'm no expert on computer architecture, so I couldn't tell you how to go about looking into this.
Old 04-15-2011, 10:21 AM   #12
RDW
Member
 
Location: London

Join Date: Oct 2008
Posts: 63

I've just run a quick bwa aln test on a random FASTQ file and found that, on my system at least, bwa benefits both from multiple cores spread across two processors and from hyperthreading.

The workstation has two Intel X5690 processors, each with 6 cores, for a total of 12 cores. With hyperthreading enabled in the BIOS I get 24 'virtual cores'. Memory node interleaving is currently set to SMP mode (I haven't tested NUMA), and I'm running a recent x64 Linux kernel. Time to complete the job:

00:13:44 - 6 threads - HT disabled
00:07:45 - 12 threads - HT disabled
00:07:49 - 24 threads - HT disabled

00:14:05 - 6 threads - HT enabled
00:07:42 - 12 threads - HT enabled
00:05:33 - 24 threads - HT enabled

So on this system it's best to use as many threads as there are cores (or virtual cores with hyperthreading enabled) for the bwa aln step. Of course things will get much more complicated if I want to optimise for an entire pipeline with several single thread bottlenecks!
Old 04-15-2011, 10:26 AM   #13
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Makes sense: since you're able to specify a shared-memory mode, that breaks down the barriers imposed by separate nodes.
Old 04-15-2011, 10:38 AM   #14
RDW
Member
 
Location: London

Join Date: Oct 2008
Posts: 63

If anything, what surprised me was the apparently significant extra benefit of enabling hyperthreading, which I'd been sceptical about. I'll have to see if this helps with GATK, etc.
Old 04-15-2011, 12:39 PM   #15
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52

Since your process is CPU-bound (it is using all cores), faster hard drives shouldn't help, and neither should more threads. Giving bwa more RAM could help, I guess, but in my experience bwa makes good use of RAM anyway.

Poor alignment quality slows bwa down because it has to work harder. One way to speed things up is to clean your FASTQ file before feeding it to the aligner (removing N's, low-quality sequence tails, adapter/primer reads, etc.).
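As an illustration of the simplest kind of cleanup, here is a filter that drops any read whose sequence contains an N. It is only a sketch: real pipelines would also trim quality tails and adapters, and the file names (and the synthetic two-read input) are made up.

```shell
# FASTQ records are 4 lines each: header, sequence, '+', qualities.
# Synthetic input: one clean read and one containing an N.
printf '@good\nACGTACGT\n+\nIIIIIIII\n@bad\nACNTACGT\n+\nIIIIIIII\n' > reads.fastq

# Buffer each record's first three lines, and print the record on its
# fourth line only if the sequence line contains no N.
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{ if (s !~ /N/) print h "\n" s "\n" p "\n" $0 }' reads.fastq > clean.fastq
```

A stricter version might tolerate one or two N's per read rather than none; the structure stays the same.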

And of course, you can just use more than 1 machine.

Last edited by earonesty; 04-15-2011 at 12:41 PM.
Old 04-15-2011, 01:41 PM   #16
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Quote:
Originally Posted by earonesty View Post
Poor alignment quality slows down bwa because it has to work harder. One way to speed things up is to clean up your fastq file before feeding it to the aligner. (Removing N's, low quality sequence tails, adapter/primer reads, etc.)
Correct me if I'm wrong, but don't N's speed BWA up? As soon as the sequence being processed exceeds a certain mismatch/indel threshold, the alignment for that sequence is aborted and BWA moves on to the next one.
Old 04-16-2011, 11:21 AM   #17
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52

Perhaps that's how it's supposed to work... but repeats, artifacts, primers, N's, etc. seem to slow things down for me. With a lot of N's I would agree. But imagine just a few: these get matched as "A or T or G or C", creating lots of potential matches and requiring each one to be evaluated more rigorously. If you trim them from the ends, it won't hurt your alignment at all (bwa may do this in newer versions?), but you can't trim them from the middle, and there they necessitate exhaustive searching of the possibilities.

Last edited by earonesty; 04-16-2011 at 11:36 AM.
Old 04-16-2011, 12:02 PM   #18
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Quote:
Originally Posted by earonesty View Post
A lot of N's - I would agree. But imagine a few... these get matched as "A or T or G or C" causing lots of potential matches, and requiring each match to be more rigorously evaluated.
They do not get matched as "A or T or G or C": an N in an input sequence is automatically treated as a mismatch.

An N in the reference sequence is randomly assigned A/T/G/C.

This is from the BWA paper, and I verified it in the code.
Old 04-16-2011, 03:23 PM   #19
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

Liang Ping and Darren Peters have pointed out that the multithreading in bwa aln is not optimal. It is fine for ~10 threads, but when you push to ~20 threads, a lot of CPU time will be wasted on synchronization. They have provided a patch which has been applied to:

https://github.com/lh3/bwa
Old 04-16-2011, 05:21 PM   #20
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52

Quote:
Originally Posted by dp05yk View Post
They do not get matched as "A or T or G or C". An N in an input sequence is automatically treated as a mismatch.

An N in the reference sequence is randomly assigned A/T/G/C.

This is from the BWA paper, as well, I verified this in the code.
Sorry, my explanation was a terse summary of the net effect, not a literal description of what happens.

Unlike a nucleotide mismatch, an "N" is not a mismatch for some reference bases and a match for others; it is equally a mismatch against all possible reference bases, which leaves more candidate locations with equal scores.

Here's a concrete example:

You can see how "ANGGCTGC" can match many more locations with equal likelihood than "ATGGCTGC", which will match, on average, around 4 times fewer locations: only those with a T in the second position. The first sequence would be a poor, but still passing, match at many more locations. As long as there are not enough N's to discredit the whole alignment, the aligner has to consider more candidate locations as potential (albeit poor) matches.
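That effect is easy to demonstrate with a toy experiment; here grep's '.' plays the role of an N matching any base, against a made-up 32-base "reference" string (none of this is bwa's actual matching machinery, just an illustration of the candidate-site counting):

```shell
ref=ATGGCTGCATGGCTGCAAGGCTGCACGGCTGC   # made-up reference

# Concrete second base: only the exact occurrences count.
echo "$ref" | grep -o 'ATGGCTGC' | wc -l    # prints 2

# Wildcard in the second position: twice as many candidate sites here.
echo "$ref" | grep -o 'A.GGCTGC' | wc -l    # prints 4
```

The ratio depends on the reference composition, of course; the point is only that a wildcard position multiplies the number of equally-scoring candidate locations the aligner must consider.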

Or maybe you're right... maybe it's not the N's; maybe it's a high error rate, or poor alignment quality, or something else. But whenever I try aligning an "uncleaned" FASTQ (adapters, skewing, or especially poor-quality tails), it's always a lot slower (and a waste of that time aligning stuff no one wants to see).

Perhaps (mostly to assure myself I'm not crazy) I'll cobble together an easy-to-test example if I have some time tomorrow.

Last edited by earonesty; 04-16-2011 at 05:30 PM.

Tags
bwa, cuda, gpu, hardware
