04-18-2011, 07:35 AM   #21
Using wgsim, it seems I was definitely wrong about N's...

Originally Posted by earonesty:
You can see how "ANGGCTGC" can match many more locations with equal likelihood as opposed to "ATGGCTGC" ... which will match, on average, around 4 times fewer locations... only those with T's in the second position. The first sequence would be a poor, but still passing match for many more locations. As long as there are not enough N's to discredit a whole alignment, the aligner has to consider more possible locations as potential (albeit poor) matches.
I ran a bunch of simulations with 100K reads against the yeast and human genomes, using seed lengths of 32 and 17. I ran each command twice and kept the better result (to avoid drive-cache issues).

When running with a seed length of 32:

1. Extra N's don't change things much. Too many extra N's speed things up, as sequences fail to align.

2. Increasing the error rate from 0.02 to 0.04 drops performance by 10% and 20% for the two seed lengths, but going much higher causes performance to increase again, as sequences fail to align.

So the error rate is probably the issue I see when cleaning up data, not N's. N's are just an artifact of the lower-quality tiles; I thought they were causing issues, but they weren't. I have definitely seen bwa run terribly slowly on data that wasn't cleaned up, and 10-20% in simulation isn't anywhere near the effect I've seen. I can't seem to reproduce it, though.
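For anyone who wants to poke at the error-rate vs. N question without wgsim, here is a rough Python sketch of degrading reads with a chosen substitution rate and N rate before handing them to an aligner. The function name `degrade_read` and its parameters are made up for illustration, and this is a much simpler error model than wgsim's (uniform per-base, no indels, no quality scores), so treat it as a toy and not a reproduction of the runs above.

```python
import random

BASES = "ACGT"

def degrade_read(read, error_rate, n_rate, rng):
    """Return a copy of `read` with random substitutions and N's injected.

    Hypothetical helper: each base independently becomes 'N' with
    probability `n_rate`, or a different base with probability `error_rate`.
    """
    out = []
    for base in read:
        roll = rng.random()  # uniform in [0, 1)
        if roll < n_rate:
            out.append("N")
        elif roll < n_rate + error_rate:
            # Substitute with one of the other bases, never the same one.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Example: the same read at the two error rates discussed above, no N's.
rng = random.Random(0)  # fixed seed so runs are repeatable
read = "ATGGCTGC" * 4
for rate in (0.02, 0.04):
    print(rate, degrade_read(read, rate, 0.0, rng))
```

Keeping the N rate and the substitution rate as separate knobs makes it possible to vary one while holding the other at zero, which is the comparison the two numbered points above are making.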

Last edited by earonesty; 04-18-2011 at 07:37 AM.
earonesty
04-20-2011, 06:12 PM   #22
Originally Posted by dp05yk:
They do not get matched as "A or T or G or C". An N in an input sequence is automatically treated as a mismatch.
It depends on which algorithm in BWA you use. I checked bwasw, and there Ns in the input sequence are replaced by a random base. Therefore, they might not be treated as a mismatch, and you will likely get different results when running BWA multiple times.
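To make the difference between these treatments of N concrete, here is a small Python sketch. None of this is BWA's actual code; the three function names are hypothetical, and the only point is to show why "N is always a mismatch" is deterministic while "N becomes a random base" can change from run to run unless the random seed is fixed. The wildcard version is the assumption from earlier in the thread, included for comparison.

```python
import random

def mismatches_n_as_mismatch(read, ref):
    """Count mismatches, with every N in the read counted as a mismatch."""
    return sum(1 for r, g in zip(read, ref) if r == "N" or r != g)

def mismatches_n_as_wildcard(read, ref):
    """Count mismatches, with N in the read matching any reference base."""
    return sum(1 for r, g in zip(read, ref) if r != "N" and r != g)

def replace_n_randomly(read, rng):
    """Replace each N in the read with a random base drawn from A/C/G/T."""
    return "".join(rng.choice("ACGT") if b == "N" else b for b in read)

read, ref = "ANGGCTGC", "ATGGCTGC"
print(mismatches_n_as_mismatch(read, ref))   # N counted against the read
print(mismatches_n_as_wildcard(read, ref))   # N costs nothing

# With random replacement the mismatch count depends on the draw:
# about one time in four the N turns into the matching 'T'.
for seed in range(4):
    filled = replace_n_randomly(read, random.Random(seed))
    print(seed, filled, mismatches_n_as_mismatch(filled, ref))
```

Under random replacement the same read can score as a perfect match in one run and as a one-mismatch hit in another, which is one way to picture the run-to-run differences described above.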
robs

bwa, cuda, gpu, hardware
All times are GMT -8.