04-18-2011, 08:35 AM   #21
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52

Using wgsim, it seems I was definitely wrong about N's...

Quote:
Originally Posted by earonesty
You can see how "ANGGCTGC" can match many more locations with equal likelihood as opposed to "ATGGCTGC" ... which will match, on average, around 4 times fewer locations... only those with T's in the second position. The first sequence would be a poor, but still passing match for many more locations. As long as there are not enough N's to discredit a whole alignment, the aligner has to consider more possible locations as potential (albeit poor) matches.
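The "around 4 times" figure in the quote can be checked with a short Python sketch. It treats N as a wildcard and counts exact-match windows in a random reference; the function names and the random reference are made up for illustration, and this is not how bwa scores or searches for N's.

```python
import random


def matches(pattern, window):
    """True if every non-N base in pattern equals the base in window."""
    return all(p == "N" or p == w for p, w in zip(pattern, window))


def count_hits(pattern, reference):
    """Count the windows of the reference that the pattern matches."""
    k = len(pattern)
    return sum(
        matches(pattern, reference[i:i + k])
        for i in range(len(reference) - k + 1)
    )


def concrete_sequences(pattern):
    """Number of distinct A/C/G/T sequences the pattern can stand for."""
    return 4 ** pattern.count("N")


if __name__ == "__main__":
    # Hypothetical random "genome"; a real genome is not uniformly random.
    rng = random.Random(0)
    reference = "".join(rng.choices("ACGT", k=1_000_000))
    print("ANGGCTGC hits:", count_hits("ANGGCTGC", reference))
    print("ATGGCTGC hits:", count_hits("ATGGCTGC", reference))
```

On a uniformly random reference the N-containing pattern should hit roughly four times as many windows, since it stands for four concrete 8-mers.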
I ran a bunch of simulations with 100K reads against the yeast and human genomes, trying seed lengths of 32 and 17. I ran each command twice and kept the better result (to avoid drive-cache issues).

When running with a seed length of 32:

1. Extra N's don't change things much. Too many extra N's speed things up, as sequences fail to align.

2. Increasing the error rate from .02 to .04 drops performance by 10% and 20% for the 2 seed lengths, but going much higher causes performance to increase, as sequences fail to align.
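As a rough back-of-envelope check on point 2, here is a small Python sketch of how quickly clean seeds become rare as the error rate rises. It assumes errors are independent and uniform across bases, which real reads are not, and `max_mismatches` is a hypothetical knob for "differences tolerated in the seed", not a measured bwa setting. It says nothing about runtime, only about how many seeds survive.

```python
from math import comb


def p_seed_ok(seed_len, error_rate, max_mismatches=0):
    """Probability that a seed of seed_len bases contains at most
    max_mismatches errors, assuming independent per-base errors."""
    return sum(
        comb(seed_len, k) * error_rate**k * (1 - error_rate) ** (seed_len - k)
        for k in range(max_mismatches + 1)
    )


if __name__ == "__main__":
    # The seed lengths and error rates used in the simulations above.
    for seed_len in (32, 17):
        for error_rate in (0.02, 0.04):
            p = p_seed_ok(seed_len, error_rate)
            print(f"seed={seed_len} error={error_rate}: P(error-free seed)={p:.3f}")
```

Under this simple model, doubling the error rate hurts the longer seed more, which is at least consistent with the two seed lengths reacting differently.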

So the error rate is probably the issue I see when cleaning up data, not N's. N's are just an artifact of the lower-quality tiles; I thought they were causing issues, but they weren't. I have definitely seen bwa run terribly slowly on data that wasn't cleaned up, and 10-20% in simulation isn't near the effect I've seen. I can't seem to reproduce it, though.

Last edited by earonesty; 04-18-2011 at 08:37 AM.