I'm trying to hash out a set of bwa parameters that I can use to screen Illumina 75-mer and 100-mer reads for contaminants. I'm expecting a large volume of environmental data, and I want to clean the incoming data by filtering out reads from certain organisms.
I'm worried that the default bwa parameters may prove too stringent, but I'm not sure which ones to relax without introducing too many false positives into my results.
I have a set of data spiked with contaminant to use for testing, but I don't really know where to start. The bwa man page says that setting an INT value for '-n' sets the "maximum edit distance", which seems to be the maximum number of mismatches allowed. I can watch the results shift as I vary that number, and in most cases it seems to make sense. But that seems at odds with the arguments '-l' and '-k', which are supposed to set the seed length (default 'inf', which I guess implies the full length of my query) and the maximum "edit distance" (mismatches?) in the seed (default 2). I can vary the '-n' value and allow more than 2 mismatches even though I'm using the default settings for '-l' and '-k', which I would have expected to kill the read if there are more than 2 mismatches.
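To make that kind of test repeatable, here is a minimal Python sketch of a '-n' sweep scored against the spiked reads. It assumes single-end reads run through `bwa aln` followed by `bwa samse`, with stdout redirected to files. The file names, the `sweep` prefix, and all function names are hypothetical and only there for illustration; the set of spiked read IDs has to come from however the test data was built.

```python
import subprocess
from typing import Dict, List, Optional, Set, Tuple

# (argv, path of the file that receives the command's stdout)
Command = Tuple[List[str], str]


def build_bwa_commands(ref: str, reads: str, max_diff: int,
                       seed_len: Optional[int] = None,
                       seed_diff: Optional[int] = None,
                       prefix: str = "sweep") -> List[Command]:
    """Build the `bwa aln` and `bwa samse` calls for one '-n' setting.

    '-l' and '-k' are only passed when given, so bwa's own defaults
    apply otherwise.
    """
    tag = f"{prefix}.n{max_diff}"
    aln = ["bwa", "aln", "-n", str(max_diff)]
    if seed_len is not None:
        aln += ["-l", str(seed_len)]
    if seed_diff is not None:
        aln += ["-k", str(seed_diff)]
    aln += [ref, reads]
    sai = f"{tag}.sai"
    samse = ["bwa", "samse", ref, sai, reads]
    return [(aln, sai), (samse, f"{tag}.sam")]


def run(commands: List[Command]) -> None:
    """Run each command, writing its stdout to the paired file."""
    for argv, out_path in commands:
        with open(out_path, "wb") as out:
            subprocess.run(argv, stdout=out, check=True)


def mapped_read_ids(sam_path: str) -> Set[str]:
    """Collect names of reads whose SAM FLAG lacks the 0x4 (unmapped) bit."""
    ids: Set[str] = set()
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):  # header line
                continue
            fields = line.split("\t")
            if not int(fields[1]) & 0x4:
                ids.add(fields[0])
    return ids


def score(flagged: Set[str], spiked: Set[str], total_reads: int) -> Dict[str, float]:
    """Compare reads flagged as contaminant with the known spiked-in reads."""
    true_pos = len(flagged & spiked)
    false_pos = len(flagged - spiked)
    clean = total_reads - len(spiked)
    return {
        "sensitivity": true_pos / len(spiked) if spiked else 0.0,
        "false_positive_rate": false_pos / clean if clean else 0.0,
    }
```

Looping `build_bwa_commands("contam.fa", "spiked.fq", n)` over a range of `n`, passing each result to `run`, and then scoring `mapped_read_ids(...)` against the spiked IDs gives one sensitivity and false-positive figure per setting, which makes it easier to see where relaxing '-n' stops paying off.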
Is there any more detailed discussion of parameter suggestions for real-world problems? I'm pretty old school, and BLAST had that wonderful O'Reilly BLAST book with a number of examples that gave you a starting point for different tasks, and I could tweak values from there. But I feel kind of lost with BWA. Can anyone suggest another resource beyond the man page and the BWT paper that might help me out?