I am trying to use bwa and samtools to map 76 bp reads from multiple bacterial strains back onto a reference sequence, with the ultimate goal of extracting SNP frequencies. To obtain accurate variant frequencies, it is important to me to remove PCR duplicates. It occurred to me that quality trimming reads with BWA using the "-q" flag during alignment could affect how well rmdup works downstream. Say 2 reads are PCR duplicates, but one is rather low-quality and is trimmed to a different length than the other. The alignment start and stop positions would no longer be the same for these duplicates, and they would not be filtered using samtools rmdup, which requires identical external coordinates.
Is this right? Does BWA do hard trimming of reads with the "-q" flag? Or does it ignore low quality bases when calculating the alignment, but still use them to determine alignment coordinates?
While I'm at it, could anyone explain how the BWA quality trimmer works? Please don't say "read the man page," I did, and was very confused by the explanation there:
Noob question #3: Does rmdup work for single read data or doesn't it? The samtools man page states:
However, there is an option "-s" to use rmdup on single read data, and when I apply it to an alignment, it looks to reduce the size of the resulting bam file. The man page and the command itself don't agree!
Is this right? Does BWA do hard trimming of reads with the "-q" flag? Or does it ignore low quality bases when calculating the alignment, but still use them to determine alignment coordinates?
While I'm at it, could anyone explain how the BWA quality trimmer works? Please don't say "read the man page," I did, and was very confused by the explanation there:
Code:
-q INT Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]
Code:
Samtools’ rmdup does not work for single-end data and does not remove duplicates across chromosomes. Picard is better.
Comment