SEQanswers (
-   Bioinformatics (
-   -   Need help defining BWA parameters for human screening (

jmartin 04-02-2010 12:07 PM

Need help defining BWA parameters for human screening
I'm working with metagenomic bacterial samples from various human body sites, and these samples have various levels of human contamination. I have a large number of 100mer Illumina reads and I'm trying to setup a pipeline that will use BWA to identify human contaminant from these metagenomic communities of reads. The reference I am using is basically NCBI build 36 human genome. I have build 37 as well, but its > 4Gb in size so for now I'm sticking with b36.

I've built a control query file of reads that are of known bacterial origin, and into that I have spiked a set of known human reads. And I'm trying to assess various BWA parameters using that control set.

BWA is doing surprisingly poorly in what I had thought would be a simple application of the aligner. My control set contains:

40 million - bacterial reads
4 million - human reads

(about 1 lane's worth of data from a 50Gb Illumina run which is our current standard).

My first thought was to simply try setting a simple mismatch/edit-distance cutoff. I've tried:

(parameter) - (result)
-n 15 - found 73.1% of human spike, falsely identified 3593/40mil bacteria
-n 20 - found 76.3% of human spike, falsely identified 3628/40mil bacteria
-n 25 - found 79.3% of human spike, falsely identified 3710/40mil bacteria

This seems quite low. Then I found some cases where a read that clearly had only 3 mismatches (confirmed using BLASTN) was NOT being found by BWA. After some head scratching, I realized that those 3 mismatches were occuring within the first 30 bases of the read, and it looks like the default seeding behavior of BWA for 100mer reads is to allow only 2 mismatches in the first 30 bases. In order to catch these cases where > 2 mismatches occur early in the read, I tried setting the seed to the full length of my read, and then allowing the same number of mismatches in the seed as I do in the full read. I tried:

-n 25 -k 25 -l 99 (for 100mer reads)
found 62.1% of human spike, falsely identified 31/40mil bacteria

but as you can see, those parameters did significantly worse than just using -n 25. This doesn't make sense to me because I had thought those parameters would push the seeding length up to 99bp, and allow 25 mismatches in the seed. And then allow those same 25 mismatches across the full read. So I had expected to find MORE of the human spike than did the -n 25 setting by itself.

Basically I just need to come up with a set of parameters that can do this screening job. I'd love a deeper explanation of why the -n 25 -k 25 -l 99 parameter set found fewer hits than -n 25 does by itself. But I am really hoping that someone here can suggest parameters I can use to do this human screen. Any suggestions?

John Martin

lh3 04-02-2010 01:21 PM

Firstly, please DO NOT use the toplevel file from Ensembl. It has >1Gbp ambiguous bases as space holder. You may use the b37 reference file here:

Secondly, bwa-short would not work well for large -n. This is by design. You may consider disable seeding with -l 10000, but I guess it will be impractically slow. What is the sequencing error rate you are simulating? It seems very high. Typical Illumina sequencing error rate is only ~1% which means an 100bp read only has a couple of errors.

jmartin 04-02-2010 02:23 PM

The expected actual variation between individuals is < 1%, but the sequencing error we're seeing from the Illumina 100mer reads varies quite a bit. I used a collection of 100mer reads from a different project being done here at WashU for my human control data, I did not simulate errors for that. They were real 100mer Illumina reads from a different individual (I did not use the Ensembl/NCBI human build for the control).

I am not sure what sequencing error rate I should expect to see in these Illumina 100mers. I do see some fraction of the reads containing a significant number of ambiguous bases (N). Maybe ~1-2% of the reads will have > 3 ambiguous bases. I was trying such a high -n setting to allow reads containing Ns to align to human.

What is the largest value of -n that bwa-short can safely use in alignments?

lh3 04-02-2010 03:22 PM

I usually recommend the default. For 100bp reads, the default allows 5 mismatches/gaps. You may use -n 6 or 7, but not more. If I were analyzing the data, I would throw those ~1-2% reads in the first place. In addition, are these purity filtered reads? Do the reads have poor-quality tail? You may try -q15 to trim the tail. In general, i guess your data is non-typical. For single-end 100bp reads, bwa should map more than 90% to human; with paired-end, even more.

jmartin 04-02-2010 03:52 PM

I will give your suggestion a try. I am not sure if the tail is poor quality or not for 100mer illumina reads, but I can use my control set to play around with the -q setting.

Its very interesting to me to hear that the max suggested -n value is 7. The fact that we are doing alignments with settings as high as -n 25 may explain some of the weird behaviors we are seeing with the aligner.

Thanks for the help.

All times are GMT -8. The time now is 09:22 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.