Thread | Thread Starter | Forum | Replies | Last Post |
Odd MACS Warning | jrmmhughes | Bioinformatics | 3 | 07-17-2016 08:57 PM |
BWA produces odd alignment results | dandyrilla | Bioinformatics | 2 | 11-28-2011 12:28 AM |
Odd characters in samtools mpileup output | Bueller_007 | Bioinformatics | 0 | 08-26-2011 05:33 PM |
ODD SOLID4 behaviour | RichardAllcock | SOLiD | 2 | 06-01-2011 08:49 AM |
An odd error message from Tophat | Mark.hz | Bioinformatics | 6 | 01-02-2011 10:34 PM |
#1 |
Member
Location: St. Louis Join Date: Dec 2009
Posts: 74
I'm seeing some alignments come out of BWA that don't make sense to me. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database.
Our understanding of the '-n' parameter is that it sets the max allowable edit distance between the query and the reference for a good alignment, so it's something like the max number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field shows 70-75 (NM:i is supposed to show the number of mismatches). How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?
Another oddity I've seen is that if we reduce the size of the input query file, the exact same reads that previously showed 70-75 mismatches now show < 20 mismatches. We've seen this weird error in fasta files of ~600k 100mer reads; when we break that query file into chunks of 5000 100mer reads, the same reads do not give this error. But the results from the small chunks don't entirely match the BLASTN results: the small chunks will either give the same hit as BLASTN, or will fail to find a hit that BLASTN finds.
Is this a known issue? Or could I be doing something wrong by failing to set some needed parameter? I'm using BWA 0.5.7 on a 64-bit machine.
Thanks, John Martin
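For anyone who wants to check their own output for the same symptom, here is a minimal sketch that pulls the NM:i values out of SAM lines and lists the reads above a chosen limit. It assumes plain tab-separated SAM text with the optional tags starting at column 12; the function names and the toy records are made up for illustration and are not part of BWA or samtools.

```python
def nm_values(sam_lines):
    """Yield (read_name, NM) for every mapped alignment line that has an NM:i tag."""
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        if int(fields[1]) & 0x4:          # FLAG bit 0x4 means the read is unmapped
            continue
        for tag in fields[11:]:           # optional TAG:TYPE:VALUE fields
            if tag.startswith("NM:i:"):
                yield fields[0], int(tag[len("NM:i:"):])
                break


def over_limit(sam_lines, max_nm=20):
    """Return the (read_name, NM) pairs whose NM value exceeds max_nm."""
    return [(name, nm) for name, nm in nm_values(sam_lines) if nm > max_nm]


def _sam_line(name, flag, nm=None):
    """Build a toy SAM record (hypothetical data, only for the demo below)."""
    seq, qual = "A" * 100, "I" * 100
    cols = [name, str(flag), "ref1", "10", "37", "100M", "*", "0", "0", seq, qual]
    if nm is not None:
        cols.append(f"NM:i:{nm}")
    return "\t".join(cols)


example = [
    "@SQ\tSN:ref1\tLN:1000",
    _sam_line("read1", 0, 72),   # suspicious: NM far above the -n value
    _sam_line("read2", 16, 3),   # reverse-strand hit with a small NM
    _sam_line("read3", 4),       # unmapped, ignored
]
print(over_limit(example))  # [('read1', 72)]
```

On a real run, passing an open file handle for the SAM file in place of `example` should work the same way, since the function only iterates over lines.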
#2 |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
Maybe Heng will comment, but I will take a shot at the first part.
[QUOTE=jmartin;16446]I'm seeing some alignments come out of BWA that don't make sense to me. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database. Our understanding of the '-n' parameter is that it sets the max allowable edit distance between the query and the reference for a good alignment, so it's something like the max number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field shows 70-75 (NM:i is supposed to show the number of mismatches). How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?[/QUOTE]
BWA uses the first 32 bases in its initial lookup, so your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option, so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).
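To put rough numbers on those two remarks, here is a back-of-the-envelope sketch. It computes the error rate implied by 20 differences in a 32-base seed, and counts how many distinct DNA strings lie within k substitutions of a seed of that length. This is only an illustration of why the search space blows up: it counts substitutions only, and it is not a model of what BWA actually does internally (BWA prunes its search and also considers gaps). The function names are made up for this example.

```python
from math import comb


def seed_error_rate(max_diff, seed_len):
    """Fraction of seed positions that are allowed to differ."""
    return max_diff / seed_len


def mismatch_neighbourhood(seed_len, max_mismatches):
    """Count DNA strings within max_mismatches substitutions of one seed.

    For exactly i mismatches: choose the i positions (comb), then pick one
    of the 3 other bases at each of them (3**i).
    """
    return sum(comb(seed_len, i) * 3 ** i for i in range(max_mismatches + 1))


print(seed_error_rate(20, 32))  # 0.625, i.e. "greater than 60%"

for k in (1, 2, 5, 10, 20):
    print(k, mismatch_neighbourhood(32, k))
```

Even under this simplified count, each extra allowed mismatch multiplies the number of candidate strings considerably, which fits with larger "-n" values failing to complete.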
#3 |
Member
Location: St. Louis Join Date: Dec 2009
Posts: 74
Heh, it actually does complete with -n 20 (I had tried -n 30 & -n 33 as well; those values did not complete on a 32 GB blade).
The reason I'd been trying such large values for -n is to overcome some ambiguous bases that seem to exist in my Illumina data. While I don't expect more than 1-2 real errors per 100bp of human data mapping to another human genome, we have a highly variable distribution of ambiguous bases in the data generated from some of our metagenomic samples (I'm trying to remove human sequence from metagenomic bacterial samples harvested from various human body sites). Some sites have ~30% of the reads showing >= 20 Ns in their sequence. It depends on the body site (different collection techniques are used at each site, by different sets of hands, and with different people making the library preps). But Heng has mentioned in another thread that bwa is really not designed for such sequence, and that it's not really safe to use -n values > 7. Anyway, I appreciate the reply.
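One way to see how many reads in a sample are affected before mapping is to count the Ns per read and set aside the heavily ambiguous ones. The sketch below assumes a simple FASTA input and uses the ">= 20 Ns" cutoff mentioned above; the function names and the toy reads are hypothetical, and whether to discard such reads or handle them some other way is a separate decision.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:            # ignore sequence text before any header
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_n_count(records, max_n=20):
    """Split reads into (keep, reject); a read with max_n or more Ns is rejected."""
    keep, reject = [], []
    for name, seq in records:
        n_count = seq.upper().count("N")
        (reject if n_count >= max_n else keep).append((name, n_count))
    return keep, reject


# Toy 100mer reads (made-up data): one clean, one with 5 Ns, one with 25 Ns.
fasta = [
    ">clean", "ACGT" * 25,
    ">few_n", "N" * 5 + "ACGTA" * 19,
    ">many_n", "N" * 25 + "ACG" * 25,
]
keep, reject = split_by_n_count(read_fasta(fasta))
print("kept:", keep)
print("rejected:", reject)
print("fraction rejected:", len(reject) / (len(keep) + len(reject)))
```

The same loop could also write the kept reads back out in fixed-size batches if smaller query files are wanted.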
#4 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
Quote:
#5 |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
Quote:
As for standardizing the options, that would be lovely (BFAST came out in mid/early 2008, look how many aligners there are now), but the differences between the algorithms are too substantial, in my opinion. For example, BWA and other BWT algorithms search (exponentially) over a certain # of mismatches/differences, while BFAST and spaced seed (index/hash) algorithms do not necessarily guarantee to find alignments up to a certain # of mismatches (say, 99% of reads with k mismatches). Therefore, there can be a parameter "up to k mismatches" in the former but not in the latter. Remember that this software is usually written by graduate students (who need to graduate) or post-docs. Maybe a faculty position (