SEQanswers

Old 04-02-2010, 02:16 PM   #1
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 74
BWA odd behaviors

I'm seeing some alignments come out of BWA that don't make sense to me. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database.

Our understanding of the '-n' parameter is that it sets the maximum allowable edit distance between the query and the reference for a good alignment, so it's something like the maximum number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field shows 70-75 (NM:i is supposed to show the number of mismatches).

How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?
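One way to see how many records are affected is to pull the NM:i tag out of each SAM line and list the reads whose value exceeds the '-n' setting. The following is a minimal sketch, not part of BWA or samtools; the function names and the three example records are made up for illustration, and it assumes plain tab-delimited SAM text in which unmapped reads carry flag bit 0x4.

```python
def nm_values(sam_lines):
    """Yield (read_name, NM) for every mapped record that carries an NM:i tag."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        if int(fields[1]) & 0x4:
            continue  # flag bit 0x4: read is unmapped
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                yield fields[0], int(tag[5:])
                break


def reads_over_limit(sam_lines, max_diff):
    """Return the (read_name, NM) pairs whose NM exceeds max_diff."""
    return [(name, nm) for name, nm in nm_values(sam_lines) if nm > max_diff]


# Hypothetical records, invented for this sketch (SEQ and QUAL left as '*').
example = [
    "@SQ\tSN:ref1\tLN:5000",
    "readA\t0\tref1\t101\t37\t100M\t*\t0\t0\t*\t*\tXT:A:U\tNM:i:3",
    "readB\t16\tref1\t900\t0\t100M\t*\t0\t0\t*\t*\tXT:A:R\tNM:i:72",
    "readC\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(reads_over_limit(example, 20))
```

Running the same function over the SAM output of the full query file and of the small chunks would give two lists of read names that can be compared directly.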

Another oddity I've seen is that if we reduce the size of the input query file, the exact same reads that previously showed 70-75 mismatches now show < 20 mismatches. We've seen this weird error with FASTA files of ~600k 100mer reads; when we break that query file into chunks of 5000 100mer reads, the same reads do not give this error. But the results from the small chunks do not entirely match the BLASTN results. Mainly, the small chunks will either give the same hit as BLASTN or fail to find a hit that BLASTN finds.

Is this a known issue? Or could I be doing something wrong by failing to set some needed parameter? I'm using BWA 0.5.7 on a 64-bit machine.

Thanks,
John Martin
Old 04-02-2010, 06:52 PM   #2
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Maybe Heng will comment, but I will take a shot at the first part.

[QUOTE=jmartin;16446]I'm seeing some alignments come out of BWA that don't make sense to me. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database.

Our understanding of the '-n' parameter is that it sets the maximum allowable edit distance between the query and the reference for a good alignment, so it's something like the maximum number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field shows 70-75 (NM:i is supposed to show the number of mismatches).

How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?
[/QUOTE]

BWA uses the first 32 bases in its initial lookup, so that your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).
Old 04-06-2010, 05:24 PM   #3
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 74

Heh, it actually does complete with -n 20 (I had tried -n 30 and -n 33 as well; those values did not complete on a 32 GB blade).

The reason I'd been trying such large values for -n is to overcome some ambiguous bases that seem to exist in my Illumina data. While I don't expect more than 1-2 real errors per 100bp of human data mapping to another human genome, we have a highly variable distribution of ambiguous bases that appear in the data generated from some of our metagenomic samples (I'm trying to remove human sequence from metagenomic bacterial samples harvested from various human body sites). Some sites have ~30% of the reads showing >= 20 Ns in their sequence. It depends on the body site (different collection techniques are used at each site, by different sets of hands, and with different people making the library preps).
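The "reads showing >= 20 Ns" figure can be measured per sample before choosing alignment parameters. Here is a rough sketch assuming plain FASTA input; the function names and the four toy reads are invented for illustration, and the cutoff of 20 simply mirrors the number quoted above.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fraction_with_many_ns(records, min_ns=20):
    """Return the fraction of reads containing at least min_ns ambiguous bases."""
    total = flagged = 0
    for _name, seq in records:
        total += 1
        if seq.upper().count("N") >= min_ns:
            flagged += 1
    return flagged / total if total else 0.0


# Four toy 100mer reads: r2 and r3 each contain exactly 20 Ns.
toy_fasta = [
    ">r1", "ACGT" * 25,
    ">r2", "N" * 20 + "ACGT" * 20,
    ">r3", "ACGTN" * 20,
    ">r4", "ACGT" * 24 + "NNNN",
]
print(fraction_with_many_ns(read_fasta(toy_fasta)))
```

Reads above the cutoff could then be set aside or handled separately rather than raising -n for the whole file.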

But Heng has mentioned in another thread that bwa is really not designed for such sequence, and that it's not really safe to use -n values > 7. Anyway, I appreciate the reply.
Old 04-07-2010, 03:01 AM   #4
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324

Quote:

BWA uses the first 32 bases in its initial lookup, so that your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).
BWA uses the -n parameter for the number of mismatches in the full read (-k is for the seed). Bowtie uses -n for the seed and then allows any # of mismatches in the 3' end. And then BFAST uses -n for the number of threads. Wouldn't it be great if these parameters were standardized...?
Old 04-07-2010, 08:53 AM   #5
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by Chipper
BWA uses the -n parameter for the number of mismatches in the full read (-k is for the seed). Bowtie uses -n for the seed and then allows any # of mismatches in the 3' end. And then BFAST uses -n for the number of threads. Wouldn't it be great if these parameters were standardized...?
You are absolutely right about BWA "-n"; that is my error.

As for standardizing the options, that would be lovely (BFAST came out in mid/early 2008; look how many aligners there are now), but the differences between the algorithms are too substantial in my opinion. For example, BWA and other BWT algorithms search (exponentially) over a certain # of mismatches/differences, while BFAST and spaced seed (index/hash) algorithms do not necessarily guarantee to find up to a certain # of mismatches (say 99% of reads with k # of mismatches). Therefore, there can be a parameter "up to k mismatches" in the former but not in the latter.
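To give a feel for why the search cost climbs so quickly with the mismatch limit, the snippet below counts how many distinct DNA strings lie within k substitutions of a read of a given length. This is only an illustrative upper bound under simplifying assumptions (substitutions only, no gaps, no pruning); it is not how BWA actually enumerates candidates, and the function name is hypothetical.

```python
from math import comb


def hamming_neighborhood(length, k, alphabet=4):
    """Count strings within k substitutions of a fixed string of this length."""
    # Choose i positions to change, each to one of the (alphabet - 1) other bases.
    return sum(comb(length, i) * (alphabet - 1) ** i for i in range(k + 1))


# Compare small limits with the large values discussed in this thread.
for k in (0, 1, 2, 7, 20):
    print(k, hamming_neighborhood(100, k))
```

For a 100mer, going from k = 0 to k = 2 already takes the count from 1 to 44,851, which is the kind of growth the "exponential" remark refers to.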

Remember that this software is usually written by graduate students (who need to graduate) or post-docs. Maybe a faculty position would allow us to give better support and standardization.