SEQanswers

01-18-2017, 06:11 AM   #1
MSchm
Junior Member
 
Location: Switzerland

Join Date: Jan 2017
Posts: 1
Several questions regarding BBMap/BBSplit

Several questions regarding BBMap/BBSplit:

1. The ambig flag -- Brian Bushnell states:
Set behavior on ambiguously mapped reads (with multiple top-scoring mapping locations)
--> How exactly are "ambiguously mapped reads" and "top-scoring" defined? Does this refer to reads that map "well enough" to be placed at several positions in the first place, even though the mapping quality at those positions can differ? Or what rule/statistic do you use? Do the top-scoring hits have to be "significantly better" than some lower-scoring hits (which would still map)? Is there some exact documentation, or where in the code would I find what exactly is going on? :-)

2. BBSplit: Are the criteria for ambig2 the same as ambig? (Except the fact that we are talking about different ref. genomes).

What would happen in the following three scenarios?

We have three top-scoring hits for one read (let's say they have score1 to score3; score1 is the best, but all are "very good" hits). Two of the hits are to ref1, with score1 and score3, and one is to ref2, with score2.

Scenario 1: I have ambig=best and ambig2=best --> which alignments get reported?
Scenario 2: I have ambig=all and ambig2=best --> which alignments get reported?
Scenario 3: I have ambig=best and ambig2=all --> which alignments get reported?

3. How to interpret the "XT" flag in the sam file (as shown in IGV):
- What does "XT = R" mean? Repeat?
- What does the flag "AM" mean?

Many thanks for this good tool!
Michael

Last edited by MSchm; 01-18-2017 at 10:51 PM.
01-19-2017, 03:57 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Hi Michael,

1) BBMap's scoring is based on an affine-transformed alignment. It's similar to calculating % identity, except that there are different weights for insertions, deletions, and substitutions; and extending an event (like going from a length 1 deletion to a length 2 deletion) has a diminishing penalty. A bonus is also added to the score of reads that are mapped in a properly-paired configuration.
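As a rough illustration of the scoring idea described above, here is a toy sketch. Every weight below is a made-up placeholder chosen only so that the ordering matches the description; these are not BBMap's hard-coded values, and the halving rule for indel extension is an assumption.

```python
# Toy scoring sketch. All weights are HYPOTHETICAL, not BBMap's real matrix.
MATCH = 100          # reward per matching base
SUB = -120           # penalty per substitution
INS_OPEN = -400      # penalty for the first base of an insertion
DEL_OPEN = -470      # penalty for the first base of a deletion
EXTEND_DECAY = 0.5   # assumed: each further indel base costs half the previous one
PAIR_BONUS = 50      # bonus for a properly-paired configuration


def indel_penalty(open_penalty, length):
    """Penalty for one indel event with a diminishing cost per extra base."""
    total = 0.0
    step = float(open_penalty)
    for _ in range(length):
        total += step
        step *= EXTEND_DECAY
    return total


def score_alignment(matches, subs, insertions=(), deletions=(), properly_paired=False):
    """Score one alignment; insertions/deletions are tuples of event lengths."""
    score = matches * MATCH + subs * SUB
    score += sum(indel_penalty(INS_OPEN, n) for n in insertions)
    score += sum(indel_penalty(DEL_OPEN, n) for n in deletions)
    if properly_paired:
        score += PAIR_BONUS
    return score
```

With these placeholder weights, a 100 bp read with one mismatch outscores the same read with one 1 bp deletion or with two mismatches, and a 2 bp deletion costs less than two separate 1 bp deletions.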

The top-scoring site is the one with the top score given the weight matrix (which is hard-coded). Generally, a site with one mismatch scores better than a site with one deletion, or a site with two mismatches, etc. The decision of whether a read is ambiguous depends on the "clearzone" which is by default roughly the penalty you get from 2 mismatches. So, if the best site A has 1 mismatch and the second-best site B has 2 mismatches, they will both be considered top-scoring sites and the read will be classified as ambiguous. If site A has 1 mismatch and site B has 5 mismatches, the read will not be considered ambiguous.

The clearzone is variable, though. Reads where the best site is perfect use a smaller clearzone (1.6*substitution penalty) while reads where the best site is a very poor match have a bigger clearzone (up to 8*substitution penalty). So if site A has 0 mismatches and site B has 2 mismatches, that would be unambiguous; but if site A has 20 mismatches and site B has 25 mismatches, that would be ambiguous.
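The clearzone rule can be sketched as a small mismatch-only model. The 1.6x and 8x bounds come from the post; the linear interpolation between them and the 20-mismatch cutoff for a "very poor" match are assumptions made for illustration, not BBMap's actual formula.

```python
SUB_PENALTY = 1.0                    # work in units of one substitution penalty
MIN_CLEARZONE = 1.6 * SUB_PENALTY    # from the post: best site is perfect
MAX_CLEARZONE = 8.0 * SUB_PENALTY    # from the post: best site is a very poor match
POOR_MATCH_MISMATCHES = 20           # HYPOTHETICAL point where the clearzone reaches its maximum


def clearzone(best_mismatches):
    """Clearzone grows linearly (an assumption) as the best site gets worse."""
    frac = min(1.0, best_mismatches / POOR_MATCH_MISMATCHES)
    return MIN_CLEARZONE + frac * (MAX_CLEARZONE - MIN_CLEARZONE)


def is_ambiguous(best_mismatches, second_mismatches):
    """A read is ambiguous if the second-best site falls inside the clearzone."""
    gap = (second_mismatches - best_mismatches) * SUB_PENALTY
    return gap <= clearzone(best_mismatches)
```

Under these assumptions the four cases from the post come out as described: (1, 2) and (20, 25) mismatches are ambiguous, while (1, 5) and (0, 2) are not.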

2) The definition of an ambiguous read is identical for "ambig" and "ambig2", score-wise. However, ambig2 only considers alignments to different references. If the top site was on ref X with 1 mismatch, and the second-best was on ref Y with 2 mismatches, that would be considered ambig and ambig2. But if both sites were on ref X, that would be considered ambig but not ambig2.

Scenario 1: The top-scoring site only will be reported. If there are multiple sites within the clearzone with different scores, it will use the best only. If there is a tie, it will use the reference you specified first (so, it would go to ref1.)
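The distinction between ambig and ambig2, and the tie-breaking in Scenario 1, can be sketched as follows. The `Hit` type, the function names, and the numeric scores are hypothetical; this only mirrors the behavior described in the post, not BBSplit's code.

```python
from collections import namedtuple

# Hypothetical record for one candidate alignment; higher score is better.
Hit = namedtuple("Hit", ["ref", "score"])


def classify(hits, clearzone):
    """Return (ambig, ambig2) for the hits of one read."""
    best = max(h.score for h in hits)
    top = [h for h in hits if best - h.score <= clearzone]
    ambig = len(top) > 1                       # several sites inside the clearzone
    ambig2 = len({h.ref for h in top}) > 1     # ...and they span different references
    return ambig, ambig2


def pick_best(hits, ref_order):
    """Best-scoring hit; on a tie, the reference specified first wins."""
    return min(hits, key=lambda h: (-h.score, ref_order.index(h.ref)))
```

For the three hits in the question, for example `[Hit("ref1", 98), Hit("ref2", 97), Hit("ref1", 96)]` with a clearzone of 2, `classify` returns `(True, True)` and `pick_best(hits, ["ref1", "ref2"])` returns the ref1 hit with score 98.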

Scenario 2-3: Ambig2 overrides ambig. I don't recommend setting ambig if you are using ambig2; just leave it as default. Actually, I don't recommend using BBSplit to produce sam output. The output is always valid, but it can lead to unexpected results, like a sam file that you expect to be full of alignments to the mouse genome in which some of the reported alignments are actually to the human genome. This happens for reads that map ambiguously to both human and mouse. You will get two sam files: for reads that map uniquely to one organism, the alignments are fine, but a read that maps ambiguously and is written to both files will have the same alignment reported in both. So, I suggest people only use BBSplit for fasta / fastq output, then remap the output if needed with BBMap. With ambig2=best or toss it doesn't really matter, since a read will go to at most one file, but with ambig2=all or split, the output is not what you are expecting.
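The suggested two-step workflow (split to fastq with BBSplit, then remap with BBMap) might be scripted along these lines. The helper names and all file names are hypothetical, and the commands are only built here, not run; check the parameter spellings and the actual output file names against the usage text of your installed BBTools version.

```python
def bbsplit_command(reads, refs, basename="out_%.fq", ambig2="best"):
    """Build a BBSplit call that bins reads into one fastq file per reference."""
    return [
        "bbsplit.sh",
        f"in={reads}",
        "ref=" + ",".join(refs),
        f"basename={basename}",     # % is replaced by the reference name
        f"ambiguous2={ambig2}",     # ambig is left at its default, as advised above
    ]


def bbmap_command(reads, ref, out_sam):
    """Build a BBMap call that remaps one binned fastq file to a single reference."""
    return ["bbmap.sh", f"in={reads}", f"ref={ref}", f"out={out_sam}"]


# Example with hypothetical file names: split first, then remap the mouse bin.
split_cmd = bbsplit_command("reads.fq", ["human.fa", "mouse.fa"], ambig2="toss")
remap_cmd = bbmap_command("out_mouse.fq", "mouse.fa", "mouse.sam")
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.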

3) I copied XT:A:R/XT:A:U from some other tool... probably bowtie2 or TopHat, when I was trying to make my output compatible with the Tuxedo pipeline. XT:A:R means the read was considered ambiguous, and XT:A:U means it was considered unambiguous.
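To read this tag outside IGV, a small parser over the optional SAM fields is enough. The function name and the example record are made up for illustration; the only assumptions taken from the post are the meanings of R and U.

```python
def xt_status(sam_line):
    """Return 'ambiguous', 'unambiguous', 'unknown', or None if there is no XT tag."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("XT:A:"):
            value = field[len("XT:A:"):]
            return {"R": "ambiguous", "U": "unambiguous"}.get(value, "unknown")
    return None


# Illustrative record, not real BBMap output.
example = "read1\t0\tref1\t100\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:1\tXT:A:R"
status = xt_status(example)
```

Header lines (starting with `@`) have fewer than 12 columns, so the function returns `None` for them.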
12-18-2017, 10:01 AM   #3
mcmc
Member
 
Location: Midwest, USA

Join Date: Jan 2016
Posts: 14

Quote:
Originally Posted by Brian Bushnell
The top-scoring site is the one with the top score given the weight matrix (which is hard-coded). Generally, a site with one mismatch scores better than a site with one deletion, or a site with two mismatches, etc. The decision of whether a read is ambiguous depends on the "clearzone" which is by default roughly the penalty you get from 2 mismatches. So, if the best site A has 1 mismatch and the second-best site B has 2 mismatches, they will both be considered top-scoring sites and the read will be classified as ambiguous. If site A has 1 mismatch and site B has 5 mismatches, the read will not be considered ambiguous.

The clearzone is variable, though. Reads where the best site is perfect use a smaller clearzone (1.6*substitution penalty) while reads where the best site is a very poor match have a bigger clearzone (up to 8*substitution penalty). So if site A has 0 mismatches and site B has 2 mismatches, that would be unambiguous; but if site A has 20 mismatches and site B has 25 mismatches, that would be ambiguous.
Is there a way to modify the clearzone, specifically to force mapping to the higher-scoring alignment? I am trying to use bbsplit & bbmap with several closely related (cultured) strains, and I would like to eliminate ambiguities where possible.
Thanks,
MCMC
Tags
bbmap
