
**Brian Bushnell** · 01-19-2017, 04:57 PM

Hi Michael,

1) BBMap's scoring is based on an affine-transformed alignment. It's similar to calculating % identity, except that there are different weights for insertions, deletions, and substitutions; and extending an event (like going from a length 1 deletion to a length 2 deletion) has a diminishing penalty. A bonus is also added to the score of reads that are mapped in a properly-paired configuration.

The top-scoring site is the one with the top score given the weight matrix (which is hard-coded). Generally, a site with one mismatch scores better than a site with one deletion, or a site with two mismatches, etc. The decision of whether a read is ambiguous depends on the "clearzone" which is by default roughly the penalty you get from 2 mismatches. So, if the best site A has 1 mismatch and the second-best site B has 2 mismatches, they will both be considered top-scoring sites and the read will be classified as ambiguous. If site A has 1 mismatch and site B has 5 mismatches, the read will not be considered ambiguous.

The clearzone is variable, though. Reads where the best site is perfect use a smaller clearzone (1.6*substitution penalty) while reads where the best site is a very poor match have a bigger clearzone (up to 8*substitution penalty). So if site A has 0 mismatches and site B has 2 mismatches, that would be unambiguous; but if site A has 20 mismatches and site B has 25 mismatches, that would be ambiguous.
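To make the four examples above concrete, here is a minimal sketch of the clearzone test. It is a toy model, not BBMap's actual code: scores are measured in units of one substitution penalty, only mismatches are counted (no indels, no pairing bonus), and the function name `is_ambiguous` is hypothetical. The clearzone values passed in (2.0, 1.6, 8.0) are the figures quoted above; using the full 8.0 for a best site with 20 mismatches, and treating a gap exactly equal to the clearzone as unambiguous, are both assumptions.

```python
def is_ambiguous(best_mismatches, second_mismatches, clearzone_subs):
    """Toy clearzone test (hypothetical, not BBMap's real implementation).

    All quantities are in units of one substitution penalty, and a site's
    penalty is assumed to be just its mismatch count.
    """
    score_gap = second_mismatches - best_mismatches
    # The second-best site is treated as co-top-scoring when its score
    # falls inside the clearzone below the best site's score.
    return score_gap < clearzone_subs


# (best site mismatches, second-best mismatches, clearzone in substitution units)
examples = [
    (1, 2, 2.0),    # default clearzone of roughly 2 mismatches
    (1, 5, 2.0),
    (0, 2, 1.6),    # perfect best site: smaller clearzone
    (20, 25, 8.0),  # very poor best site: assumed to reach the 8x upper bound
]

for best, second, clearzone in examples:
    label = "ambiguous" if is_ambiguous(best, second, clearzone) else "unambiguous"
    print(f"A={best} mismatches, B={second} mismatches, clearzone={clearzone}: {label}")
```

Under these assumptions the four cases come out as ambiguous, unambiguous, unambiguous, and ambiguous, matching the descriptions above.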

2) The definition of reads for "ambig" and "ambig2" is identical, score-wise. However, ambig2 only considers alignments to different references. If the top site was on ref X with 1 mismatch, and the second-best was on ref Y with 2 mismatches, that would be considered ambig and ambig2. But if both sites were on ref X, that would be considered ambig but not ambig2.
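The ambig / ambig2 distinction can be sketched as follows. This is only an illustration of the rule as described, not BBSplit's code: the function name `classify`, the `(ref_name, score)` tuples, and the made-up scoring (100 minus the number of mismatches, clearzone of 2) are all assumptions.

```python
def classify(sites, clearzone):
    """Return (ambig, ambig2) for one read.

    sites: list of (ref_name, score) tuples, higher score is better.
    clearzone: score window below the best score (hypothetical units).
    """
    best_score = max(score for _, score in sites)
    # Sites whose score falls inside the clearzone count as top-scoring.
    top_sites = [(ref, score) for ref, score in sites if best_score - score < clearzone]
    ambig = len(top_sites) > 1
    # ambig2 only cares whether the top-scoring sites span different references.
    ambig2 = len({ref for ref, _ in top_sites}) > 1
    return ambig, ambig2


# Toy scores: 100 minus the number of mismatches; clearzone of 2.
print(classify([("X", 99), ("Y", 98)], 2))  # both sites in clearzone, different refs
print(classify([("X", 99), ("X", 98)], 2))  # both sites in clearzone, same ref
print(classify([("X", 99), ("Y", 95)], 2))  # second site outside the clearzone
```

The first read is ambig and ambig2, the second is ambig only, and the third is neither.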

Scenario 1: Only the top-scoring site will be reported. If there are multiple sites within the clearzone with different scores, it will use only the best one. If there is a tie, it will use the reference you specified first (so it would go to ref1).
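A minimal sketch of that selection rule, assuming sites are given as `(ref_name, score)` tuples and that `ref_order` lists the references in the order they were specified. The function name `pick_reported_site` and the scores are hypothetical; this illustrates the rule as stated, not BBSplit's internals.

```python
def pick_reported_site(sites, ref_order):
    """Pick the single site to report: highest score first,
    ties broken by which reference was specified first.

    sites: list of (ref_name, score) tuples, higher score is better.
    ref_order: reference names in the order they were specified.
    """
    rank = {ref: index for index, ref in enumerate(ref_order)}
    # Sort key: best score first (negated), then earliest-specified reference.
    return min(sites, key=lambda site: (-site[1], rank[site[0]]))


order = ["ref1", "ref2"]
print(pick_reported_site([("ref2", 98), ("ref1", 98)], order))  # tie goes to ref1
print(pick_reported_site([("ref2", 99), ("ref1", 98)], order))  # best score wins
```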

Scenario 2-3: Ambig2 overrides ambig. I don't recommend setting ambig if you are using ambig2; just leave it as default. Actually, I don't recommend using BBSplit to produce sam output at all. The output is always valid, but it can lead to unexpected results, like a sam file that you expect to be full of alignments to the mouse genome in which some of the reported alignments are actually to the human genome. This happens for reads that map ambiguously to human and mouse. You will get two sam files; for reads that map uniquely to one organism, the alignments are fine, but a read that maps ambiguously gets the same alignment written to both files. So, I suggest people only use BBSplit for fasta / fastq output, then remap the output if needed with BBMap. With ambig2=best or toss it doesn't really matter, since a read will go to at most one file, but with ambig2=all or split, the sam output is not what you would expect.

3) I copied XT:A:R/XT:A:U from some other tool... probably bowtie2 or TopHat, when I was trying to make my output compatible with the Tuxedo pipeline. XT:A:R means the read was considered ambiguous, and XT:A:U means it was considered unambiguous.
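If you want to pull that flag out of a sam file yourself, a sketch like the following works on the optional fields (everything after the 11 mandatory sam columns). The function name `xt_status` and the example record are made up for illustration; the record is not real BBMap output.

```python
def xt_status(sam_line):
    """Return 'ambiguous', 'unambiguous', 'unknown', or None (no XT tag)
    for one tab-separated sam alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional TAG:TYPE:VALUE fields start after the 11 mandatory columns.
    for field in fields[11:]:
        if field.startswith("XT:A:"):
            value = field[len("XT:A:"):]
            return {"R": "ambiguous", "U": "unambiguous"}.get(value, "unknown")
    return None


# Fabricated example record, for illustration only.
record = "read1\t0\tref1\t100\t40\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:1\tXT:A:R"
print(xt_status(record))
```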

**mcmc** · 12-18-2017, 11:01 AM

Originally posted by Brian Bushnell:

The top-scoring site is the one with the top score given the weight matrix (which is hard-coded). Generally, a site with one mismatch scores better than a site with one deletion, or a site with two mismatches, etc. The decision of whether a read is ambiguous depends on the "clearzone" which is by default roughly the penalty you get from 2 mismatches. So, if the best site A has 1 mismatch and the second-best site B has 2 mismatches, they will both be considered top-scoring sites and the read will be classified as ambiguous. If site A has 1 mismatch and site B has 5 mismatches, the read will not be considered ambiguous.

The clearzone is variable, though. Reads where the best site is perfect use a smaller clearzone (1.6*substitution penalty) while reads where the best site is a very poor match have a bigger clearzone (up to 8*substitution penalty). So if site A has 0 mismatches and site B has 2 mismatches, that would be unambiguous; but if site A has 20 mismatches and site B has 25 mismatches, that would be ambiguous.

Is there a way to modify the clearzone, specifically to force a map to the higher scoring alignment? I am trying to use bbsplit & bbmap with several closely related (cultured) strains and I would like to eliminate ambiguities where possible.
Thanks,
MCMC


Several questions regarding BBMap/BBSplit
