  • Several questions regarding BBMap/BBSplit

    Several questions regarding BBMap/BBSplit:

    1. The ambig flag -- Brian Bushnell states:
    Set behavior on ambiguously mapped reads (with multiple top-scoring mapping locations)
    --> How exactly are "ambiguously mapped reads" and "top-scoring" defined? Is it about reads that map "well enough" to be placed at several positions in the first place, even though the mapping quality can differ between them? Or what rule/statistic do you use? Do the top hits have to be "significantly better" than lower-scoring hits (which would still map)? Is there exact documentation, or where in the code would I find what exactly is going on? :-)

    2. BBSplit: Are the criteria for ambig2 the same as ambig? (Except the fact that we are talking about different ref. genomes).

    What would happen if we have the following three scenarios:

    We have three top-scoring hits for one read (let's say they have score1 to score3, score1 is the best, but all are "very good" hits). We have two hits to ref1 with score1 and score3, one to ref2 with score2

    Scenario 1: I have ambig=best and ambig2=best --> which alignments get reported?
    Scenario 2: I have ambig=all and ambig2=best --> which alignments get reported?
    Scenario 3: I have ambig=best and ambig2=all --> which alignments get reported?

    3. How to interpret the "XT" flag in the sam file (as shown in IGV):
    - What does "XT = R" mean? Repeat?
    - What does the flag "AM" mean?

    Many thanks for this good tool!
    Michael
    Last edited by MSchm; 01-18-2017, 11:51 PM.

  • #2
    Hi Michael,

    1) BBMap's scoring is based on an affine-transformed alignment. It's similar to calculating % identity, except that there are different weights for insertions, deletions, and substitutions; and extending an event (like going from a length 1 deletion to a length 2 deletion) has a diminishing penalty. A bonus is also added to the score of reads that are mapped in a properly-paired configuration.

    The top-scoring site is the one with the top score given the weight matrix (which is hard-coded). Generally, a site with one mismatch scores better than a site with one deletion, or a site with two mismatches, etc. The decision of whether a read is ambiguous depends on the "clearzone" which is by default roughly the penalty you get from 2 mismatches. So, if the best site A has 1 mismatch and the second-best site B has 2 mismatches, they will both be considered top-scoring sites and the read will be classified as ambiguous. If site A has 1 mismatch and site B has 5 mismatches, the read will not be considered ambiguous.

    The clearzone is variable, though. Reads where the best site is perfect use a smaller clearzone (1.6*substitution penalty) while reads where the best site is a very poor match have a bigger clearzone (up to 8*substitution penalty). So if site A has 0 mismatches and site B has 2 mismatches, that would be unambiguous; but if site A has 20 mismatches and site B has 25 mismatches, that would be ambiguous.
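As a rough illustration of the clearzone rule above, here is a minimal Python sketch. It is not BBMap's actual code: the penalty unit, the linear growth of the clearzone between the 1.6x and 8x bounds, and the function names are all assumptions chosen only so that the four examples in this post come out as described.

```python
# Toy model of the clearzone decision; NOT BBMap's real scoring.
# Assumption: every mismatch costs one hypothetical penalty unit.
SUB_PENALTY = 1.0


def clearzone(best_mismatches: int) -> float:
    """Clearzone for a read whose best site has the given mismatches.

    Assumption: grows linearly from 1.6x the substitution penalty
    (perfect best site) and is capped at 8x (very poor best site).
    The slope 0.4 is made up for illustration.
    """
    return min(8.0, 1.6 + 0.4 * best_mismatches) * SUB_PENALTY


def is_ambiguous(best_mismatches: int, second_mismatches: int) -> bool:
    """True if the second-best site falls within the clearzone."""
    gap = (second_mismatches - best_mismatches) * SUB_PENALTY
    return gap <= clearzone(best_mismatches)


if __name__ == "__main__":
    for best, second in [(1, 2), (1, 5), (0, 2), (20, 25)]:
        print(best, second, is_ambiguous(best, second))
```

With these made-up numbers, the pairs (1, 2) and (20, 25) are classified as ambiguous and (1, 5) and (0, 2) as unambiguous, matching the examples in the text.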

    2) The definition of reads for "ambig" and "ambig2" is identical, score-wise. However, ambig2 only considers alignments to different references. If the top site was on ref X with 1 mismatch, and the second-best was on ref Y with 2 mismatches, that would be considered ambig and ambig2. But if both sites were on ref X, that would be considered ambig but not ambig2.
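The difference between the two classifications can be sketched as follows. This is only an illustration under assumptions: hits are modelled as (reference name, penalty) pairs where a lower penalty is better, the clearzone is passed in as a plain number, and `classify` is a hypothetical helper, not a BBSplit function.

```python
from typing import List, Tuple

# Hypothetical hit representation: (reference name, penalty; lower is better).
Hit = Tuple[str, float]


def classify(hits: List[Hit], clearzone: float) -> Tuple[bool, bool]:
    """Return (ambig, ambig2) for one read.

    ambig:  more than one site lies within the clearzone of the best site.
    ambig2: those top sites lie on more than one reference.
    """
    best = min(penalty for _, penalty in hits)
    top = [(ref, penalty) for ref, penalty in hits if penalty - best <= clearzone]
    ambig = len(top) > 1
    ambig2 = len({ref for ref, _ in top}) > 1
    return ambig, ambig2


if __name__ == "__main__":
    print(classify([("X", 1), ("Y", 2)], clearzone=2.0))  # sites on two refs
    print(classify([("X", 1), ("X", 2)], clearzone=2.0))  # both sites on ref X
```

In this toy model the first read counts as both ambig and ambig2, while the second counts as ambig only, as in the ref X / ref Y example above.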

    Scenario 1: The top-scoring site only will be reported. If there are multiple sites within the clearzone with different scores, it will use the best only. If there is a tie, it will use the reference you specified first (so, it would go to ref1.)
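The selection rule for Scenario 1 can be written as a small sketch. Again this is an assumption-laden illustration rather than BBSplit's implementation: hits are (reference name, penalty) pairs with lower penalties being better, and `ref_order` stands for the order in which the references were given on the command line.

```python
from typing import List, Tuple

# Hypothetical hit representation: (reference name, penalty; lower is better).
Hit = Tuple[str, float]


def pick_best(hits: List[Hit], ref_order: List[str]) -> Hit:
    """Pick the single site to report.

    The lowest penalty wins; on an exact tie, the reference that was
    specified first wins.
    """
    return min(hits, key=lambda hit: (hit[1], ref_order.index(hit[0])))


if __name__ == "__main__":
    refs = ["ref1", "ref2"]
    # The three hits from the question: score1 and score3 on ref1, score2 on ref2.
    print(pick_best([("ref1", 1), ("ref2", 2), ("ref1", 3)], refs))
    # An exact tie between the two references goes to ref1.
    print(pick_best([("ref2", 1), ("ref1", 1)], refs))
```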

    Scenario 2-3: Ambig2 overrides ambig. I don't recommend setting ambig if you are using ambig2; just leave it at the default. Actually, I don't recommend using BBSplit to produce sam output at all. The output is always valid, but it can lead to unexpected results, like a sam file that you expect to be full of alignments to the mouse genome in which some of the reported alignments are actually to the human genome. This happens for reads that map ambiguously to both human and mouse. You will get two sam files; for reads that map uniquely to one organism, the alignments are fine, but a read that maps ambiguously appears in both files with the same alignment in each. So, I suggest people only use BBSplit for fasta / fastq output, then remap the output with BBMap if needed. With ambig2=best or toss it doesn't really matter, since a read will go to at most one file, but with ambig2=all or split, the output is not what you would expect.

    3) I copied XT:A:R/XT:A:U from some other tool... probably bowtie2 or TopHat, when I was trying to make my output compatible with the Tuxedo pipeline. XT:A:R means the read was considered ambiguous, and XT:A:U means it was considered unambiguous.
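If you want to check this tag outside of IGV, a minimal Python sketch for reading it from one SAM alignment line could look like this. The helper name and the example record are made up for illustration; the only things taken from this post are the tag values R and U and their meaning.

```python
from typing import Optional


def xt_status(sam_line: str) -> Optional[str]:
    """Return the meaning of the XT:A tag on a SAM alignment line.

    Returns None if the line carries no XT:A tag. Optional tags start
    after the 11 mandatory SAM columns.
    """
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:
        if tag.startswith("XT:A:"):
            value = tag[len("XT:A:"):]
            return {"R": "ambiguous", "U": "unambiguous"}.get(value, value)
    return None


if __name__ == "__main__":
    # Made-up record, for illustration only.
    line = "r1\t0\tchr1\t100\t3\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:R\tNM:i:1"
    print(xt_status(line))
```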

    • #3
      Originally posted by Brian Bushnell
      The top-scoring site is the one with the top score given the weight matrix (which is hard-coded). Generally, a site with one mismatch scores better than a site with one deletion, or a site with two mismatches, etc. The decision of whether a read is ambiguous depends on the "clearzone" which is by default roughly the penalty you get from 2 mismatches. So, if the best site A has 1 mismatch and the second-best site B has 2 mismatches, they will both be considered top-scoring sites and the read will be classified as ambiguous. If site A has 1 mismatch and site B has 5 mismatches, the read will not be considered ambiguous.

      The clearzone is variable, though. Reads where the best site is perfect use a smaller clearzone (1.6*substitution penalty) while reads where the best site is a very poor match have a bigger clearzone (up to 8*substitution penalty). So if site A has 0 mismatches and site B has 2 mismatches, that would be unambiguous; but if site A has 20 mismatches and site B has 25 mismatches, that would be ambiguous.
      Is there a way to modify the clearzone, specifically to force a map to the higher scoring alignment? I am trying to use bbsplit & bbmap with several closely related (cultured) strains and I would like to eliminate ambiguities where possible.
      Thanks,
      MCMC
