View Single Post
Old 05-23-2014, 10:14 AM   #43
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707


For the first problem, try setting the flag "sam=1.4". By default, cigar strings are presented in Sam 1.3 format, which uses the 'M' symbol. The does not mean match - it means match OR mismatch! Sam format 1.4 fixed this bad design decision by using '=' for match and 'X' for mismatch, but by default I output sam format 1.3 because many downstream tools (such as old versions of Samtools) cannot process format 1.4. The lastest Samtools can, though. If you set "sam=1.4" then it will be more obvious how many substitutions the reads have. The NM tag tells you the edit distance, but BBMap does not determine score directly from edit distance; it uses affine transforms. So 4 consecutive substitutions will be penalized less than 4 non-adjacent substitutions, for example.

Please note that neither XM nor XS are defined in the sam spec; both are bowtie/tophat flags, and XM particularly is poorly defined. For that, I output the number of alignments with scores EXACTLY equal to the top score, even though when you set "ambig=all" I will output sites with scores that are within a certain threshold of the top score. So there can be more alignments printed than the value of XM.

For the XS tag, you can set it to "firststrand", "secondstrand", or "unstranded". But internally, "true" is equivalent to "firststand" which is equivalent to "unstranded"; all produce identical output. "secondstrand" produces the opposite sign of "firststrand". For firststrand/unstranded, if read 1 maps to the plus strand, it gets "XS:A:+"; for secondstrand, it gets "XT:A:-". Read 2 always gets the opposite of read 1.

The YS tag is the stop position - specifically, the reference location to which the rightmost base maps. So yes, it is pos+alignment length.

The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is:
(matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get:
(100)/(100+sqrt(400)) = 100/120= 83.3% identity.
If there were 50 individual single-base-pair deletions, however, the identity would be:
(100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance.
This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events.

For the crash bug, that's my fault for not testing the interaction between "ambig=all" and "stoptag"; they don't work together. I'll fix that. In the mean time, you can set "saa=f" (secondaryalignmentasterisks=false) which will circumvent the problem.

Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all, it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns shorter than maxindel will not be searched for. In plants or fungus, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate) as long as it encompasses the vast majority of introns.
Brian Bushnell is offline   Reply With Quote