Thanks Brian, that helps a lot!
There are just two small things I'd like to clarify:

Quote:
The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is:
(matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get:
(100)/(100+sqrt(400)) = 100/120= 83.3% identity.
If there were 50 individual single-basepair deletions, however, the identity would be:
(100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance.
This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events.
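If I implement the formula as described, the two deletion examples come out exactly as quoted. A minimal Python sketch (the function name is mine, not BBMap's):

```python
from math import sqrt

def yi_identity(matches, substitutions, ns, insertions, deletion_events):
    """Identity as described above: each deletion event contributes
    sqrt(length) instead of its full length."""
    f_del = sum(sqrt(d) for d in deletion_events)
    return matches / (matches + substitutions + ns + insertions + f_del)

# 100bp read: 50bp match, one 400bp deletion, 50bp match
print(round(100 * yi_identity(100, 0, 0, 0, [400]), 1))   # 83.3
# Same read with 50 separate 1bp deletions
print(round(100 * yi_identity(100, 0, 0, 0, [1] * 50), 1))  # 66.7
```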

I'm not sure I understand the meaning of f(deletions). Does it just mean that the length of each consecutive deletion run is square-rooted and the results are summed up?
Like in the following example from my data:
Quote:
read1
length: 202 (+11 Deletions)
cigar: 5=6X1=1X2=1X1=1I4=10D2=1D2=1X1=1X6=1I9=1I19=1I79=1I56=
Matches: 5+1+2+1+4+2+2+1+6+9+19+79+56=187
Mismatches: 6+1+1+1+1=10
Insertions: 1+1+1+1+1=5
Deletions: 10+1=11
Identity: 187/(187+10+5+11) = 87.8% (this is the "standard" calculation)
Identity: 187/(187+10+5+sqrt(10)+sqrt(1))= 90.7% (this is your calculation?)
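For reference, this is how I computed the read1 numbers; just a quick Python sketch of my interpretation, not BBMap's actual code:

```python
import re
from math import sqrt

def parse_cigar(cigar):
    """Return a list of (length, op) tuples from an extended cigar string."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([=XIDN])", cigar)]

ops = parse_cigar("5=6X1=1X2=1X1=1I4=10D2=1D2=1X1=1X6=1I9=1I19=1I79=1I56=")
m = sum(n for n, op in ops if op == "=")      # 187
x = sum(n for n, op in ops if op == "X")      # 10
i = sum(n for n, op in ops if op == "I")      # 5
dels = [n for n, op in ops if op == "D"]      # [10, 1]

standard = m / (m + x + i + sum(dels))
sqrt_del = m / (m + x + i + sum(sqrt(d) for d in dels))
print(round(100 * standard, 1), round(100 * sqrt_del, 1))  # 87.8 90.7
```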

If that is the case, why don't you treat insertions the same way? As in:
Quote:
read2
length: 264 (+2 Deletions)
cigar: 1X2=2X1=2X4=3X3=20I8=1D19=1D157=1X28=1X3=1X1=3I3=1X
Matches: 2+1+4+3+8+19+157+28+3+1+3=229
Mismatches: 1+2+2+3+1+1+1+1=12
Insertions: 20+3=23
Deletions: 1+1=2
Identity: 229/(229+12+23+2*sqrt(1)) = 86.1% (This is your calculation?)
Identity: 229/(229+12+sqrt(20)+sqrt(3)+2*sqrt(1)) = 91.9% (Why not this way?)
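And the read2 variants, including the hypothetical version that also square-roots insertion events (again just my sketch of the two calculations):

```python
import re
from math import sqrt

cigar = "1X2=2X1=2X4=3X3=20I8=1D19=1D157=1X28=1X3=1X1=3I3=1X"
ops = [(int(n), op) for n, op in re.findall(r"(\d+)([=XIDN])", cigar)]
m = sum(n for n, op in ops if op == "=")      # 229
x = sum(n for n, op in ops if op == "X")      # 12
ins = [n for n, op in ops if op == "I"]       # [20, 3]
dels = [n for n, op in ops if op == "D"]      # [1, 1]

# Square-root deletions only (my reading of the YI formula)
id_del = m / (m + x + sum(ins) + sum(sqrt(d) for d in dels))
# Square-root insertion events as well ("why not this way?")
id_both = m / (m + x + sum(sqrt(i) for i in ins) + sum(sqrt(d) for d in dels))
print(round(100 * id_del, 1), round(100 * id_both, 1))  # 86.1 91.9
```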


Quote:
Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though: if you are doing RNAseq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all, it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns longer than maxindel will not be searched for. In plants or fungi, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate) as long as it encompasses the vast majority of introns.

intronlength=10 is the default value, isn't it? I didn't set it, but I can neither find any cigar strings with deletions longer than 10D nor any with N runs shorter than 11N.
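That matches my mental model of what the flag does. A toy sketch (not BBMap's implementation): deletion operations longer than intronlength are simply relabeled 'N' in the cigar string, so with intronlength=10 one would expect at most 10D and at least 11N:

```python
import re

def relabel_introns(cigar, intronlength=10):
    """Rewrite D operations longer than intronlength as N (toy model)."""
    def sub(match):
        n, op = int(match.group(1)), match.group(2)
        if op == "D" and n > intronlength:
            return f"{n}N"
        return match.group(0)
    return re.sub(r"(\d+)([A-Z=])", sub, cigar)

print(relabel_introns("5=10D2=11D3="))  # 5=10D2=11N3=
```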
And one last question out of interest, since I worked in fungal genomics before switching to humans: why would you suggest a maxindel of 16000 for fungi? From what I analyzed (comparative genomics of 200+ asco- and basidiomycetes), introns are rarely longer than 100bp, and I never saw one longer than 3kb (though at the moment I'm not sure whether that one wasn't mitochondrial).