When running version 0.5.7 of BWA sampe to generate SAM files from paired end reads, I found that it was generally working fine, but occasionally I would get messages indicating unreasonable inferred maximum insert sizes. For example,
Running version 0.5.5 on the same data , I'd get a message like
and expect that the maximum gap size would default to the one specified by the -a argument. Normally, this kind of behavior seems to be caused by trying to match sets of reads that are unmatched or out of order (bwa expects them to be in order and matched). BWA then would use unmatched reads to infer the pair gap width and get these large numbers. I checked my data and found this was not the case, my reads were properly paired. However the SAM files indicated that some elements of a given pair were mapped to different chromosomes. I suspected that BWA 0.5.7 was not checking for excessively long gaps during its gap inference function. To work around this I modified a few lines of bwape.c in the infer_isize() function, changing
to
This effectively ignores any gap > 50000 during the gap inference process. It worked just fine, generating a maximum gap length of around 6500 (typical gap about 2500). This was consistent with other data in the set. While this is clearly a hack, it might be wise to set this ceiling to something like 10* (or 100*) the maximum size as defined by the -a command.
-- Brad
Code:
[bwa_sai2sam_pe_core] convert to sequence coordinate... [infer_isize] (25, 50, 75) percentile: (345105019, 629160202, 1506887065) [infer_isize] low and high boundaries: 49 and -2147483648 for estimating avg and std [infer_isize] inferred external isize from 29 pairs: 776017336.207 +/- 625290497.129 [infer_isize] skewness: 0.656; kurtosis: -0.642 [infer_isize] inferred maximum insert size: -861465347 (4.25 sigma)
Code:
[infer_isize] fail to infer insert size: weird pairing
Code:
if (p[0]->mapQ >= 20 && p[1]->mapQ >= 20) isizes[tot++] = (p[0]->pos < p[1]->pos)? p[1]->pos + p[1]->len - p[0]->pos : p[0]->pos + p[0]->len - p[1]->pos;
Code:
if (p[0]->mapQ >= 20 && p[1]->mapQ >= 20) { //isizes[tot++] = (p[0]->pos < p[1]->pos)? p[1]->pos + p[1]->len - p[0]->pos : p[0]->pos + p[0]->len - p[1]->pos; x = (p[0]->pos < p[1]->pos)? p[1]->pos + p[1]->len - p[0]->pos : p[0]->pos + p[0]->len - p[1]->pos; if (x < 50000) isizes[tot++] = x; // ignore outrageous values, particularly wrap around or different chrosomes...jh }
-- Brad
Comment