Hi,
I have a question about the "inferred insert size" field in the sam/bam files produced by BWA (this field is referred to as the "TLEN" field in the sam format spec).
It seems like BWA computes two different things depending on the extent to which two reads from a mate pair overlap. To illustrate, I'll
use two examples from 75bp PE RNA-seq data mapped to the genome with BWA version 0.5.9-r16. I've removed the fields after the TLEN field for clarity and
I use "insert size" to refer to the length of sequence between the adaptors.
Example 1:
HS4_07580:1:2304:6332:83264#1 99 1 1276418 60 75M = 1276491 148
Example 2:
HS4_07580:1:2103:1916:45225#1 97 1 793910 37 75M = 793838 3
1. Example 1 - inferred insert size is = (rightmost coordinate + read length) - leftmost coordinate : 1276491+75-1276418 = 148
This one makes sense to me, as we get the total length of the sequence between the two adaptors, barring any indels etc. This seems to be the behaviour
when mates overlap by <= 2bp or don't overlap at all.
2. In example 2, the inferred insert size is: (leftmost coordinate + read length) - rightmost coordinate : 793838+75-793910 = 3
That is, it gives the overlap of the two reads, which is not the same thing as the insert size. To me it seems like it should be
(793910 + 75) - 793838 = 147
This seems to be the behaviour when mates overlap by >=3bp
Perhaps this is not so bad, because you can compute the correct insert size yourself. However, BWA also calls relatively large inserts that overlap by a small amount as "improper pairs" (i.e. defined, I think, as falling ~3sd from the mean of the empirical distribution of insert sizes that BWA computes during mapping). So, example 1 gets flagged as a "proper pair" because it has an insert size of 148bp, while example 2 gets flagged as an "improper pair" even though the (true?) insert size is 147bp. Even worse, as the overlap between two mate pairs increases they start getting flagged as "proper" again, even though,depending on your experiment, these smaller inserts are exactly the ones you might want to remove. So, for example:
HS4_07580:1:1205:13102:38186#1 163 1 569368 46 75M = 569371 78
This mate pair is flagged as proper even though its true length is 78bp i.e. 69bp shorter than the mate pair in example 2 above
My questions are:
Is there a sensible reason for this behaviour?
Is this behaviour been documented anywhere (I can't find anything on the BWA manual page)?
Has anybody else noticed this?
Sorry if this is old news and has already been resolved on here.
Thanks for any and all help
Dan
I have a question about the "inferred insert size" field in the sam/bam files produced by BWA (this field is referred to as the "TLEN" field in the sam format spec).
It seems like BWA computes two different things depending on the extent to which two reads from a mate pair overlap. To illustrate, I'll
use two examples from 75bp PE RNA-seq data mapped to the genome with BWA version 0.5.9-r16. I've removed the fields after the TLEN field for clarity and
I use "insert size" to refer to the length of sequence between the adaptors.
Example 1:
HS4_07580:1:2304:6332:83264#1 99 1 1276418 60 75M = 1276491 148
Example 2:
HS4_07580:1:2103:1916:45225#1 97 1 793910 37 75M = 793838 3
1. Example 1 - inferred insert size is = (rightmost coordinate + read length) - leftmost coordinate : 1276491+75-1276418 = 148
This one makes sense to me, as we get the total length of the sequence between the two adaptors, barring any indels etc. This seems to be the behaviour
when mates overlap by <= 2bp or don't overlap at all.
2. In example 2, the inferred insert size is: (leftmost coordinate + read length) - rightmost coordinate : 793838+75-793910 = 3
That is, it gives the overlap of the two reads, which is not the same thing as the insert size. To me it seems like it should be
(793910 + 75) - 793838 = 147
This seems to be the behaviour when mates overlap by >=3bp
Perhaps this is not so bad, because you can compute the correct insert size yourself. However, BWA also calls relatively large inserts that overlap by a small amount as "improper pairs" (i.e. defined, I think, as falling ~3sd from the mean of the empirical distribution of insert sizes that BWA computes during mapping). So, example 1 gets flagged as a "proper pair" because it has an insert size of 148bp, while example 2 gets flagged as an "improper pair" even though the (true?) insert size is 147bp. Even worse, as the overlap between two mate pairs increases they start getting flagged as "proper" again, even though,depending on your experiment, these smaller inserts are exactly the ones you might want to remove. So, for example:
HS4_07580:1:1205:13102:38186#1 163 1 569368 46 75M = 569371 78
This mate pair is flagged as proper even though its true length is 78bp i.e. 69bp shorter than the mate pair in example 2 above
My questions are:
Is there a sensible reason for this behaviour?
Is this behaviour been documented anywhere (I can't find anything on the BWA manual page)?
Has anybody else noticed this?
Sorry if this is old news and has already been resolved on here.
Thanks for any and all help
Dan
Comment