Well, I continue to ask newbie questions, but hopefully they're getting at least a little more interesting over time....
The docs for "samtools view" say quite clearly that "An alignment may be given multiple times if it is overlapping several regions", and indeed that's true. Even when the regions don't themselves overlap, as in this example:
samtools view -b full.bam chr1:23995228-23995342 chr1:23995353-23995376 > filtered.bam
...if the ranges are close enough together, the space between them will be spanned by certain reads. Any read that touches both ranges gets included twice.
In this particular case, that's 25 duplicated reads. The entire sorted block of 25 reads gets included twice, which means that the resulting BAM has out-of-order reads. If it's out of order, it can't be indexed until it's resorted, which takes as much time as the filtering did in the first place. Even once you re-sort, you've still got duplicates. Alas, rmdup doesn't seem to have a mode for removing completely identical reads!
So how do you eliminate true read duplicates in your BAMs? I could "just" stream it through a UNIX-level command, but it would have to go through SAM and back to BAM, which requires a reference list and, presumably, a lot more time.
The docs for "samtools view" say quite clearly that "An alignment may be given multiple times if it is overlapping several regions", and indeed that's true. Even when the regions don't themselves overlap, as in this example:
samtools view -b full.bam chr1:23995228-23995342 chr1:23995353-23995376 > filtered.bam
...if the ranges are close enough together, the space between them will be spanned by certain reads. Any read that touches both ranges gets included twice.
In this particular case, that's 25 duplicated reads. The entire sorted block of 25 reads gets included twice, which means that the resulting BAM has out-of-order reads. If it's out of order, it can't be indexed until it's resorted, which takes as much time as the filtering did in the first place. Even once you re-sort, you've still got duplicates. Alas, rmdup doesn't seem to have a mode for removing completely identical reads!
So how do you eliminate true read duplicates in your BAMs? I could "just" stream it through a UNIX-level command, but it would have to go through SAM and back to BAM, which requires a reference list and, presumably, a lot more time.
Comment