View Single Post
Old 05-21-2013, 01:39 PM   #2
Junior Member
Location: Chicago

Join Date: May 2013
Posts: 2

So I've looked into this a little further and thought I'd just put up what I'm currently thinking in case anyone has something else to contribute.

It looks like while the bwa samse/sampe produces X0/X1 tags, the bwa mem/bwa bwasw do not produce such tags. Instead (and now just specifically referring to bwa mem), it looks like the XP tag is used to identify both 1) long reads that have split alignments and 2) reads that have multiple alignments overall. In the former situation (of a long read being split), both reads will have a XP tag. Using the -M flag will cause the shorter read of the two to be flagged as "the alignment is not primary" in the bitwise FLAG field (column 2). Not including the -M flag will leave both reads as appearing 'primary' (which I'm guessing is what causes the issues with Picard/GATK that the documentation mentions?). In the latter situation (of a read having multiple hits overall), you can see the other sequence hits using the -a flag. In these situations, the 'primary' read does not contain a XP tag but the sub-primary reads do.

Given this situation, I still think it isn't straight forward to determine whether a read is 'uniquely mapped' or not. I've read about using the MAPQ value as a way to determine this, e.g. a MAPQ of 0 means the read is not uniquely mapped, but I feel like I see examples where the MAPQ is >0 and there are multiple alignments for that read. For example:

HXBHIOF01CJXWW 256 X 94956752 0 23M1D5M2I33M6S * 0 0 * * NM:i:5 AS:i:36 XP:Z:10,+69647562,23M1D5M2I39M,22,5;
HXBHIOF01CJXWW 272 19 16193041 0 39M2I5M1D23M * 0 0 * * NM:i:7 AS:i:36 XP:Z:10,+69647562,23M1D5M2I39M,22,5;
HXBHIOF01CJXWW 256 3 169612909 0 23M1D5M2I39M * 0 0 * * NM:i:8 AS:i:35 XP:Z:10,+69647562,23M1D5M2I39M,22,5;
HXBHIOF01CJXWW 256 12 122144914 0 22M1D6M2I26M13S * 0 0 * * NM:i:4 AS:i:34 XP:Z:10,+69647562,23M1D5M2I39M,22,5;

The first read has a MAPQ of 22 even though it maps to other regions in the genome (all with MAPQ of 0, admittedly). Additionally this is not a long read that has been split (excluding the -a flag only produces the first read).

So I guess my current thinking is that I should still use a MAPQ threshold (10 or 30?), but that this may not produce a full set of 'uniquely mapping' reads?
mturchin20 is offline   Reply With Quote