Hi there,
first post as I haven't been able to find an answer to this despite perusing the forum search and the SAM specification:
I'm trying to wrap my head around the optional MD tag in SAM files because a tool in my processing pipeline relies on this tag. In theory it should allow me to call SNPs/indels without looking up the reference sequence for a read. An example MD tag from a file I'm dealing with is
MD:Z:2A11G14G7G9^C3
read: GAGGAACCTTACCAAGGCTTGACATGTAGCTGCAAGCGCACGGAAACGTGTG
CIGAR: 32M1I5M1I10M1D3M
Now while the sum of the CIGAR M/I/S/=/X operations correctly equals the length of the read (52 bases, 53 when also considering the deletion/gap at position 50), I only get to 51 reference bases when I attempt to (manually) reconstruct the reference from the MD tag alone:
2A11G14G7G9^C3
in a "decompressed" form becomes
==A===========G==============G=======G=========-===
becomes the following reference sequence (first line) as compared to the true reference sequence (second line):
GAAGAACCTTACCAGGGCTTGACATGTAGGTGCAAGCGCACGGAAACCTGT
GAAGAACCTTACCAGGGCTTGACATGTAGGTG-AAGCG-GCGGAAACGTCGTG
The difference in length as well as the shift in the sequence both seem arise from the lack of a notation for the two gaps in the reference (positions 33 and 39).
Now, am I just misunderstanding the MD tag? Do I always have to consider both the CIGAR string AND the MD tag to infer the reference sequence? Or is there a notation for gaps in the reference that I simply have overlooked in the SAM specification? What I've found so far is the Regex or permitted characters on page 6:
[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
and the footnote on page 7 claiming that the MD field ought to match the CIGAR string (which it obviously doesn't in my example).
Thank you a lot for any insight and clarification!
nZyMe
first post as I haven't been able to find an answer to this despite perusing the forum search and the SAM specification:
I'm trying to wrap my head around the optional MD tag in SAM files because a tool in my processing pipeline relies on this tag. In theory it should allow me to call SNPs/indels without looking up the reference sequence for a read. An example MD tag from a file I'm dealing with is
MD:Z:2A11G14G7G9^C3
read: GAGGAACCTTACCAAGGCTTGACATGTAGCTGCAAGCGCACGGAAACGTGTG
CIGAR: 32M1I5M1I10M1D3M
Now while the sum of the CIGAR M/I/S/=/X operations correctly equals the length of the read (52 bases, 53 when also considering the deletion/gap at position 50), I only get to 51 reference bases when I attempt to (manually) reconstruct the reference from the MD tag alone:
2A11G14G7G9^C3
in a "decompressed" form becomes
==A===========G==============G=======G=========-===
becomes the following reference sequence (first line) as compared to the true reference sequence (second line):
GAAGAACCTTACCAGGGCTTGACATGTAGGTGCAAGCGCACGGAAACCTGT
GAAGAACCTTACCAGGGCTTGACATGTAGGTG-AAGCG-GCGGAAACGTCGTG
The difference in length as well as the shift in the sequence both seem arise from the lack of a notation for the two gaps in the reference (positions 33 and 39).
Now, am I just misunderstanding the MD tag? Do I always have to consider both the CIGAR string AND the MD tag to infer the reference sequence? Or is there a notation for gaps in the reference that I simply have overlooked in the SAM specification? What I've found so far is the Regex or permitted characters on page 6:
[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
and the footnote on page 7 claiming that the MD field ought to match the CIGAR string (which it obviously doesn't in my example).
Thank you a lot for any insight and clarification!
nZyMe
Comment