  • SAM format MD tag with gaps in reference?

    Hi there,

    first post as I haven't been able to find an answer to this despite perusing the forum search and the SAM specification:

    I'm trying to wrap my head around the optional MD tag in SAM files because a tool in my processing pipeline relies on this tag. In theory it should allow me to call SNPs/indels without looking up the reference sequence for a read. An example MD tag from a file I'm dealing with is

    MD:Z:2A11G14G7G9^C3

    read: GAGGAACCTTACCAAGGCTTGACATGTAGCTGCAAGCGCACGGAAACGTGTG
    CIGAR: 32M1I5M1I10M1D3M

    Now while the sum of the CIGAR M/I/S/=/X operations correctly equals the length of the read (52 bases, 53 when also considering the deletion/gap at position 50), I only get to 51 reference bases when I attempt to (manually) reconstruct the reference from the MD tag alone:

    2A11G14G7G9^C3

    in a "decompressed" form becomes

    ==A===========G==============G=======G=========-===

    becomes the following reference sequence (first line) as compared to the true reference sequence (second line):

    GAAGAACCTTACCAGGGCTTGACATGTAGGTGCAAGCGCACGGAAACCTGT
    GAAGAACCTTACCAGGGCTTGACATGTAGGTG-AAGCG-GCGGAAACGTCGTG

    The difference in length as well as the shift in the sequence both seem to arise from the lack of a notation for the two gaps in the reference (positions 33 and 39).

    Now, am I just misunderstanding the MD tag? Do I always have to consider both the CIGAR string AND the MD tag to infer the reference sequence? Or is there a notation for gaps in the reference that I have simply overlooked in the SAM specification? What I've found so far is the regex of permitted characters on page 6:

    [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*

    and the footnote on page 7 claiming that the MD field ought to match the CIGAR string (which it obviously doesn't in my example).

    Thank you a lot for any insight and clarification!

    nZyMe

  • #2
    Actually, what I wrote above is wrong. MD tags don't store insertions, so yes, you need to look at both of them. The MD tag only stores information on the bases in the read that align to the reference, so insertions are ignored.
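
To make the "look at both of them" point concrete, here is a minimal Python sketch that rebuilds the aligned reference from the read, the CIGAR string and the MD tag together. The function name `reference_from_read` is made up for illustration (it is not part of samtools or any library), and the sketch assumes upper-case bases and a well-formed CIGAR/MD pair. It first uses the CIGAR to drop inserted and soft-clipped read bases, then walks the MD tag over what is left.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# One MD token: a match run, a deletion (^BASES), or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")


def reference_from_read(read, cigar, md):
    """Rebuild the reference bases covered by an alignment (hypothetical helper).

    Assumes upper-case bases and consistent CIGAR/MD strings.
    Regions skipped with CIGAR 'N' are not represented in MD and are not returned.
    """
    # Step 1: keep only the read bases that are aligned to a reference base.
    aligned = []
    pos = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":        # aligned bases: keep them
            aligned.append(read[pos:pos + n])
            pos += n
        elif op in "IS":       # insertions and soft clips: skip the read bases
            pos += n
        # D, N, H and P consume no read bases, so nothing to do here.
    aligned = "".join(aligned)

    # Step 2: walk the MD tag over the aligned read bases.
    ref = []
    i = 0
    for match_len, deleted, mismatch in MD_TOKEN.findall(md):
        if match_len:          # run of matches: reference equals the read
            n = int(match_len)
            ref.append(aligned[i:i + n])
            i += n
        elif deleted:          # bases deleted from the read: taken from MD
            ref.append(deleted)
        else:                  # mismatch: MD stores the reference base
            ref.append(mismatch)
            i += 1
    if i != len(aligned):
        raise ValueError("MD tag and CIGAR string disagree on aligned length")
    return "".join(ref)


read = "GAGGAACCTTACCAAGGCTTGACATGTAGCTGCAAGCGCACGGAAACGTGTG"
print(reference_from_read(read, "32M1I5M1I10M1D3M", "2A11G14G7G9^C3"))
```

For the read in the first post this gives the 51-base true reference quoted there, with the two gap characters removed. The two inserted read bases never show up in the MD tag, which is why the CIGAR string is needed to skip them.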


    • #3
      Thanks for your response, dryan!

      Assuming it is a bug in the aligner (very likely as it is unpublished, alpha-stage software): what should the MD tag look like to match the CIGAR string? I can't seem to find a notation for the missing two indels. Once I've found it I can post a bug report with the developer and fix the files with a custom script.
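
A quick way to test whether an MD tag agrees with its CIGAR string is to compare the number of reference bases each one covers. The sketch below is only an illustration: the helper names are invented, and it assumes that MD covers exactly the CIGAR M/=/X/D operations (insertions, clips and N skips are not in MD).

```python
import re


def cigar_ref_length(cigar):
    """Reference bases covered by M, =, X and D operations (hypothetical helper)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "M=XD")


def cigar_read_length(cigar):
    """Read bases covered by M, I, S, = and X operations (hypothetical helper)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MIS=X")


def md_ref_length(md):
    """Reference bases described by an MD tag (hypothetical helper)."""
    total = 0
    for num, deleted, mismatch in re.findall(r"(\d+)|\^([A-Z]+)|([A-Z])", md):
        if num:
            total += int(num)      # run of matching bases
        elif deleted:
            total += len(deleted)  # deleted reference bases after '^'
        else:
            total += 1             # single mismatched reference base
    return total


cigar, md = "32M1I5M1I10M1D3M", "2A11G14G7G9^C3"
print(cigar_read_length(cigar), cigar_ref_length(cigar), md_ref_length(md))
```

For the example in the first post, the CIGAR string covers 52 read bases and 51 reference bases, and the MD tag also describes 51 reference bases, so by this length check the two are consistent.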


      • #4
        Have a look at my most recent reply above (or the reply I just sent to the samtools list). My initial post was incorrect (that's what happens when I post before drinking my coffee!).


        • #5
          Thanks for the clarification! Too bad, I was hoping I could just skip the CIGAR string and focus on the MD tag alone.
