  • bwa MD and cigar fields inconsistency

    Hello,

    I have sequencing data where the position and frequency of mismatches play an important role in the downstream analysis. I generated short-read mappings using BWA. In the samse output file, the MD and CIGAR fields look inconsistent. As far as I can tell, the MD field is generated from the CIGAR field and should be consistent with it. Has anyone had the same problem?

    >less uniq_part_001.fastq.sam | cut -f 1-6,10,19 |grep -v "*"| head
    seqAAA_0 16 chr12 50113894 25 36M CGCCATCTGTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:1A34
    seqAAA_1 0 chr4 178545809 0 36M AAAAAAAAAAAAAAAAAAAAAAAAAAACATATGCCT MD:Z:34T1
    seqAAA_3 16 chr10 6381930 25 36M CAGAAGACGTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:8T27
    seqAAA_4 0 chr4 39577867 25 36M AAAAAAAAAAAAAAAAAAAAAAAAAAACGTTTGCCC MD:Z:27A8
    seqAAA_7 0 chr9 117092453 25 36M AAAAAAAAAAAAAAAAAAAAAAAAAAACTTATGCCC MD:Z:26C9
    seqAAA_8 16 chr20 17734163 25 36M CGGAATAAGTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:0A35
    seqAAA_10 0 chr8 112121071 0 36M AAAAAAAAAAAAAAAAAAAAAAAAAAACTTTAGCCG MD:Z:28A7
    seqAAA_11 16 chr2 66418968 0 36M GCATACCTCTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:36
    seqAAA_12 0 chr16 73425684 0 36M AAAAAAAAAAAAAAAAAAAAAAAAAAAGCTGGGACC MD:Z:34G1
    seqAAA_13 16 chr3 22762342 0 36M GGGGCATCCTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:1T34

    Biter Bilen

  • #2
    According to the latest (I think) SAM format spec, at http://samtools.sourceforge.net/SAM1.pdf, an M in the Cigar field is either "Match or mismatch" (sec. 2.2.3) -- as differentiated from an insertion or deletion. Only in the MD field do you get told whether it was a match or mismatch at a given position (see footnote 3 to the table in sec 2.2.4).

    So for your first read, the reference has an A in the 2nd position, where the read has a G. All other bases match.
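To make that reading of the MD field concrete, here is a minimal Python sketch that lists the mismatches in a read. It assumes the CIGAR is all M (no indels or clipping), as in the reads posted above, so read offsets and reference offsets line up. The function name `md_mismatches` is made up for illustration and is not part of BWA or samtools.

```python
import re

def md_mismatches(md, read_seq):
    """Return (1-based position, reference base, read base) per mismatch.

    Assumption: the alignment has no indels or clipping (CIGAR is all M),
    so a position in the MD string is also a position in the read.
    """
    if md.startswith("MD:Z:"):
        md = md[len("MD:Z:"):]
    if "^" in md:
        raise ValueError("deletions (^) are not handled in this sketch")
    mismatches = []
    pos = 0  # 0-based offset into the read
    # MD alternates runs of matching bases (numbers) with reference bases.
    for run, base in re.findall(r"(\d+)|([A-Za-z])", md):
        if run:
            pos += int(run)
        else:
            mismatches.append((pos + 1, base.upper(), read_seq[pos]))
            pos += 1
    return mismatches

# First read from the question: CIGAR 36M, MD:Z:1A34
print(md_mismatches("MD:Z:1A34", "CGCCATCTGTTTTTTTTTTTTTTTTTTTTTTTTTTT"))
```

For the first read this reports a single mismatch at position 2, with reference base A and read base G, while the CIGAR stays 36M.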

    Not sure why Cigar works that way -- historical reasons, probably. lh3 may know more.

    SillyPoint

    • #3
      Originally posted by SillyPoint
      According to the latest (I think) SAM format spec, at http://samtools.sourceforge.net/SAM1.pdf, an M in the Cigar field is either "Match or mismatch" (sec. 2.2.3) -- as differentiated from an insertion or deletion. Only in the MD field do you get told whether it was a match or mismatch at a given position (see footnote 3 to the table in sec 2.2.4).

      So for your first read, the reference has an A in the 2nd position, where the read has a G. All other bases match.

      Not sure why Cigar works that way -- historical reasons, probably. lh3 may know more.

      SillyPoint
      I believe protein-encoded CIGAR strings (the origin of CIGAR, or run-length encoding, in sequence formats) differentiate between match and mismatch, and there is discussion of changing the SAM format to do this as well (among many other things). For more information, check the SourceForge SAMtools developer mailing list, where this is an active area of discussion. Please comment on the format here and on the mailing lists so the community and developers can respond to your needs in any upcoming SAM format.

      • #4
        Someone told me CIGAR was first introduced in Ensembl/exonerate and was designed for nucleotide alignment. The original CIGAR contained only three operations, M/I/D, where M stands for alignment match and can be a sequence match or mismatch. SAM's CIGAR is an extension and so keeps M. We are in the middle of adding new operations to differentiate sequence match and mismatch.
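As a rough illustration of what separating sequence match from mismatch could look like, the sketch below rewrites an all-M alignment using the `=` (sequence match) and `X` (sequence mismatch) operators found in current versions of the SAM specification. It derives them from the MD string alone and assumes there are no indels or clipping. The helper name `m_to_eqx` is hypothetical, not an existing samtools function.

```python
import re
from itertools import groupby

def m_to_eqx(md):
    """Turn the MD string of an all-M alignment into an =/X CIGAR.

    Assumption: no insertions, deletions, or clipping, so every aligned
    base is either a sequence match (=) or a sequence mismatch (X).
    """
    if "^" in md:
        raise ValueError("deletions (^) are not handled in this sketch")
    ops = []
    for run, base in re.findall(r"(\d+)|([A-Za-z])", md):
        if run:
            ops.extend("=" * int(run))  # a run of matching bases
        else:
            ops.append("X")             # one mismatching base
    # Run-length encode consecutive identical operations.
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))

print(m_to_eqx("1A34"))  # first read in the thread, CIGAR 36M
print(m_to_eqx("36"))    # a read with no mismatches
```

Under these assumptions, `1A34` becomes `1=1X34=` and `36` becomes `36=`, so the mismatch information that M hides is carried in the CIGAR itself.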

        • #5
          Originally posted by lh3
          Someone told me CIGAR was first introduced in Ensembl/exonerate and was designed for nucleotide alignment. The original CIGAR contained only three operations, M/I/D, where M stands for alignment match and can be a sequence match or mismatch. SAM's CIGAR is an extension and so keeps M. We are in the middle of adding new operations to differentiate sequence match and mismatch.
          I guess I should question my sources. I think Guy Slater first introduced CIGAR for Exonerate? Looking at Exonerate, you are right: it only has M/I/D. Does anyone else know more?
