  • samtools calmd problem

    Hello,

    I want to get the distribution of mismatch and indel rates along the read position from the accepted_hits.bam file generated by TopHat 1.3.3. First I ran the command 'samtools calmd -e accepted_hits.bam genome.fa >mdfile'. Each record in the generated mdfile has an MD field, which contains mismatch and indel information.

    The SAM Format Specification says: "For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string"
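
As a rough illustration of the format described in that quote, the sketch below splits an MD string into match, mismatch and deletion tokens. The names `MD_TOKEN` and `parse_md` are made up for this example and are not part of samtools. Zero-length match counts are dropped here, on the assumption that a 0 only separates adjacent events.

```python
import re

# One MD token is a match count, a deletion ("^" plus reference bases),
# or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")

def parse_md(md):
    """Split an MD string into ('match', n), ('mismatch', base), ('deletion', bases)."""
    ops = []
    pos = 0
    for m in MD_TOKEN.finditer(md):
        if m.start() != pos:
            raise ValueError(f"unexpected character in MD string at offset {pos}")
        pos = m.end()
        count, deleted, base = m.groups()
        if count is not None:
            if int(count) > 0:  # a 0 carries no matches, it is only a separator
                ops.append(("match", int(count)))
        elif deleted is not None:
            ops.append(("deletion", deleted))
        else:
            ops.append(("mismatch", base))
    if pos != len(md):
        raise ValueError(f"unexpected character in MD string at offset {pos}")
    return ops

print(parse_md("10A5^AC6"))
# [('match', 10), ('mismatch', 'A'), ('match', 5), ('deletion', 'AC'), ('match', 6)]
```

The bases reported in the mismatch and deletion tokens are reference bases, not read bases.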

    The example is easy to understand, but some of the MD strings in my mdfile confuse me. For example, here is a record from my mdfile:

    FCC00DYABXX:7:2201:4259:178522#CGATGTAT 97 chr14 19128808 255 21M3I3M3D16M1I31M4042N3M = 19128897 1167 =====================TGCGAA================T====C===========================TT BBDFABE@FBCDE:A>BCEGDCECBCDCBDDDFE<CAD>D>=A<@################################# NM:i:13 XS:A:+ NH:i:1 MD:Z:21T0G0C0^GAA20G27C0A0


    Could anyone explain this MD string for me? Any reply would be highly appreciated.
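
One possible reading of the MD string in this record, offered as a sketch rather than an authoritative decoding: MD only describes reference bases, so the inserted bases (3I and 1I in the CIGAR) never appear in it, and a 0 just separates two adjacent mismatches or a mismatch from a deletion. Read that way, 21T0G0C0^GAA20G27C0A0 is 21 matches, three consecutive mismatches where the reference has T, G and C, a deletion of GAA, 20 matches, a mismatch at a reference G, 27 matches, and two mismatches at reference C and A. The check below uses hypothetical helper names (`cigar_totals`, `md_totals`) and confirms that these counts agree with the CIGAR string and with NM:i:13.

```python
import re

# CIGAR and MD copied from the record above
cigar = "21M3I3M3D16M1I31M4042N3M"
md = "21T0G0C0^GAA20G27C0A0"

def cigar_totals(cigar):
    """Sum the lengths of each CIGAR operation."""
    totals = {}
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals

def md_totals(md):
    """Return (matched bases, mismatched bases, deleted bases) from an MD string."""
    matches = mismatches = deleted = 0
    for count, deletion, base in re.findall(r"(\d+)|\^([A-Z]+)|([A-Z])", md):
        if count:
            matches += int(count)
        elif deletion:
            deleted += len(deletion)
        else:
            mismatches += 1
    return matches, mismatches, deleted

c = cigar_totals(cigar)
matches, mismatches, deleted = md_totals(md)

# Aligned reference bases: CIGAR M total vs. MD matches + mismatches
print(c["M"], matches + mismatches)   # 74 74
# Deleted reference bases: CIGAR D total vs. MD deletion length
print(c["D"], deleted)                # 3 3
# Edit distance: mismatches + inserted bases + deleted bases, compare with NM:i:13
print(mismatches + c["I"] + c["D"])   # 13
```

The 4042N skip is an intron-like gap in the reference and is counted neither in MD nor in NM.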

  • #2
    Originally posted by dgtnk:
    Could anyone explain this MD string for me? Any reply would be highly appreciated.
    In the end I found the information I wanted in the sixth field (the CIGAR string).
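
Because `calmd -e` replaces every read base that matches the reference with `=`, the sixth field together with the SEQ column may be enough to locate events along the read. The sketch below walks the CIGAR string and records 0-based read positions. `read_position_events` is a hypothetical helper, and it assumes a forward-strand read: for reads with flag 0x10 set, positions would have to be counted from the other end to follow the sequencing direction.

```python
import re

def read_position_events(cigar, seq):
    """Return 0-based read positions of mismatches, inserted bases and deletions.

    Assumes seq comes from `samtools calmd -e`, so matching bases are '='.
    """
    mismatches, insertions, deletions = [], [], []
    read_pos = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in ("M", "=", "X"):
            # aligned bases: anything that is not '=' differs from the reference
            for i in range(read_pos, read_pos + n):
                if seq[i] != "=":
                    mismatches.append(i)
            read_pos += n
        elif op == "I":
            insertions.extend(range(read_pos, read_pos + n))
            read_pos += n
        elif op == "S":
            read_pos += n  # soft-clipped bases are in SEQ but not aligned
        elif op == "D":
            deletions.append(read_pos)  # deletion lies just before this read base
        # N, H and P consume no read bases and are not counted as indels here
    return mismatches, insertions, deletions

# CIGAR and SEQ copied from the record in the first post
cigar = "21M3I3M3D16M1I31M4042N3M"
seq = "=====================TGCGAA================T====C===========================TT"

mm, ins, dels = read_position_events(cigar, seq)
print("mismatches:", mm)
print("insertions:", ins)
print("deletions:", dels)
```

Accumulating these positions over all records, and dividing by the number of reads covering each position, would give the per-position rates the first post asks for.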
