Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mismatch positions in BAM

    Does anyone know of a tool that can output the positions of mismatches for each alignment in a BAM file? I need the coordinates for the mismatch positions within the read as well as the coordinate in the reference sequence to which it aligns. I have found tools that can do one or the other (SAMStat, samtools mpileup, NGSUtils basecall), but they seem to give inconsistent results. Or will I have to extract the information from the CIGAR string or MD tag manually?

    Thanks!

  • #2
    Illumina reads are error-prone; most of the discrepancies you see will be artifacts. Are you trying to count artifacts?

    Comment


    • #3
      Yes, exactly. Both artifacts/errors as well as mismatches caused by anything else (e.g. true polymorphism, mapping error, etc.).

      Comment


      • #4
        I wrote a small java program for Biostars ( http://www.biostars.org/p/59647/ ) : see : https://github.com/lindenb/jvarkit/wiki/Biostar59647

        Code:
        $ java -jar dist/biostar59647.jar -r samtools-0.1.18/examples/toy.fa  samtools-0.1.18/examples/toy.bam |\
        xmllint --format - 
        
        <?xml version="1.0" encoding="UTF-8"?>
        <sam ref="/home/lindenb/samtools-0.1.18/examples/toy.bam" bam="/home/lindenb/samtools-0.1.18/examples/toy.fa">
          <!---r /home/lindenb/samtools-0.1.18/examples/toy.fa /home/lindenb/samtools-0.1.18/examples/toy.bam-->
          <read>
            <name>r001</name>
            <sequence>TTAGATAAAGAGGATACTG</sequence>
            <flags READ_PAIRED="true" READ_MAPPED_IN_PROPER_PAIR="true" READ_UNMAPPED="false" MATE_UNMAPPED="false" READ
        _REVERSE_STRAND="false" MATE_REVERSE_STRAND="true" FIRST_IN_PAIR="false" SECOND_IN_PAIR="true" NOT_PRIMARY_ALIGN
        MENT="false" READ_FAILS_VENDOR_QUALITY_CHECK="false" READ_IS_DUPLICATE="false" SUPPLEMENTARY_ALIGNMENT="false">1
        63</flags>
            <qual>30</qual>
            <chrom index="0">ref</chrom>
            <pos>7</pos>
            <cigar>8M4I4M1D3M</cigar>
            <mate-chrom index="0">ref</mate-chrom>
            <mate-pos>37</mate-pos>
            <align>
              <M read-index="1" read-base="T" ref-index="7" ref-base="T"/>
              <M read-index="2" read-base="T" ref-index="8" ref-base="T"/>
              <M read-index="3" read-base="A" ref-index="9" ref-base="A"/>
              <M read-index="4" read-base="G" ref-index="10" ref-base="G"/>
              <M read-index="5" read-base="A" ref-index="11" ref-base="A"/>
              <M read-index="6" read-base="T" ref-index="12" ref-base="T"/>
              <M read-index="7" read-base="A" ref-index="13" ref-base="A"/>
              <M read-index="8" read-base="A" ref-index="14" ref-base="A"/>
              <I read-index="9" read-base="A"/>
              <I read-index="10" read-base="G"/>
              <I read-index="11" read-base="A"/>
              <I read-index="12" read-base="G"/>
              <M read-index="13" read-base="G" ref-index="15" ref-base="G"/>
              <M read-index="14" read-base="A" ref-index="16" ref-base="A"/>
              <M read-index="15" read-base="T" ref-index="17" ref-base="T"/>
              <M read-index="16" read-base="A" ref-index="18" ref-base="A"/>
              <D ref-index="19" ref-base="G"/>
              <M read-index="17" read-base="C" ref-index="20" ref-base="C"/>
              <M read-index="18" read-base="T" ref-index="21" ref-base="T"/>
              <M read-index="19" read-base="G" ref-index="22" ref-base="G"/>
            </align>
          </read>
        (...)

        Comment


        • #5
          This is perfect. Thanks so much!

          Comment


          • #6
            if you don't want to parse the XML , you might also like that tool: https://github.com/lindenb/jvarkit/wiki/SAM2Tsv

            Code:
            $ java -jar dist/sam2tsv.jar -A  \
                -r samtools-0.1.18/examples/toy.fa 
                  samtools-0.1.18/examples/toy.sam
            
            r001	ref	0	T	7	T	M
            r001	ref	1	T	8	T	M
            r001	ref	2	A	9	A	M
            r001	ref	3	G	10	G	M
            r001	ref	4	A	11	A	M
            r001	ref	5	T	12	T	M
            r001	ref	6	A	13	A	M
            r001	ref	7	A	14	A	M
            r001	ref	8	A	.	.	I
            r001	ref	9	G	.	.	I
            r001	ref	10	A	.	.	I
            r001	ref	11	G	.	.	I
            r001	ref	12	G	15	G	M
            r001	ref	13	A	16	A	M
            r001	ref	14	T	17	T	M
            r001	ref	15	A	18	A	M
            r001	ref	.	.	19	G	D
            r001	ref	16	C	20	C	M
            r001	ref	17	T	21	T	M
            r001	ref	18	G	22	G	M
            :   ref        7 TTAGATAAAGAGGATA-CTG 22      
            :                ||||||||    |||| |||
            :  r001        1 TTAGATAA----GATAGCTG 19      
            r002	ref	0	A	.	.	S
            r002	ref	1	A	.	.	I
            r002	ref	2	A	.	.	I
            r002	ref	3	A	9	A	M
            r002	ref	4	G	10	G	M

            Comment


            • #7
              Great. I was wondering if it is possible for this tool to also extract the location of deletions in the reads. I see that for insertions, it reports the location of the insertion in the read but not in the reference. I realize that if the sequence is not present in the read, then there can't really be an index of that position, but the position after which the deletion occurs would be really helpful. Maybe there is another way to get this information from the CIGAR string?

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                Yesterday, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              58 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              53 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              45 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              55 views
              0 likes
              Last Post seqadmin  
              Working...
              X