  • In the SAM format, how is the POS field affected by insertions and deletions?

    Hi all,

    In the Sequence Alignment/Map Format Specification, the POS field is defined as the "1-based leftmost mapping POSition of the first matching base." As I understand it, this is the reference position at which the first alignment match (M, which can be a sequence match or a mismatch) occurs. Am I right? If so, I infer (I didn't find any such case in my own data) that if the CIGAR is like 10S1D139M, I mean if the first operator after the clip (H or S) is D (deletion), so the POS should be one after the position where the deletion happens?

    Actually, I am trying to develop a tool that handles read "overlap" in paired-end data. As you know, the two reads of a pair overlap when the fragment is short relative to the read length of the sequencer. In the overlap region we can do more: for example, if the two overlapping bases are not the same, we can treat the mapping at this position as not good enough and reduce the base quality, or something similar.

    Now, the problem is that I need to find which bases in the two reads (forward and reverse) come from the same position in the reference sequence. This is easy if all bases are alignment matches (M), with no insertions (I) or deletions (D). Otherwise, finding the overlapping bases gets a bit complicated, because insertions and deletions have to be taken into account. For example, if one read has an insertion at some position, then to check the overlap we need to compare the base after the insertion with the base at the insertion position on the other read of the pair (assuming the other read has no insertion there). This is just one simple case; I need to find all overlapping bases under these conditions and check whether they are equal.
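One way to do the bookkeeping described above is to assign every base in each read a reference position by walking the CIGAR, and then compare the two mates wherever they share a position. The following is only a minimal sketch under stated assumptions: the helper names `ref_positions` and `overlap_mismatches` are hypothetical, the inputs are the POS, CIGAR and SEQ fields taken straight from two SAM records on the same reference, and inserted or soft-clipped bases are simply skipped rather than compared.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def ref_positions(pos, cigar):
    """Return one entry per base in SEQ: its 1-based reference position
    for aligned bases (M, =, X), or None for inserted/soft-clipped bases."""
    positions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":      # consumes both read and reference
            positions.extend(range(ref, ref + n))
            ref += n
        elif op in "IS":     # consumes read only
            positions.extend([None] * n)
        elif op in "DN":     # consumes reference only
            ref += n
        # H and P consume neither read nor reference
    return positions

def overlap_mismatches(pos1, cigar1, seq1, pos2, cigar2, seq2):
    """List (ref_pos, base1, base2) where both mates cover ref_pos but disagree."""
    by_ref = {p: b for p, b in zip(ref_positions(pos1, cigar1), seq1)
              if p is not None}
    mismatches = []
    for p, b in zip(ref_positions(pos2, cigar2), seq2):
        if p is not None and p in by_ref and by_ref[p] != b:
            mismatches.append((p, by_ref[p], b))
    return mismatches

# Mate 2 carries a 1-base insertion; its aligned bases still line up by position.
print(overlap_mismatches(100, "5M", "ACGTA", 102, "1M1I3M", "GGCAC"))
# [(103, 'T', 'C')]
```

Because SEQ in a SAM record is stored in reference orientation for both mates, no reverse-complementing is needed before the comparison. Adjusting base qualities at the reported positions is left out of the sketch.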

    If anybody has a better solution and is willing to share it, I would appreciate it very much. Thank you.

    bless~

  • #2
    "I mean if the first operator after the clip (H or S) is D (deletion), so the POS should be one after the position where the deletion happens?"

    YES. The position starts at the first M/=.


    • #3
      Originally posted by lindenb
      "I mean if the first operator after the clip (H or S) is D (deletion), so the POS should be one after the position where the deletion happens?"

      YES. The position starts at the first M/=.

      Hi, thanks for your quick reply.

      In my understanding of the SAM specification, M covers both = and X, which means an alignment match can be a sequence match or a mismatch. Please correct me if I am wrong.

      bless~


      • #4
        Correct, POS refers to the first of M=X, though you rarely see X or = in real life.


        • #5
          This came up in discussion on the samtools-dev mailing list; I think James Bonfield constructed some good examples... Evidently the spec needs a bit more clarification here?


          • #6
            Do you remember when that came up? I thought I recalled that but couldn't find it with some quick searching.


            • #7
              Originally posted by dpryan
              Correct, POS refers to the first of M=X, though you rarely see X or = in real life.

              I see it all the time, since BBMap outputs those by default.

              "10S1D139M" is not a CIGAR string that should ever be produced. "D" should be internal to "M/X/=". If you see a read that violates this, just throw it away; it's nonsense.

              "I" is a bit more tricky; it can occur and be valid at the ends. In that case, those bases should be ignored with respect to the POS, just like "S" bases.
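The two rules above can be checked mechanically. This is a rough sketch, not taken from BBMap or any other tool; the function names `bases_before_pos` and `has_edge_deletion` are made up for illustration, and the CIGAR is assumed to be a well-formed string using the standard operators.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def bases_before_pos(cigar):
    """Count the leading read bases (S or I) that come before the first
    M/=/X base, i.e. the bases that are ignored with respect to POS."""
    count = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in "M=X":
            break
        if op in "SI":
            count += int(length)
    return count

def has_edge_deletion(cigar):
    """True if, once clipping (S/H) is stripped, the CIGAR starts or ends
    with D, the case described above as one that should never be produced."""
    ops = [op for _, op in CIGAR_OP.findall(cigar) if op not in "SH"]
    return bool(ops) and (ops[0] == "D" or ops[-1] == "D")

print(bases_before_pos("2S3I95M"))     # 5: POS belongs to the sixth base of SEQ
print(has_edge_deletion("10S1D139M"))  # True: candidate for discarding
print(has_edge_deletion("5M2D5M"))     # False: the deletion is internal
```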

              Incidentally, I wrote another tool, BBMerge, which can merge paired reads, and adjusts the quality of overlapping bases to reflect whether or not they match. It's mainly used for merging reads by overlap, but it can merge based on mapping locations also, if you use it like this (the example assumes interleaved reads but they can be in two files also):

              bbmap.sh ref=reference.fasta in=reads.fastq outm=mapped.fastq pairedonly renamebymapping pairlen=800
              bbmerge.sh in=mapped.fastq out=merged.fq usemapping parsecustom


              Sorry, BBMerge does not currently work with SAM/BAM files; it requires custom headers on reads to merge them based on mapping data. The first step maps the reads and adds the custom headers. The "pairlen" flag when mapping restricts the maximum distance between paired reads. If the reads overlap, this distance is negative; if they don't, it is positive. Insert size = (pairlen + read1 length + read2 length).
              Last edited by Brian Bushnell; 01-09-2015, 11:10 AM.
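As a quick illustration of the insert-size relation stated in the post above, here is a tiny sketch. The function name `insert_size` is hypothetical and the numbers are made-up examples; only the formula itself comes from the post.

```python
def insert_size(pairlen, read1_len, read2_len):
    """Insert size per the relation above: pairlen + read1 length + read2 length.
    pairlen is negative when the two reads overlap."""
    return pairlen + read1_len + read2_len

# Two 150 bp reads overlapping by 50 bases (pairlen = -50).
print(insert_size(-50, 150, 150))  # 250

# Two 150 bp reads separated by a 500-base gap (pairlen = 500).
print(insert_size(500, 150, 150))  # 800
```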


              • #8
                Originally posted by dpryan
                Correct, POS refers to the first of M=X, though you rarely see X or = in real life.

                Great, thanks, that really clears up my confusion.
