  • interpreting the CIGAR in the SAM format

    I am confused about some of the CIGAR conventions. For example, the following line has the CIGAR 14M1D24M12S:

    HWI-ST156:39709NJACXX:8:1101:6153:156173 99 2L 10412529 40 14M1D24M12S = 10412564 84 CTCAGATCCATCGGTTTTCCATCTTGGTCAGGGTATCTACCATAGAAGAA @@CFFFFFHGHHHJFHIIIJJHHIIJIFHIJJJ?DGDHGHIJGFIGGIHG MD:Z:14^T24 NH:i:1 NM:i:0 SM:i:40

    It looks to me like this means 14 matched (or mismatched) bases (CTCAGATCCATCGG), then a deletion of 1 base, which isn't shown, then 24 matched (or mismatched) bases (TTTTCCATCTTGGTCAGGGTATCT), and then 12 "soft-clipped" bases (ACCATAGAAGAA). I take that to mean the last 12 bases aren't matched (or mismatched) but are still listed because the clipping is "soft" (if it had been "hard" clipped, they wouldn't have been listed).
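The reading above can be checked mechanically. The following is a minimal sketch, not taken from any SAM library: the helper name `split_read_by_cigar` is hypothetical, and it assumes only the rule from the SAM specification that M, I, S, = and X consume bases of SEQ, while D, N, H and P do not.

```python
import re

# Operations that consume bases of the read (SEQ), per the SAM specification.
# D, N, H and P consume no read bases, so hard-clipped (H) bases never appear in SEQ.
CONSUMES_QUERY = set("MIS=X")


def split_read_by_cigar(cigar, seq):
    """Return (length, op, bases) tuples; bases is '' when the op uses no read bases."""
    segments = []
    pos = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in CONSUMES_QUERY:
            segments.append((n, op, seq[pos:pos + n]))
            pos += n
        else:
            segments.append((n, op, ""))
    return segments


cigar = "14M1D24M12S"
seq = "CTCAGATCCATCGGTTTTCCATCTTGGTCAGGGTATCTACCATAGAAGAA"
for n, op, bases in split_read_by_cigar(cigar, seq):
    print(n, op, bases or "(no read bases)")
```

For this record the four pieces come out as the 14 M bases, the 1-base deletion with no read bases, the 24 M bases, and the 12 soft-clipped bases.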



    Here is another CIGAR: 4S38M84N8M

    HWI-ST156:39709NJACXX:8:1101:2688:151507 163 2L 21627720 40 4S38M84N8M = 21627875 125 CCTTATCCACCTTCCGCTTTACAGCCTCAATGGCGGGAGCATCTGTTGAG CCCFDEFDDFHHHIJJIJJJJGICHIIJIGGFHGHIJ6=4=3777@C7;; MD:Z:46 NH:i:1 NM:i:0 SM:i:40 XS:A:-

    I interpret this as 4 "soft-clipped" bases (CCTT), meaning that these bases aren't matched, but nonetheless they are listed, and then 38 matched (or mismatched) bases (ATCCACCTTCCGCTTTACAGCCTCAATGGCGGGAGCAT) and then skipping over 84 bases (not listed) and then 8 matched (or mismatched) bases (CTGTTGAG).
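The skipped region matters for where the read ends on the reference. As a rough sketch (the function name `reference_span` is hypothetical, and the only assumption is the SAM specification's rule that M, D, N, = and X consume reference positions, while POS is the 1-based position of the first aligned base, not of the soft-clipped ones):

```python
import re

# Operations that advance along the reference, per the SAM specification.
CONSUMES_REFERENCE = set("MDN=X")


def reference_span(cigar):
    """Number of reference positions covered by the alignment."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in CONSUMES_REFERENCE
    )


pos = 21627720                      # POS field of the record above
span = reference_span("4S38M84N8M")
end = pos + span - 1                # last reference position covered
print(span, end)                    # 130 21627849
```

The 4 soft-clipped bases add nothing to the span, whereas the 84 skipped bases (N) do, even though no read bases correspond to them.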


    But now, here is another CIGAR 16M3I26M5S that comes from this line:

    HWI-ST156:39709NJACXX:8:1101:2701:84150 163 2L 19400984 40 16M3I26M5S GCTTG = 19401075 149 GGTGTACAGGTGTGTGTGTGGGTGGGGGGGGGGTTGAGTGGGGGC ?7;DBDDFGD=C<E<CCFAF+BFEG<F####################### MD:Z:19G7T14 NH:i:1 NM:i:2 SM:i:40

    I would think that this means that the sequence starts with 16 matched (or mismatched) bases (GGTGTACAGGTGTGTG), then an insertion of 3 bases (TGT), then 26 matched (or mismatched) bases (GGGTGGGGGGGGGGTTGAGTGGGGGC) and then 5 "soft-clipped" bases, meaning that these bases are listed but not matched. BUT THERE ARE NOT 5 SUCH "SOFT-CLIPPED" BASES.

    Can someone help me understand these CIGAR conventions? Among other things, I would like to know exactly what "hard-clipped" and "soft-clipped" mean.

    Thank you.

    Eric

  • #2
    This is strange.

    You are right: with 5S at the end of the CIGAR, the read should contain 5 more bases (in this case at its end).

    The funny thing is that your read and quality string lengths don't match
    (read length = 45, quality length = 50).

    I think there is a bug here, isn't there?



    • #3
      The SAM spec is very clear that the length of SEQ (and of QUAL, if present) should equal the sum of the lengths of the CIGAR M/I/S/=/X operations. This is something Picard will check and complain about.

      There does seem to be something wrong with your SAM file - where did it come from (which tool and which version)?
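A quick way to apply that rule to a single record is sketched below. This is an illustrative check only, not what Picard runs; the function names `expected_seq_length` and `check_record` are hypothetical, and the record fields are copied from the third example in the question.

```python
import re


def expected_seq_length(cigar):
    """Sum of the lengths of the M/I/S/=/X operations in a CIGAR string."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "MIS=X"
    )


def check_record(cigar, seq, qual):
    """Return a list of length inconsistencies; '*' means the field is absent."""
    problems = []
    expected = expected_seq_length(cigar)
    if seq != "*" and len(seq) != expected:
        problems.append(f"SEQ has {len(seq)} bases, CIGAR implies {expected}")
    if seq != "*" and qual != "*" and len(qual) != len(seq):
        problems.append(f"QUAL has {len(qual)} values, SEQ has {len(seq)} bases")
    return problems


cigar = "16M3I26M5S"
seq = "GGTGTACAGGTGTGTGTGTGGGTGGGGGGGGGGTTGAGTGGGGGC"
qual = "?7;DBDDFGD=C<E<CCFAF+BFEG<F#######################"
for problem in check_record(cigar, seq, qual):
    print(problem)
```

An empty list means the lengths agree; for the record above the CIGAR implies 50 bases while SEQ has only 45.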
