Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Are intron size and softclipped reads included in TLEN field of SAM format?

    I would like to know whether I should include the intron size (CIGAR N) and softclipped reads (CIGAR S) in the TLEN field of SAM file.

    For example, if the positions and CIGAR strings of the two mates are as follows:

    Mate 1: pos 1000 CIGAR 25M10000N25M
    Mate 2: pos 11300 CIGAR 40M10S

    I can think of four different TLEN values

    1. 11300 - 1000 + 40 = 10340 (include intron)
    2. 11300 - 1000 + 40 + 10 = 10350 (include intron + softclip)
    3. 11300 - 1000 + 40 - 10000 = 340 (don't include intron or softclip)
    4. 11300 - 1000 + 40 + 10 - 10000 = 350 (include softclip)

    Thank you.

  • #2
    Here's how lh3 calculates it in bwa (from bwase.c) ... I've added comments ...

    Code:
    int64_t pos_end(const bwa_seq_t *p)  // traverse cigar to get to the other side
    {
            if (p->cigar) { // if cigar data, data in an array but not actually a string
                    int j;
                    int64_t x = p->pos;  // chromsome position
                    for (j = 0; j != p->n_cigar; ++j) {  // for each cigar operation in cigar data
                            int op = __cigar_op(p->cigar[j]);
    //operations, 0 is match, 2 is del , cigar_len is the length specified for the operation
    // operation ==3 is 'N' , should be here if BWA uses "skip" operation
                            if (op == 0 || op == 2) x += __cigar_len(p->cigar[j]);
                    }
                    return x;
            } else return p->pos + p->len;
    }
    static int64_t pos_5(const bwa_seq_t *p)
    {
            if (p->type != BWA_TYPE_NO_MATCH)
                    return p->strand? pos_end(p) : p->pos;  // note, adjust for strand
            return -1;
    }
    
    // actual assignment ...
    isize = (seqid == m_seqid)? pos_5(mate) - pos_5(p) : 0;
    
    Operations from bam.h are ...
    bam.h:#define BAM_CMATCH      0
    bam.h:#define BAM_CINS        1
    bam.h:#define BAM_CDEL        2
    bam.h:#define BAM_CREF_SKIP   3
    bam.h:#define BAM_CSOFT_CLIP  4
    bam.h:#define BAM_CHARD_CLIP  5
    bam.h:#define BAM_CPAD        6
    bam.h:#define BAM_CEQUAL        7
    bam.h:#define BAM_CDIFF        8
    The SAM specification says for TLEN ...
    TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the
    unsigned observed template length equals the number of bases from the leftmost mapped base
    to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a
    minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment
    template or when the information is unavailable.

    A template is ...
    Template A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled
    from raw sequences.

    Edit ... samtools FIXMATE has some code, it calls bam_calend()
    which DOES use SKIP
    Code:
    uint32_t bam_calend(const bam1_core_t *c, const uint32_t *cigar)
    {
            uint32_t k, end;
            end = c->pos;
            for (k = 0; k < c->n_cigar; ++k) {
                    int op = cigar[k] & BAM_CIGAR_MASK;
                    if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                            end += cigar[k] >> BAM_CIGAR_SHIFT;
            }
            return end;
    }
    
    // call to calend() is
    uint32_t cur5, pre5;
    cur5 = (cur->core.flag&BAM_FREVERSE)? bam_calend(&cur->core, bam1_cigar(cur)) : cur->core.pos;
    pre5 = (pre->core.flag&BAM_FREVERSE)? bam_calend(&pre->core, bam1_cigar(pre)) : pre->core.pos;
    cur->core.isize = pre5 - cur5; pre->core.isize = cur5 - pre5;
    Sooo .. use Match(M) Del(D) Skip(N) but Don't use Clip(S).
    Last edited by Richard Finney; 07-13-2012, 07:01 PM.

    Comment


    • #3
      Thank you for the reply.
      It seems like bowtie2-beta6 local mode includes softclip(S) in TLEN.
      Here's an example:

      GA1_0037_FC:1:1:15854:1042#0 83 chr10 96987440 21 40M = 96987319 -165 CAGTGTTGCTGACAAGGTGACACTGAACACAACAGTTNTC GHGHEHHGHHGEGGGEGDGAGAGGCGGGEGAABCB<<#<? AS:i:70 XS:i:38 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:29A7T2 YS:i:72 YT:Z:CP
      GA1_0037_FC:1:1:15854:1042#0 163 chr10 96987319 21 4S36M = 96987440 165 TTTTTTTGTAGTCTATACATTTATTGAGTAAAAACAAAAT IIIIIHDIIHIIIIHIIGIIIIIIIIHIHIECFFEDDGGD AS:i:72 XS:i:61 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:36 YS:i:70 YT:Z:CP

      pos1 96987319 CIGAR1 4S36M
      pos2 96987440 CIGAR2 40M
      TLEN = 96987440 - 96987319 + 40 + 4(softclip) = 165

      I guess most aligners will never need the softclip(S) in CIGAR.
      Do you think bowtie2 is inconsistent with SAM definition?
      Last edited by okyoon; 07-17-2012, 02:56 PM.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM
      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      24 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      25 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      21 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      52 views
      0 likes
      Last Post seqadmin  
      Working...
      X