SEQanswers Are intron size and softclipped reads included in TLEN field of SAM format?
 User Name Remember Me? Password
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post tomjan Bioinformatics 9 04-26-2012 07:08 AM Bio.X2Y Bioinformatics 3 04-25-2012 04:26 AM foolishbrat General 2 06-08-2011 12:38 PM andylai Bioinformatics 1 05-17-2011 02:09 AM aiden Bioinformatics 3 05-27-2010 06:10 PM

 07-13-2012, 03:09 PM #1 okyoon Junior Member   Location: Berkeley, CA Join Date: Sep 2008 Posts: 3 Are intron size and softclipped reads included in TLEN field of SAM format? I would like to know whether I should include the intron size (CIGAR N) and softclipped reads (CIGAR S) in the TLEN field of SAM file. For example, if the positions and CIGAR strings of the two mates are as follows: Mate 1: pos 1000 CIGAR 25M10000N25M Mate 2: pos 11300 CIGAR 40M10S I can think of four different TLEN values 1. 11300 - 1000 + 40 = 10340 (include intron) 2. 11300 - 1000 + 40 + 10 = 10350 (include intron + softclip) 3. 11300 - 1000 + 40 - 10000 = 340 (don't include intron or softclip) 4. 11300 - 1000 + 40 + 10 - 10000 = 350 (include softclip) Thank you.
 07-13-2012, 06:34 PM #2 Richard Finney Senior Member   Location: bethesda Join Date: Feb 2009 Posts: 700 Here's how lh3 calculates it in bwa (from bwase.c) ... I've added comments ... Code: ```int64_t pos_end(const bwa_seq_t *p) // traverse cigar to get to the other side { if (p->cigar) { // if cigar data, data in an array but not actually a string int j; int64_t x = p->pos; // chromsome position for (j = 0; j != p->n_cigar; ++j) { // for each cigar operation in cigar data int op = __cigar_op(p->cigar[j]); //operations, 0 is match, 2 is del , cigar_len is the length specified for the operation // operation ==3 is 'N' , should be here if BWA uses "skip" operation if (op == 0 || op == 2) x += __cigar_len(p->cigar[j]); } return x; } else return p->pos + p->len; } static int64_t pos_5(const bwa_seq_t *p) { if (p->type != BWA_TYPE_NO_MATCH) return p->strand? pos_end(p) : p->pos; // note, adjust for strand return -1; } // actual assignment ... isize = (seqid == m_seqid)? pos_5(mate) - pos_5(p) : 0; Operations from bam.h are ... bam.h:#define BAM_CMATCH 0 bam.h:#define BAM_CINS 1 bam.h:#define BAM_CDEL 2 bam.h:#define BAM_CREF_SKIP 3 bam.h:#define BAM_CSOFT_CLIP 4 bam.h:#define BAM_CHARD_CLIP 5 bam.h:#define BAM_CPAD 6 bam.h:#define BAM_CEQUAL 7 bam.h:#define BAM_CDIFF 8``` The SAM specification says for TLEN ... TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable. A template is ... Template A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences. Edit ... samtools FIXMATE has some code, it calls bam_calend() which DOES use SKIP Code: ```uint32_t bam_calend(const bam1_core_t *c, const uint32_t *cigar) { uint32_t k, end; end = c->pos; for (k = 0; k < c->n_cigar; ++k) { int op = cigar[k] & BAM_CIGAR_MASK; if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP) end += cigar[k] >> BAM_CIGAR_SHIFT; } return end; } // call to calend() is uint32_t cur5, pre5; cur5 = (cur->core.flag&BAM_FREVERSE)? bam_calend(&cur->core, bam1_cigar(cur)) : cur->core.pos; pre5 = (pre->core.flag&BAM_FREVERSE)? bam_calend(&pre->core, bam1_cigar(pre)) : pre->core.pos; cur->core.isize = pre5 - cur5; pre->core.isize = cur5 - pre5;``` Sooo .. use Match(M) Del(D) Skip(N) but Don't use Clip(S). Last edited by Richard Finney; 07-13-2012 at 07:01 PM.
 07-17-2012, 02:47 PM #3 okyoon Junior Member   Location: Berkeley, CA Join Date: Sep 2008 Posts: 3 Thank you for the reply. It seems like bowtie2-beta6 local mode includes softclip(S) in TLEN. Here's an example: GA1_0037_FC:1:1:15854:1042#0 83 chr10 96987440 21 40M = 96987319 -165 CAGTGTTGCTGACAAGGTGACACTGAACACAACAGTTNTC GHGHEHHGHHGEGGGEGDGAGAGGCGGGEGAABCB<<#