SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
SAM format. Strange encoding in QUAL field -no ASCII tomjan Bioinformatics 9 04-26-2012 07:08 AM
SAM Format - SEQ field '=' Bio.X2Y Bioinformatics 3 04-25-2012 04:26 AM
Counting Reads in SAM Format foolishbrat General 2 06-08-2011 12:38 PM
Looking process to convert gff3 format into ace format or sam format andylai Bioinformatics 1 05-17-2011 02:09 AM
SAM flag field and removing unmapped reads from BFAST output aiden Bioinformatics 3 05-27-2010 06:10 PM

Reply
 
Thread Tools
Old 07-13-2012, 03:09 PM   #1
okyoon
Junior Member
 
Location: Berkeley, CA

Join Date: Sep 2008
Posts: 3
Default Are intron size and softclipped reads included in TLEN field of SAM format?

I would like to know whether I should include the intron size (CIGAR N) and softclipped reads (CIGAR S) in the TLEN field of SAM file.

For example, if the positions and CIGAR strings of the two mates are as follows:

Mate 1: pos 1000 CIGAR 25M10000N25M
Mate 2: pos 11300 CIGAR 40M10S

I can think of four different TLEN values

1. 11300 - 1000 + 40 = 10340 (include intron)
2. 11300 - 1000 + 40 + 10 = 10350 (include intron + softclip)
3. 11300 - 1000 + 40 - 10000 = 340 (don't include intron or softclip)
4. 11300 - 1000 + 40 + 10 - 10000 = 350 (include softclip)

Thank you.
okyoon is offline   Reply With Quote
Old 07-13-2012, 06:34 PM   #2
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

Here's how lh3 calculates it in bwa (from bwase.c) ... I've added comments ...

Code:
int64_t pos_end(const bwa_seq_t *p)  // traverse cigar to get to the other side
{
        if (p->cigar) { // if cigar data, data in an array but not actually a string
                int j;
                int64_t x = p->pos;  // chromsome position
                for (j = 0; j != p->n_cigar; ++j) {  // for each cigar operation in cigar data
                        int op = __cigar_op(p->cigar[j]);
//operations, 0 is match, 2 is del , cigar_len is the length specified for the operation
// operation ==3 is 'N' , should be here if BWA uses "skip" operation
                        if (op == 0 || op == 2) x += __cigar_len(p->cigar[j]);
                }
                return x;
        } else return p->pos + p->len;
}
static int64_t pos_5(const bwa_seq_t *p)
{
        if (p->type != BWA_TYPE_NO_MATCH)
                return p->strand? pos_end(p) : p->pos;  // note, adjust for strand
        return -1;
}

// actual assignment ...
isize = (seqid == m_seqid)? pos_5(mate) - pos_5(p) : 0;

Operations from bam.h are ...
bam.h:#define BAM_CMATCH      0
bam.h:#define BAM_CINS        1
bam.h:#define BAM_CDEL        2
bam.h:#define BAM_CREF_SKIP   3
bam.h:#define BAM_CSOFT_CLIP  4
bam.h:#define BAM_CHARD_CLIP  5
bam.h:#define BAM_CPAD        6
bam.h:#define BAM_CEQUAL        7
bam.h:#define BAM_CDIFF        8
The SAM specification says for TLEN ...
TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the
unsigned observed template length equals the number of bases from the leftmost mapped base
to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a
minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment
template or when the information is unavailable.

A template is ...
Template A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled
from raw sequences.

Edit ... samtools FIXMATE has some code, it calls bam_calend()
which DOES use SKIP
Code:
uint32_t bam_calend(const bam1_core_t *c, const uint32_t *cigar)
{
        uint32_t k, end;
        end = c->pos;
        for (k = 0; k < c->n_cigar; ++k) {
                int op = cigar[k] & BAM_CIGAR_MASK;
                if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                        end += cigar[k] >> BAM_CIGAR_SHIFT;
        }
        return end;
}

// call to calend() is
uint32_t cur5, pre5;
cur5 = (cur->core.flag&BAM_FREVERSE)? bam_calend(&cur->core, bam1_cigar(cur)) : cur->core.pos;
pre5 = (pre->core.flag&BAM_FREVERSE)? bam_calend(&pre->core, bam1_cigar(pre)) : pre->core.pos;
cur->core.isize = pre5 - cur5; pre->core.isize = cur5 - pre5;
Sooo .. use Match(M) Del(D) Skip(N) but Don't use Clip(S).

Last edited by Richard Finney; 07-13-2012 at 07:01 PM.
Richard Finney is offline   Reply With Quote
Old 07-17-2012, 02:47 PM   #3
okyoon
Junior Member
 
Location: Berkeley, CA

Join Date: Sep 2008
Posts: 3
Default

Thank you for the reply.
It seems like bowtie2-beta6 local mode includes softclip(S) in TLEN.
Here's an example:

GA1_0037_FC:1:1:15854:1042#0 83 chr10 96987440 21 40M = 96987319 -165 CAGTGTTGCTGACAAGGTGACACTGAACACAACAGTTNTC GHGHEHHGHHGEGGGEGDGAGAGGCGGGEGAABCB<<#<? AS:i:70 XS:i:38 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:29A7T2 YS:i:72 YT:Z:CP
GA1_0037_FC:1:1:15854:1042#0 163 chr10 96987319 21 4S36M = 96987440 165 TTTTTTTGTAGTCTATACATTTATTGAGTAAAAACAAAAT IIIIIHDIIHIIIIHIIGIIIIIIIIHIHIECFFEDDGGD AS:i:72 XS:i:61 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:36 YS:i:70 YT:Z:CP

pos1 96987319 CIGAR1 4S36M
pos2 96987440 CIGAR2 40M
TLEN = 96987440 - 96987319 + 40 + 4(softclip) = 165

I guess most aligners will never need the softclip(S) in CIGAR.
Do you think bowtie2 is inconsistent with SAM definition?

Last edited by okyoon; 07-17-2012 at 02:56 PM.
okyoon is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:09 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO