SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
to claculte % deletion and insertion after mapping. vishwesh Bioinformatics 2 01-09-2015 09:23 AM
how does insertion/deletion detected? penolove Bioinformatics 0 11-18-2013 07:31 PM
Are intron size and softclipped reads included in TLEN field of SAM format? okyoon Bioinformatics 2 07-17-2012 02:47 PM
SAM format. Strange encoding in QUAL field -no ASCII tomjan Bioinformatics 9 04-26-2012 07:08 AM
SAM Format - SEQ field '=' Bio.X2Y Bioinformatics 3 04-25-2012 04:26 AM

Reply
 
Thread Tools
Old 01-08-2015, 07:30 PM   #1
blaboon
Bioinformatics Specialist
 
Location: Singapore

Join Date: Nov 2014
Posts: 11
Default In the SAM format, how the POS field affected by insertion and deletion

Hi all,

In the Sequence Alignment/Map Format Specification document, it says that the POS field is actually "1-based leftmost mapping POSition of the first matching base." So, based on my understanding, it is the first position in the reference on which the first alignment match (M, can be a sequence match or mismatch) occurs, am I right? If this is the case, I infer that (since I didn't find any such situation in my own data), if the CIGAR is like 10S1D139M, I mean if the first operator after the clip (H or S) is D (deletion), so the POS should be one after the position where the deletion happens?

Actually, I am trying to develop a tool which can handle the read "overlap" in the pair-end data. As you know, the "overlap" will happen if our fragment is short and read length of the sequencer is long. In the "overlap" region, actually we can do more, for example, if the two of overlapped bases are not same, we think the mapping at this position is not good enough and will reduce the base quality or something similar.

Now, the problem is I need to find that in the two reads (forward and backward), which two bases are actually from the same position in the reference sequence, this will be easy if all bases are alignment matches (M), no insertion (I) and deletion (D). But if it is not the case, I find that it is a little bit complicated to find the overlap bases, we need to consider the insertion and deletion, for example, if there is an insertion in one position of one read, to check the overlap, we need use the base after this position and the base at the insertion position on the other read in pair (assume that there is no insertion on the other read in pair) . This is just one simple case, I need to find all overlapped bases under this crazy condition and check whether they are equal or not.

If anybody have better solution and willing to share, it will be appreciated very much. Thank you very much.

bless~
blaboon is offline   Reply With Quote
Old 01-08-2015, 10:13 PM   #2
lindenb
Senior Member
 
Location: France

Join Date: Apr 2010
Posts: 143
Default

"I mean if the first operator after the clip (H or S) is D (deletion), so the POS should be one after the position where the deletion happens? "

YES. The position starts at the first M/=.
lindenb is offline   Reply With Quote
Old 01-09-2015, 01:22 AM   #3
blaboon
Bioinformatics Specialist
 
Location: Singapore

Join Date: Nov 2014
Posts: 11
Default

Quote:
Originally Posted by lindenb View Post
"I mean if the first operator after the clip (H or S) is D (deletion), so the POS should be one after the position where the deletion happens? "

YES. The position starts at the first M/=.
Hi, thanks for your quick reply.

In my understanding of the SAM specification, M is equal to X/=, which means alignment match can be sequence match or mismatch , please suggest.

bless~
blaboon is offline   Reply With Quote
Old 01-09-2015, 01:31 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Correct, POS refers to the first of M=X, though you rarely see X or = in real life.
dpryan is offline   Reply With Quote
Old 01-09-2015, 07:12 AM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

This came up in discussion on the samtools-dev mailing list, I think James Bonfield constructed some good examples... Evidently the spec needs a bit more clarification here?
maubp is offline   Reply With Quote
Old 01-09-2015, 07:15 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Do you remember when that came up? I thought I recalled that but couldn't find it with some quick searching.
dpryan is offline   Reply With Quote
Old 01-09-2015, 09:48 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by dpryan View Post
Correct, POS refers to the first of M=X, though you rarely see X or = in real life.
I see it all the time, since BBMap outputs those by default

"10S1D139M" is not a cigar string that should ever be produced. "D" should be internal to "M/X/=". If you see a read that violates this, just throw it away; it's nonsense.

"I" is a bit more tricky; it can occur and be valid at the ends. In that case, those bases should be ignored with respect to the POS, just like "S" bases.

Incidentally, I wrote another tool, BBMerge, which can merge paired reads, and adjusts the quality of overlapping bases to reflect whether or not they match. It's mainly used for merging reads by overlap, but it can merge based on mapping locations also, if you use it like this (the example assumes interleaved reads but they can be in two files also):

bbmap.sh ref=reference.fasta in=reads.fastq outm=mapped.fastq pairedonly renamebymapping pairlen=800
bbmerge.sh in=mapped.fq out=merged.fq usemapping parsecustom


Sorry, bbmerge does not currently work with sam/bam files; it requires custom headers on reads to merge them based on mapping data. The first step maps the reads and adds custom headers. The "pairlen" flag when mapping restricts the maximum distance between paired reads - if the reads overlap this is negative, and if they don't it is positive; insert size = (pairlen + read1 length + read2 length).

Last edited by Brian Bushnell; 01-09-2015 at 10:10 AM.
Brian Bushnell is offline   Reply With Quote
Old 01-09-2015, 04:23 PM   #8
blaboon
Bioinformatics Specialist
 
Location: Singapore

Join Date: Nov 2014
Posts: 11
Default

Quote:
Originally Posted by dpryan View Post
Correct, POS refers to the first of M=X, though you rarely see X or = in real life.
Great, thanks, that really clear my confusion.
blaboon is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:35 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO