Hi all,
The Sequence Alignment/Map Format Specification describes the POS field as the "1-based leftmost mapping POSition of the first matching base." As I understand it, that is the reference position of the first alignment match (M, which can be a sequence match or a mismatch). Is that right? If so, I infer (I have not found such a case in my own data) that for a CIGAR like 10S1D139M, where the first operator after the clip (H or S) is a D (deletion), POS should be the position one after the deleted base?
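To make the question concrete, here is a minimal sketch in Python that parses a CIGAR string and counts how many reference bases are consumed before the first alignment match. The helper names (`parse_cigar`, `first_match_offset`) are hypothetical, not from any library. Note that later revisions of the SAM specification word POS in terms of the first CIGAR operation that consumes a reference base; under that reading (an assumption here), the first M base of 10S1D139M would sit at POS + 1, whereas under the "first matching base" reading it would sit at POS itself.

```python
import re

# One CIGAR element: a length followed by an operator letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators that advance the reference coordinate.
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operator) tuples."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"unsupported CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def first_match_offset(cigar):
    """Count reference bases consumed before the first M/=/X operation.

    Returns None if the CIGAR contains no alignment match at all.
    """
    offset = 0
    for n, op in parse_cigar(cigar):
        if op in "M=X":
            return offset
        if op in CONSUMES_REF:
            offset += n
    return None


# Assumption: POS anchors the first reference-consuming operation,
# so the first matched base lies at POS + first_match_offset(cigar).
offset_with_deletion = first_match_offset("10S1D139M")
offset_without_deletion = first_match_offset("10S139M")
```

Soft and hard clips do not move the reference coordinate in this sketch, so only the leading D contributes to the offset.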
I am trying to develop a tool that handles read "overlap" in paired-end data. As you know, the two mates overlap when the fragment is short relative to the read length of the sequencer. The overlap region gives us something extra to work with: for example, if the two overlapping bases are not the same, we can consider the mapping at that position not good enough and reduce the base quality, or something similar.
The problem is that I need to find which bases in the two reads (forward and reverse) come from the same position in the reference sequence. This is easy if all bases are alignment matches (M), with no insertions (I) or deletions (D). Otherwise it gets complicated, because insertions and deletions have to be taken into account. For example, if one read has an insertion at some position and its mate has none, then to check the overlap we need to compare the base after the insertion on the first read with the base at the insertion position on the mate. That is just one simple case; I need to find all overlapping bases under these conditions and check whether they are equal.
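One way to avoid case-by-case handling is to walk each CIGAR once, record which read index lands on which reference position, and then compare only the positions both mates cover. The following is a rough sketch of that idea, not a tested tool. The function names are hypothetical, and it assumes that POS is the reference position of the first reference-consuming operation and that both SEQ strings are given as stored in the SAM record (the format stores SEQ on the reference forward strand, so no reverse complement is applied here). Inserted bases have no reference coordinate and are simply skipped.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operator) tuples."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"unsupported CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def ref_to_read_index(pos, cigar):
    """Map 1-based reference position -> 0-based index into SEQ.

    Only aligned bases (M, =, X) get an entry.
    """
    mapping = {}
    ref, idx = pos, 0
    for n, op in parse_cigar(cigar):
        if op in "M=X":            # consumes both reference and read
            for i in range(n):
                mapping[ref + i] = idx + i
            ref += n
            idx += n
        elif op in "IS":           # consumes read only
            idx += n
        elif op in "DN":           # consumes reference only
            ref += n
        # H and P consume neither coordinate
    return mapping


def overlap_mismatches(pos1, cigar1, seq1, pos2, cigar2, seq2):
    """List (ref_pos, base1, base2) where the mates overlap and disagree."""
    m1 = ref_to_read_index(pos1, cigar1)
    m2 = ref_to_read_index(pos2, cigar2)
    result = []
    for ref in sorted(m1.keys() & m2.keys()):
        b1, b2 = seq1[m1[ref]], seq2[m2[ref]]
        if b1 != b2:
            result.append((ref, b1, b2))
    return result


# Made-up reads: the second mate carries a one-base insertion.
mismatches = overlap_mismatches(100, "5M", "ACGTA", 102, "2M1I2M", "GCTAC")
```

In this made-up example the mates share reference positions 102 to 104, and only position 103 disagrees (T on the first read, C on the second). The same index lookup could be used to lower the base quality at the disagreeing positions.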
If anybody has a better solution and is willing to share it, I would appreciate it very much. Thank you.
bless~