SEQanswers

Old 07-25-2009, 02:59 AM   #1
biterbilen
Junior Member
 
Location: Basel

Join Date: Jun 2009
Posts: 6
bwa MD and cigar fields inconsistency

Hello,

I have sequencing data where the position and frequency of mismatches play an important role in the downstream analysis. I generated short-read mappings using BWA. In the samse output file, the MD and CIGAR fields look inconsistent to me. As far as I can tell, the MD field is generated from the CIGAR field and should be consistent with it. Has anyone had the same problem?

>less uniq_part_001.fastq.sam | cut -f 1-6,10,19 |grep -v "*"| head
seqAAA_0 16 chr12 50113894 25 36M CGCCATCTGTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:1A34
seqAAA_1 0 chr4 178545809 0 36M AAAAAAAAAAAAAAAAAAAAAAAAAAACATATGCCT MD:Z:34T1
seqAAA_3 16 chr10 6381930 25 36M CAGAAGACGTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:8T27
seqAAA_4 0 chr4 39577867 25 36M AAAAAAAAAAAAAAAAAAAAAAAAAAACGTTTGCCC MD:Z:27A8
seqAAA_7 0 chr9 117092453 25 36M AAAAAAAAAAAAAAAAAAAAAAAAAAACTTATGCCC MD:Z:26C9
seqAAA_8 16 chr20 17734163 25 36M CGGAATAAGTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:0A35
seqAAA_10 0 chr8 112121071 0 36M AAAAAAAAAAAAAAAAAAAAAAAAAAACTTTAGCCG MD:Z:28A7
seqAAA_11 16 chr2 66418968 0 36M GCATACCTCTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:36
seqAAA_12 0 chr16 73425684 0 36M AAAAAAAAAAAAAAAAAAAAAAAAAAAGCTGGGACC MD:Z:34G1
seqAAA_13 16 chr3 22762342 0 36M GGGGCATCCTTTTTTTTTTTTTTTTTTTTTTTTTTT MD:Z:1T34

Biter Bilen
Old 07-27-2009, 10:56 AM   #2
SillyPoint
Member
 
Location: Frederick MD, USA

Join Date: May 2008
Posts: 39

According to the latest (I think) SAM format spec, at http://samtools.sourceforge.net/SAM1.pdf, an M in the Cigar field is either "Match or mismatch" (sec. 2.2.3) -- as differentiated from an insertion or deletion. Only in the MD field do you get told whether it was a match or mismatch at a given position (see footnote 3 to the table in sec 2.2.4).

So for your first read, the reference has an A in the 2nd position, where the read has a G. All other bases match.
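
To make that concrete, here is a minimal Python sketch of how the MD tag can be walked to list mismatches. It assumes a CIGAR with a single M run (as in 36M), so read offsets are not shifted by insertions or soft clips. The names `md_mismatches` and `md_ref_length` are made up for this example and are not part of BWA or samtools.

```python
import re

# One MD token is a run of matches (digits), a deletion ('^' followed by
# reference bases), or a single mismatched reference base (a letter).
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def md_mismatches(md, seq):
    """Return (read_position, ref_base, read_base) for each mismatch.

    Positions are 1-based offsets into SEQ as written in the SAM line.
    Assumption: the CIGAR is one M run, so no insertions or clipping.
    """
    pos = 0  # read bases consumed so far
    found = []
    for run, deletion, base in MD_TOKEN.findall(md):
        if run:
            pos += int(run)
        elif deletion:
            continue  # deleted reference bases consume no read bases
        else:
            found.append((pos + 1, base, seq[pos]))
            pos += 1
    return found


def md_ref_length(md):
    """Count the reference bases an MD string covers."""
    total = 0
    for run, deletion, base in MD_TOKEN.findall(md):
        if run:
            total += int(run)
        elif deletion:
            total += len(deletion) - 1  # ignore the leading '^'
        else:
            total += 1
    return total


# First read from the post above (seqAAA_0, CIGAR 36M, MD:Z:1A34).
seq = "CGCCATCTGTTTTTTTTTTTTTTTTTTTTTTTTTTT"
print(md_mismatches("1A34", seq))  # [(2, 'A', 'G')]
print(md_ref_length("1A34"))       # 36
```

The second function shows the sense in which the two fields agree: the MD string covers 36 reference bases, the same length that 36M spans.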

Not sure why Cigar works that way -- historical reasons, probably. lh3 may know more.

SillyPoint
Old 07-27-2009, 12:18 PM   #3
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by SillyPoint View Post
According to the latest (I think) SAM format spec, at http://samtools.sourceforge.net/SAM1.pdf, an M in the Cigar field is either "Match or mismatch" (sec. 2.2.3) -- as differentiated from an insertion or deletion. Only in the MD field do you get told whether it was a match or mismatch at a given position (see footnote 3 to the table in sec 2.2.4).

So for your first read, the reference has an A in the 2nd position, where the read has a G. All other bases match.

Not sure why Cigar works that way -- historical reasons, probably. lh3 may know more.

SillyPoint
I believe protein-encoded CIGAR strings (the origin of CIGAR, or run-length encoding, in sequence formats) differentiate between match and mismatch, and there is discussion of changing the SAM format to do this as well (among many other things). For the SAM format, check the SourceForge SAMtools developer mailing list, where this is an active area of discussion. Please comment on the format here and on the mailing lists so the community and developers can respond to your needs in any upcoming SAM format.
Old 07-28-2009, 01:27 AM   #4
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

Someone told me CIGAR was first introduced in Ensembl/exonerate and was designed for nucleotide alignment. The original CIGAR contained only three operations, M/I/D, where M stands for an alignment match, which can be either a sequence match or a mismatch. SAM's CIGAR is an extension and so keeps M. We are in the middle of adding new operations to differentiate sequence match from mismatch.
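
As a rough illustration of what such operations would look like, the Python sketch below rewrites an all-M CIGAR using the MD tag. The post does not say which letters were planned; the sketch borrows `=` (sequence match) and `X` (sequence mismatch) from later versions of the SAM spec. The function name `cigar_m_to_eq_x` is hypothetical, and it only handles a single M run with no deletions in MD.

```python
import re


def cigar_m_to_eq_x(cigar, md):
    """Rewrite an all-M CIGAR (e.g. '36M') with '=' and 'X' taken from MD.

    Sketch only: supports one M operation and an MD tag without deletions.
    """
    whole = re.fullmatch(r"(\d+)M", cigar)
    if not whole:
        raise ValueError("only a single M operation is supported")
    if "^" in md:
        raise ValueError("MD deletions are not supported in this sketch")

    ops = []  # list of [length, operator]
    for run, base in re.findall(r"(\d+)|([A-Za-z])", md):
        length, op = (int(run), "=") if run else (1, "X")
        if length == 0:
            continue  # MD writes 0 between adjacent mismatches
        if ops and ops[-1][1] == op:
            ops[-1][0] += length  # merge with the previous run
        else:
            ops.append([length, op])

    if sum(n for n, _ in ops) != int(whole.group(1)):
        raise ValueError("MD and CIGAR lengths disagree")
    return "".join(f"{n}{op}" for n, op in ops)


print(cigar_m_to_eq_x("36M", "1A34"))  # 1=1X34=
print(cigar_m_to_eq_x("36M", "36"))    # 36=
```

The length check at the end doubles as a quick test that an MD string and an M-only CIGAR describe the same number of aligned bases.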
Old 07-28-2009, 09:37 AM   #5
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by lh3 View Post
Someone told me CIGAR was first introduced in Ensembl/exonerate and was designed for nucleotide alignment. The original CIGAR contained only three operations, M/I/D, where M stands for an alignment match, which can be either a sequence match or a mismatch. SAM's CIGAR is an extension and so keeps M. We are in the middle of adding new operations to differentiate sequence match from mismatch.
I guess I should question my sources. I think Guy Slater first introduced CIGAR for Exonerate? Looking at Exonerate, you are right: it has only M/I/D. Does anyone else know more?