SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BWA generating incorrect CIGAR string? foxyg Bioinformatics 6 09-16-2011 11:22 AM
The 'S' in CIGAR of sam file (bwa) qixiaofei General 6 09-15-2011 11:28 PM
bowtie - invalid CIGAR string - wrong sam format genome Bioinformatics 2 02-16-2011 01:36 PM
generate CIGAR string from 2 sequences? bbimber Bioinformatics 0 03-20-2010 09:44 AM
bwa MD and cigar fields inconsistency biterbilen Bioinformatics 4 07-28-2009 08:37 AM

Reply
 
Thread Tools
Old 06-22-2010, 04:48 PM   #1
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default CIGAR string from BWA-SW output incorrect ?

I am quite happy with BWA-SW, except that it seems to generate incorrect CIGAR strings (and therefore incorrect SEQ fields) in the SAM output files. Maybe, I am confused with the output. Take a look at the two examples and please help me to understand them. I have a lot more of these, if you want more complex ones.

Example 1: reference sequence is 145 bp long, alignment starts at 92 and has 112M, meaning 92+112 > 145, right?
@SQ SN:ref1 LN:145
q1 0 ref1 92 0 65S112M148S * 0 0 CACGCACGCACGCACACACACACACACACACACACACACACACACACACACACACACACACACACGCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACGGACACAGACACAGACAAAGACTCAGACACAAACTCAGACACAGACACAGATACAGACACAGACAAAGACACAGAAACTGAAACAGACACACAGACACAGACACAGACACAGACAAAGACACAGAAACTGAAACAGACACACAGACAC * AS:i:154 XS:i:154 XF:i:2 XE:i:0 XN:i:0

Example 2: reference sequence is 139 bp long, alignment starts at 1 and has 83M1D56M, meaning 83+1+56 > 139, right?
@SQ SN:ref2 LN:139
q2 16 ref2 1 1 92S83M1D56M201S * 0 0 GGAATTCAAAGTAATGGAAACAAACCTAGTGGAATATAATGGAATGGAAAGGACTGGAATGTAATGGAATGGATTGGAATAAACCCGATTCCAATGCAATGGAATGGAATGGAATGGAATGGAATGGAATCGAACGGAATCAACCTGAGTTTAATGGAATGGAATGGAATGGAATGAATGGAATGGAATTCAATGTAATGGAAACAAACCGAGGGGAATATAATGTAATGGAAAGGACTGGAATGTAATGGAATGGATTGGAATCAACCCGATTCCTATGCAACGGAATGGAATGTAATGGAATGGAATGGAAAGGAATGGAAAGAACTGGAATGGAATGGAATGGAATGGAATTTAATGGAATGGAATGCAATGGAATGGAATCAACCTGAGAGGAGTGGAATGGAATGGAATGGAAATGAATGGAACCGA * AS:i:73 XS:i:69 XF:i:2 XE:i:1 XN:i:0
robs is offline   Reply With Quote
Old 06-22-2010, 04:56 PM   #2
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default

I forgot to mention that I am using bwa-0.5.8.
robs is offline   Reply With Quote
Old 06-22-2010, 09:31 PM   #3
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default

One more question:
Does it make sense to have an insert after a soft clipping region?

For example, I get the CIGAR string 156S15I264M13I189M.

Any help is appreciated. :-)
robs is offline   Reply With Quote
Old 06-23-2010, 05:19 PM   #4
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

internally your reference sequences are concatenated. bwasw may map sequences to the junction of adjacent sequences. bwasw identifies and fixes cigar afterwards. unfortunately, it does not always work properly. this will be fixed in future. due to this flaw, bwasw would not work well for many very short references. i am sorry.

perhaps you may post your references and one read, so can I fix when have time. thanks.
lh3 is offline   Reply With Quote
Old 06-23-2010, 05:57 PM   #5
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default

Thanks for the explanation. :-)

The sequences were aligned to the unique sequences in the Watson genome (http://www.ncbi.nlm.nih.gov/sites/en...m=ABKV01000000). There are two example sequences in my first post. If you need more, I can either send you a SAM file with lots of weird looking CIGAR strings or just some FASTA file. Would be happy to get this problem solved, so I can use BWA-SW. :-)

Any change that this will be fixed in the next few weeks, or should I take a look at the source code myself?
Is there a function to get the actual alignment instead of the CIGAR string? Maybe, this would help to figure out the problem faster.
robs is offline   Reply With Quote
Old 06-23-2010, 06:12 PM   #6
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

I am afraid that I cannot fix that soon. If you would like to have a look, you may check the fix_cigar() function in bwtsw2_aux.c. This function trims the part of cigar standing out of the reference sequence. It is moderately standalone.

Thank you.
lh3 is offline   Reply With Quote
Old 07-13-2010, 12:14 AM   #7
glacierbird
Member
 
Location: germany

Join Date: Dec 2009
Posts: 15
Default

I have the same problem with CIGAR by using TOPHAT

tophat -p 4 --solexa1.3-quals -o directory index fastq

there are several lines cause error msg when I want to convert .sam to .bam

samtools view -bS -t index.fa.fai -o tmp.bam accepted_hits.sam

Parse error at line 2103: CIGAR and sequence length are inconsistent

And this line 2103 looks like this:

HWI-EAS210R:5:55:1018:1930#0 0 scaffold00002 428427 1 36M92N36M473N536870911M * 0 0 CTCATCTCTCATCTCAGAGAGGTCCTCCAGGAGCAGGAATATAAGTTGGATCAAGTACGCAATCATCTTCA 992;93?5B=<B=;7;?;:988:146.4429;50?;;17@=A@6=7::<8?(=4+45535659'5525:74 NM:i:0 XS:A:+ NS:i:0
glacierbird is offline   Reply With Quote
Old 07-13-2010, 12:15 AM   #8
glacierbird
Member
 
Location: germany

Join Date: Dec 2009
Posts: 15
Default

I have tried to delete
536870911M

But it still doesn't work.

I have no clue where is these 536970911 from.
glacierbird is offline   Reply With Quote
Old 07-13-2010, 03:24 PM   #9
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default

The problem is not solved by removing the last term of the cigar string, since then you will miss part of the alignment. You could calculate the length of the query and replace the last M value with the difference of cigar and query length.

I am playing around with the BWA code to fix the problem for BWA. Not sure about Tophat, though.

I assume the problem comes from the concatenated references as well or simply by miscalculating the cigar string when going over the boundaries.

Please keep us updated if you find a solution.
robs is offline   Reply With Quote
Old 07-14-2010, 04:28 AM   #10
glacierbird
Member
 
Location: germany

Join Date: Dec 2009
Posts: 15
Default

Hi,


Quote:
Originally Posted by robs View Post
The problem is not solved by removing the last term of the cigar string, since then you will miss part of the alignment. You could calculate the length of the query and replace the last M value with the difference of cigar and query length.
what do you mean of query length? I suppose query length = read length. If so, every read length=72

From this CIGAR "36M92N36M473N536870911M"
36M+36M=72, that's why I want to completely delete 536870911M


Quote:
I am playing around with the BWA code to fix the problem for BWA. Not sure about Tophat, though.
I think the error is not produced by tophat, it actually happened when tophat runs bowtie.

Quote:
I assume the problem comes from the concatenated references as well or simply by miscalculating the cigar string when going over the boundaries.

Please keep us updated if you find a solution.
I did modify the reference fasta file. becasue of samtools.
samtools faidx fasta_file.fa
[fai_build_core] line length exceeds 65535 in sequence

I used Bio:seqIO to reform the reference fasta file with defaults line wrapping of 60bp.

I am not sure that's the reason for the error msg.

Last edited by glacierbird; 07-14-2010 at 04:34 AM.
glacierbird is offline   Reply With Quote
Old 07-14-2010, 10:33 AM   #11
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default

6
Quote:
M+36M=72, that's why I want to completely delete 536870911M
Then you also need to remove the part with N, I think. Just give it a try.

Quote:
I did modify the reference fasta file. becasue of samtools.
samtools faidx fasta_file.fa
[fai_build_core] line length exceeds 65535 in sequence

I used Bio:seqIO to reform the reference fasta file with defaults line wrapping of 60bp.

I am not sure that's the reason for the error msg.
The line wrapping should not make a difference, since the sequences should be converted into internal file formats when generating a database or index. Maybe this is a restriction from the C code for parsing lines, but not sure.
The only reason why the reformating might have changed anything is if your file contained dos/win line breaks instead of unix ones and the program does not check for it.
robs is offline   Reply With Quote
Old 08-05-2010, 04:49 PM   #12
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default

Quote:
Originally Posted by lh3 View Post
I am afraid that I cannot fix that soon. If you would like to have a look, you may check the fix_cigar() function in bwtsw2_aux.c. This function trims the part of cigar standing out of the reference sequence. It is moderately standalone.
I made some changes/fixes to several files related to the BWA-SW part and was wondering if you could give me permission to commit my changes to your sourceforge SVN. Please send me a privat message if you need more details about the changes. Thanks!
robs is offline   Reply With Quote
Old 03-21-2011, 03:26 PM   #13
Guidobot
Junior Member
 
Location: San Diego

Join Date: Jan 2011
Posts: 8
Default

Quote:
Originally Posted by glacierbird View Post
I have the same problem with CIGAR by using TOPHAT

tophat -p 4 --solexa1.3-quals -o directory index fastq

there are several lines cause error msg when I want to convert .sam to .bam

samtools view -bS -t index.fa.fai -o tmp.bam accepted_hits.sam

Parse error at line 2103: CIGAR and sequence length are inconsistent

And this line 2103 looks like this:

HWI-EAS210R:5:55:1018:1930#0 0 scaffold00002 428427 1 36M92N36M473N536870911M * 0 0 CTCATCTCTCATCTCAGAGAGGTCCTCCAGGAGCAGGAATATAAGTTGGATCAAGTACGCAATCATCTTCA 992;93?5B=<B=;7;?;:988:146.4429;50?;;17@=A@6=7::<8?(=4+45535659'5525:74 NM:i:0 XS:A:+ NS:i:0
I'm having a very similar issue with my SAM file produced from a BWA mapping.
Whenever there there is a more complex CIGAR string put out the quality score string is corrupted or missing, resulting in a "sequence and quality are inconsistent" error when attempting to convert to BAM.

I've been removing the offending mapping lines but this means repeatedly editing a large file (500Mb), which takes considerable time (e.g. using head/tail).
Guidobot is offline   Reply With Quote
Old 01-13-2012, 04:07 AM   #14
lukas1848
Member
 
Location: Germany

Join Date: Jun 2011
Posts: 54
Default

Does anyone know whether this issue has been resolved in newer versions of BWA?
lukas1848 is offline   Reply With Quote
Reply

Tags
bwa, bwasw, cigar, sam

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:40 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO