Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • robs
    Senior Member
    • May 2010
    • 116

    CIGAR string from BWA-SW output incorrect ?

    I am quite happy with BWA-SW, except that it seems to generate incorrect CIGAR strings (and therefore incorrect SEQ fields) in the SAM output files. Maybe, I am confused with the output. Take a look at the two examples and please help me to understand them. I have a lot more of these, if you want more complex ones.

    Example 1: reference sequence is 145 bp long, alignment starts at 92 and has 112M, meaning 92+112 > 145, right?
    @SQ SN:ref1 LN:145
    q1 0 ref1 92 0 65S112M148S * 0 0 CACGCACGCACGCACACACACACACACACACACACACACACACACACACACACACACACACACACGCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACGGACACAGACACAGACAAAGACTCAGACACAAACTCAGACACAGACACAGATACAGACACAGACAAAGACACAGAAACTGAAACAGACACACAGACACAGACACAGACACAGACAAAGACACAGAAACTGAAACAGACACACAGACAC * AS:i:154 XS:i:154 XF:i:2 XE:i:0 XN:i:0

    Example 2: reference sequence is 139 bp long, alignment starts at 1 and has 83M1D56M, meaning 83+1+56 > 139, right?
    @SQ SN:ref2 LN:139
    q2 16 ref2 1 1 92S83M1D56M201S * 0 0 GGAATTCAAAGTAATGGAAACAAACCTAGTGGAATATAATGGAATGGAAAGGACTGGAATGTAATGGAATGGATTGGAATAAACCCGATTCCAATGCAATGGAATGGAATGGAATGGAATGGAATGGAATCGAACGGAATCAACCTGAGTTTAATGGAATGGAATGGAATGGAATGAATGGAATGGAATTCAATGTAATGGAAACAAACCGAGGGGAATATAATGTAATGGAAAGGACTGGAATGTAATGGAATGGATTGGAATCAACCCGATTCCTATGCAACGGAATGGAATGTAATGGAATGGAATGGAAAGGAATGGAAAGAACTGGAATGGAATGGAATGGAATGGAATTTAATGGAATGGAATGCAATGGAATGGAATCAACCTGAGAGGAGTGGAATGGAATGGAATGGAAATGAATGGAACCGA * AS:i:73 XS:i:69 XF:i:2 XE:i:1 XN:i:0
  • robs
    Senior Member
    • May 2010
    • 116

    #2
    I forgot to mention that I am using bwa-0.5.8.

    Comment

    • robs
      Senior Member
      • May 2010
      • 116

      #3
      One more question:
      Does it make sense to have an insert after a soft clipping region?

      For example, I get the CIGAR string 156S15I264M13I189M.

      Any help is appreciated. :-)

      Comment

      • lh3
        Senior Member
        • Feb 2008
        • 686

        #4
        internally your reference sequences are concatenated. bwasw may map sequences to the junction of adjacent sequences. bwasw identifies and fixes cigar afterwards. unfortunately, it does not always work properly. this will be fixed in future. due to this flaw, bwasw would not work well for many very short references. i am sorry.

        perhaps you may post your references and one read, so can I fix when have time. thanks.

        Comment

        • robs
          Senior Member
          • May 2010
          • 116

          #5
          Thanks for the explanation. :-)

          The sequences were aligned to the unique sequences in the Watson genome (http://www.ncbi.nlm.nih.gov/sites/en...m=ABKV01000000). There are two example sequences in my first post. If you need more, I can either send you a SAM file with lots of weird looking CIGAR strings or just some FASTA file. Would be happy to get this problem solved, so I can use BWA-SW. :-)

          Any change that this will be fixed in the next few weeks, or should I take a look at the source code myself?
          Is there a function to get the actual alignment instead of the CIGAR string? Maybe, this would help to figure out the problem faster.

          Comment

          • lh3
            Senior Member
            • Feb 2008
            • 686

            #6
            I am afraid that I cannot fix that soon. If you would like to have a look, you may check the fix_cigar() function in bwtsw2_aux.c. This function trims the part of cigar standing out of the reference sequence. It is moderately standalone.

            Thank you.

            Comment

            • glacierbird
              Member
              • Dec 2009
              • 15

              #7
              I have the same problem with CIGAR by using TOPHAT

              tophat -p 4 --solexa1.3-quals -o directory index fastq

              there are several lines cause error msg when I want to convert .sam to .bam

              samtools view -bS -t index.fa.fai -o tmp.bam accepted_hits.sam

              Parse error at line 2103: CIGAR and sequence length are inconsistent

              And this line 2103 looks like this:

              HWI-EAS210R:5:55:1018:1930#0 0 scaffold00002 428427 1 36M92N36M473N536870911M * 0 0 CTCATCTCTCATCTCAGAGAGGTCCTCCAGGAGCAGGAATATAAGTTGGATCAAGTACGCAATCATCTTCA 992;93?5B=<B=;7;?;:988:146.4429;50?;;17@=A@6=7::<8?(=4+45535659'5525:74 NM:i:0 XS:A:+ NS:i:0

              Comment

              • glacierbird
                Member
                • Dec 2009
                • 15

                #8
                I have tried to delete
                536870911M

                But it still doesn't work.

                I have no clue where is these 536970911 from.

                Comment

                • robs
                  Senior Member
                  • May 2010
                  • 116

                  #9
                  The problem is not solved by removing the last term of the cigar string, since then you will miss part of the alignment. You could calculate the length of the query and replace the last M value with the difference of cigar and query length.

                  I am playing around with the BWA code to fix the problem for BWA. Not sure about Tophat, though.

                  I assume the problem comes from the concatenated references as well or simply by miscalculating the cigar string when going over the boundaries.

                  Please keep us updated if you find a solution.

                  Comment

                  • glacierbird
                    Member
                    • Dec 2009
                    • 15

                    #10
                    Hi,


                    Originally posted by robs View Post
                    The problem is not solved by removing the last term of the cigar string, since then you will miss part of the alignment. You could calculate the length of the query and replace the last M value with the difference of cigar and query length.
                    what do you mean of query length? I suppose query length = read length. If so, every read length=72

                    From this CIGAR "36M92N36M473N536870911M"
                    36M+36M=72, that's why I want to completely delete 536870911M


                    I am playing around with the BWA code to fix the problem for BWA. Not sure about Tophat, though.
                    I think the error is not produced by tophat, it actually happened when tophat runs bowtie.

                    I assume the problem comes from the concatenated references as well or simply by miscalculating the cigar string when going over the boundaries.

                    Please keep us updated if you find a solution.
                    I did modify the reference fasta file. becasue of samtools.
                    samtools faidx fasta_file.fa
                    [fai_build_core] line length exceeds 65535 in sequence

                    I used Bio:seqIO to reform the reference fasta file with defaults line wrapping of 60bp.

                    I am not sure that's the reason for the error msg.
                    Last edited by glacierbird; 07-14-2010, 04:34 AM.

                    Comment

                    • robs
                      Senior Member
                      • May 2010
                      • 116

                      #11
                      6
                      M+36M=72, that's why I want to completely delete 536870911M
                      Then you also need to remove the part with N, I think. Just give it a try.

                      I did modify the reference fasta file. becasue of samtools.
                      samtools faidx fasta_file.fa
                      [fai_build_core] line length exceeds 65535 in sequence

                      I used Bio:seqIO to reform the reference fasta file with defaults line wrapping of 60bp.

                      I am not sure that's the reason for the error msg.
                      The line wrapping should not make a difference, since the sequences should be converted into internal file formats when generating a database or index. Maybe this is a restriction from the C code for parsing lines, but not sure.
                      The only reason why the reformating might have changed anything is if your file contained dos/win line breaks instead of unix ones and the program does not check for it.

                      Comment

                      • robs
                        Senior Member
                        • May 2010
                        • 116

                        #12
                        Originally posted by lh3 View Post
                        I am afraid that I cannot fix that soon. If you would like to have a look, you may check the fix_cigar() function in bwtsw2_aux.c. This function trims the part of cigar standing out of the reference sequence. It is moderately standalone.
                        I made some changes/fixes to several files related to the BWA-SW part and was wondering if you could give me permission to commit my changes to your sourceforge SVN. Please send me a privat message if you need more details about the changes. Thanks!

                        Comment

                        • Guidobot
                          Junior Member
                          • Jan 2011
                          • 8

                          #13
                          Originally posted by glacierbird View Post
                          I have the same problem with CIGAR by using TOPHAT

                          tophat -p 4 --solexa1.3-quals -o directory index fastq

                          there are several lines cause error msg when I want to convert .sam to .bam

                          samtools view -bS -t index.fa.fai -o tmp.bam accepted_hits.sam

                          Parse error at line 2103: CIGAR and sequence length are inconsistent

                          And this line 2103 looks like this:

                          HWI-EAS210R:5:55:1018:1930#0 0 scaffold00002 428427 1 36M92N36M473N536870911M * 0 0 CTCATCTCTCATCTCAGAGAGGTCCTCCAGGAGCAGGAATATAAGTTGGATCAAGTACGCAATCATCTTCA 992;93?5B=<B=;7;?;:988:146.4429;50?;;17@=A@6=7::<8?(=4+45535659'5525:74 NM:i:0 XS:A:+ NS:i:0
                          I'm having a very similar issue with my SAM file produced from a BWA mapping.
                          Whenever there there is a more complex CIGAR string put out the quality score string is corrupted or missing, resulting in a "sequence and quality are inconsistent" error when attempting to convert to BAM.

                          I've been removing the offending mapping lines but this means repeatedly editing a large file (500Mb), which takes considerable time (e.g. using head/tail).

                          Comment

                          • lukas1848
                            Member
                            • Jun 2011
                            • 54

                            #14
                            Does anyone know whether this issue has been resolved in newer versions of BWA?

                            Comment

                            Latest Articles

                            Collapse

                            • GATTACAT
                              Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                              by GATTACAT
                              Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                              Yesterday, 11:43 AM
                            • SEQadmin2
                              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                              by SEQadmin2


                              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                              Here are nine questions we think about, in roughly the order they matter, before...
                              06-18-2026, 07:11 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by SEQadmin2, Today, 11:08 AM
                            0 responses
                            6 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-30-2026, 05:37 AM
                            0 responses
                            11 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-26-2026, 11:10 AM
                            0 responses
                            19 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-17-2026, 06:09 AM
                            0 responses
                            53 views
                            0 reactions
                            Last Post SEQadmin2  
                            Working...