Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • CIGAR string from BWA-SW output incorrect ?

    I am quite happy with BWA-SW, except that it seems to generate incorrect CIGAR strings (and therefore incorrect SEQ fields) in the SAM output files. Maybe, I am confused with the output. Take a look at the two examples and please help me to understand them. I have a lot more of these, if you want more complex ones.

    Example 1: reference sequence is 145 bp long, alignment starts at 92 and has 112M, meaning 92+112 > 145, right?
    @SQ SN:ref1 LN:145
    q1 0 ref1 92 0 65S112M148S * 0 0 CACGCACGCACGCACACACACACACACACACACACACACACACACACACACACACACACACACACGCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACGGACACAGACACAGACAAAGACTCAGACACAAACTCAGACACAGACACAGATACAGACACAGACAAAGACACAGAAACTGAAACAGACACACAGACACAGACACAGACACAGACAAAGACACAGAAACTGAAACAGACACACAGACAC * AS:i:154 XS:i:154 XF:i:2 XE:i:0 XN:i:0

    Example 2: reference sequence is 139 bp long, alignment starts at 1 and has 83M1D56M, meaning 83+1+56 > 139, right?
    @SQ SN:ref2 LN:139
    q2 16 ref2 1 1 92S83M1D56M201S * 0 0 GGAATTCAAAGTAATGGAAACAAACCTAGTGGAATATAATGGAATGGAAAGGACTGGAATGTAATGGAATGGATTGGAATAAACCCGATTCCAATGCAATGGAATGGAATGGAATGGAATGGAATGGAATCGAACGGAATCAACCTGAGTTTAATGGAATGGAATGGAATGGAATGAATGGAATGGAATTCAATGTAATGGAAACAAACCGAGGGGAATATAATGTAATGGAAAGGACTGGAATGTAATGGAATGGATTGGAATCAACCCGATTCCTATGCAACGGAATGGAATGTAATGGAATGGAATGGAAAGGAATGGAAAGAACTGGAATGGAATGGAATGGAATGGAATTTAATGGAATGGAATGCAATGGAATGGAATCAACCTGAGAGGAGTGGAATGGAATGGAATGGAAATGAATGGAACCGA * AS:i:73 XS:i:69 XF:i:2 XE:i:1 XN:i:0

  • #2
    I forgot to mention that I am using bwa-0.5.8.

    Comment


    • #3
      One more question:
      Does it make sense to have an insert after a soft clipping region?

      For example, I get the CIGAR string 156S15I264M13I189M.

      Any help is appreciated. :-)

      Comment


      • #4
        internally your reference sequences are concatenated. bwasw may map sequences to the junction of adjacent sequences. bwasw identifies and fixes cigar afterwards. unfortunately, it does not always work properly. this will be fixed in future. due to this flaw, bwasw would not work well for many very short references. i am sorry.

        perhaps you may post your references and one read, so can I fix when have time. thanks.

        Comment


        • #5
          Thanks for the explanation. :-)

          The sequences were aligned to the unique sequences in the Watson genome (http://www.ncbi.nlm.nih.gov/sites/en...m=ABKV01000000). There are two example sequences in my first post. If you need more, I can either send you a SAM file with lots of weird looking CIGAR strings or just some FASTA file. Would be happy to get this problem solved, so I can use BWA-SW. :-)

          Any change that this will be fixed in the next few weeks, or should I take a look at the source code myself?
          Is there a function to get the actual alignment instead of the CIGAR string? Maybe, this would help to figure out the problem faster.

          Comment


          • #6
            I am afraid that I cannot fix that soon. If you would like to have a look, you may check the fix_cigar() function in bwtsw2_aux.c. This function trims the part of cigar standing out of the reference sequence. It is moderately standalone.

            Thank you.

            Comment


            • #7
              I have the same problem with CIGAR by using TOPHAT

              tophat -p 4 --solexa1.3-quals -o directory index fastq

              there are several lines cause error msg when I want to convert .sam to .bam

              samtools view -bS -t index.fa.fai -o tmp.bam accepted_hits.sam

              Parse error at line 2103: CIGAR and sequence length are inconsistent

              And this line 2103 looks like this:

              HWI-EAS210R:5:55:1018:1930#0 0 scaffold00002 428427 1 36M92N36M473N536870911M * 0 0 CTCATCTCTCATCTCAGAGAGGTCCTCCAGGAGCAGGAATATAAGTTGGATCAAGTACGCAATCATCTTCA 992;93?5B=<B=;7;?;:988:146.4429;50?;;17@=A@6=7::<8?(=4+45535659'5525:74 NM:i:0 XS:A:+ NS:i:0

              Comment


              • #8
                I have tried to delete
                536870911M

                But it still doesn't work.

                I have no clue where is these 536970911 from.

                Comment


                • #9
                  The problem is not solved by removing the last term of the cigar string, since then you will miss part of the alignment. You could calculate the length of the query and replace the last M value with the difference of cigar and query length.

                  I am playing around with the BWA code to fix the problem for BWA. Not sure about Tophat, though.

                  I assume the problem comes from the concatenated references as well or simply by miscalculating the cigar string when going over the boundaries.

                  Please keep us updated if you find a solution.

                  Comment


                  • #10
                    Hi,


                    Originally posted by robs View Post
                    The problem is not solved by removing the last term of the cigar string, since then you will miss part of the alignment. You could calculate the length of the query and replace the last M value with the difference of cigar and query length.
                    what do you mean of query length? I suppose query length = read length. If so, every read length=72

                    From this CIGAR "36M92N36M473N536870911M"
                    36M+36M=72, that's why I want to completely delete 536870911M


                    I am playing around with the BWA code to fix the problem for BWA. Not sure about Tophat, though.
                    I think the error is not produced by tophat, it actually happened when tophat runs bowtie.

                    I assume the problem comes from the concatenated references as well or simply by miscalculating the cigar string when going over the boundaries.

                    Please keep us updated if you find a solution.
                    I did modify the reference fasta file. becasue of samtools.
                    samtools faidx fasta_file.fa
                    [fai_build_core] line length exceeds 65535 in sequence

                    I used Bio:seqIO to reform the reference fasta file with defaults line wrapping of 60bp.

                    I am not sure that's the reason for the error msg.
                    Last edited by glacierbird; 07-14-2010, 04:34 AM.

                    Comment


                    • #11
                      6
                      M+36M=72, that's why I want to completely delete 536870911M
                      Then you also need to remove the part with N, I think. Just give it a try.

                      I did modify the reference fasta file. becasue of samtools.
                      samtools faidx fasta_file.fa
                      [fai_build_core] line length exceeds 65535 in sequence

                      I used Bio:seqIO to reform the reference fasta file with defaults line wrapping of 60bp.

                      I am not sure that's the reason for the error msg.
                      The line wrapping should not make a difference, since the sequences should be converted into internal file formats when generating a database or index. Maybe this is a restriction from the C code for parsing lines, but not sure.
                      The only reason why the reformating might have changed anything is if your file contained dos/win line breaks instead of unix ones and the program does not check for it.

                      Comment


                      • #12
                        Originally posted by lh3 View Post
                        I am afraid that I cannot fix that soon. If you would like to have a look, you may check the fix_cigar() function in bwtsw2_aux.c. This function trims the part of cigar standing out of the reference sequence. It is moderately standalone.
                        I made some changes/fixes to several files related to the BWA-SW part and was wondering if you could give me permission to commit my changes to your sourceforge SVN. Please send me a privat message if you need more details about the changes. Thanks!

                        Comment


                        • #13
                          Originally posted by glacierbird View Post
                          I have the same problem with CIGAR by using TOPHAT

                          tophat -p 4 --solexa1.3-quals -o directory index fastq

                          there are several lines cause error msg when I want to convert .sam to .bam

                          samtools view -bS -t index.fa.fai -o tmp.bam accepted_hits.sam

                          Parse error at line 2103: CIGAR and sequence length are inconsistent

                          And this line 2103 looks like this:

                          HWI-EAS210R:5:55:1018:1930#0 0 scaffold00002 428427 1 36M92N36M473N536870911M * 0 0 CTCATCTCTCATCTCAGAGAGGTCCTCCAGGAGCAGGAATATAAGTTGGATCAAGTACGCAATCATCTTCA 992;93?5B=<B=;7;?;:988:146.4429;50?;;17@=A@6=7::<8?(=4+45535659'5525:74 NM:i:0 XS:A:+ NS:i:0
                          I'm having a very similar issue with my SAM file produced from a BWA mapping.
                          Whenever there there is a more complex CIGAR string put out the quality score string is corrupted or missing, resulting in a "sequence and quality are inconsistent" error when attempting to convert to BAM.

                          I've been removing the offending mapping lines but this means repeatedly editing a large file (500Mb), which takes considerable time (e.g. using head/tail).

                          Comment


                          • #14
                            Does anyone know whether this issue has been resolved in newer versions of BWA?

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            8 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            8 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            49 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            66 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X