  • jlmlj
    Junior Member
    • Dec 2009
    • 7

    How BWA handles mismatches?

    Hi everyone,

    I have 5 million raw reads (76 bp each) per sample from an Illumina platform, and I am using BWA to align them to the human reference genome.

    After building the index for the human genome, I used the following bwa commands to align the short reads:

    ~/…/BWA/bwa-0.5.5/bwa aln -t 30 -M 7 hs_ref.fa reads/index3.fq > 020310_bwa_m7/aln_index3.sai

    ~/…/BWA/bwa-0.5.5/bwa samse hs_ref.fa 020310_bwa_m7/aln_index3.sai reads/index3.fq > 020310_bwa_m7/index3bwa.sam

    I used the parameter -M 7 in order to allow 7 mismatches in the alignment, but it did not seem to work: I got the same results as with the default -M.

    Could anyone tell me how BWA handles mismatches and how to allow a higher number of them? I could not work it out from the manual.

    Many thanks in advance!
  • jlmlj
    Junior Member
    • Dec 2009
    • 7

    #2
    Hi all,

    Although nobody has replied to my post yet, I would like to share some results from testing different BWA parameters. Maybe they will be helpful to somebody, or somebody can help me based on them.

    The purpose of my testing was to allow more mismatches and see whether I would get more alignments (particularly alignments to repeats) in the human reference genome. I modified the parameters in 6 different combinations and, to my surprise, got very similar results each time: 49% unique alignments, ~4% multiple alignments, and about 47% of reads failing to align.

    The combinations I used for the tests are below:
    -M 1
    -k 6
    -k6 -l32 -m1
    -n6 -l32 -m1
    -l32 -k20 -m1 (for this test, I wanted to push -k to an extreme to see what would happen, but nothing changed)

    I took a look at the unaligned reads. Some could be aligned by BLAT and some could not. Some of the ones that BLAT aligned have repeat markers. It seems I am losing some true alignments, and I am wondering why I cannot get them with BWA. Any help would be appreciated if you have a clue!
    Last edited by jlmlj; 02-08-2010, 09:15 AM.


    • lh3
      Senior Member
      • Feb 2008
      • 686

      #3
      try

      bwa aln -n 7 -l 1000000

      This will be very slow.


      • jlmlj
        Junior Member
        • Dec 2009
        • 7

        #4
        Originally posted by lh3
        try

        bwa aln -n 7 -l 1000000

        This will be very slow.
        Thank you so much! I am very excited to get feedback from the author of this beautiful software. I am going to try it now.

        I know -n is the maximum number of differences (mismatches + gaps) over the whole read, and -l takes the first INT bases as the seed. However, why did you set INT for -l so large, like "1000000"? Thanks in advance for the explanation!

        updates:
        I have been running with your parameters for 20 minutes and progress seems very slow. It is still on the first step:
        [bwa_aln_core] calculate SA coordinate... (I only have 1 line for the progress)
        And it has used up all 30 nodes on our cluster. So I am wondering whether it is possible to decrease the value for -l a bit...
        Thanks!
        Last edited by jlmlj; 02-05-2010, 02:15 PM.


        • lh3
          Senior Member
          • Feb 2008
          • 686

          #5
          -l 10000 effectively disables seeding. You may try "aln -n 5". But for reads with low quality, bwa may be very slow. Its algorithm is not designed for this case.
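
For context on what "-n 5" changes: the bwa manual describes a fractional -n (default 0.04) as the fraction of missing alignments tolerated given a 2% uniform base error rate, with the maximum edit distance then chosen per read length. The sketch below reproduces that calculation under the assumption that errors are modeled as Poisson with mean read_length * 0.02; the function name and the Poisson reading are illustrative assumptions, not bwa's code.

```python
import math

def max_diff(read_len, err=0.02, thres=0.04):
    """Smallest k with P(more than k errors) < thres, errors ~ Poisson(read_len * err).

    Assumed model of how a fractional -n maps to an edit-distance cap.
    """
    lam = read_len * err
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for k in range(1, 1000):
        term *= lam / k        # P(X = k) derived from P(X = k - 1)
        cdf += term
        if 1.0 - cdf < thres:
            return k
    return 2                   # fallback, not expected to be reached

print(max_diff(76))  # 4 under these assumptions
```

If this model holds, 76 bp reads get a cap of 4 differences by default, so "-n 5" raises it by one and "-n 7" by three.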


          • jlmlj
            Junior Member
            • Dec 2009
            • 7

            #6
            Originally posted by lh3
            -l 10000 effectively disables seeding. You may try "aln -n 5". But for reads with low quality, bwa may be very slow. Its algorithm is not designed for this case.
            Hi lh3,

            Thank you very much for the reply! So in this test, with seeding disabled, BWA allows 7 mismatches over the whole 75 bp read, even at low-quality bases. Am I correct?

            The test has finished; it took ~49 hrs on a 30-node cluster. However, the results are still very similar to those of my previous tests: 48% of reads failed to align to anything in the human reference genome. (I counted "XT:A:U" as unique matches and "XT:A:R" as repeat matches in the output SAM files.)

            The results confuse me a lot: we should have many more repeat matches in the human genome. I am trying to figure out what the unaligned reads are. Any suggestions would be very much appreciated!
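
The counting described above (XT:A:U as unique, XT:A:R as repeat, everything else unaligned) could be sketched roughly as follows. It assumes tab-separated SAM records with the FLAG in column 2 and optional tags from column 12 onward; the function name and the demo records are hypothetical.

```python
from collections import Counter

def tally_xt(sam_lines):
    """Count reads by XT:A value, treating FLAG bit 0x4 as unmapped."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:          # 0x4 = read unmapped
            counts["unmapped"] += 1
            continue
        # Optional tags start at column 12; take the XT:A value if present.
        xt = next((f[5:] for f in fields[11:] if f.startswith("XT:A:")), "no_XT")
        counts[xt] += 1
    return counts

# Hypothetical records, for illustration only.
demo = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U\tNM:i:0",
    "r2\t16\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:R\tNM:i:1",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
counts = tally_xt(demo)
print(dict(counts))
```

On a real file the same function can be fed an open file handle, for example `tally_xt(open("index3bwa.sam"))`.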
            Last edited by jlmlj; 02-08-2010, 09:17 AM.


            • davetang
              Member
              • Jul 2010
              • 11

              #7
              Dear jlmlj,

              I used the parameters (bwa aln -n 7 -l 1000000) and was able to align a read that had 5 mismatches to the reference. Running bwa with the default settings didn't report this alignment. So perhaps you could take one or two individual unaligned reads and run your tests again? Just a suggestion, if you haven't already done this.
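
One way to pull a couple of unaligned reads out of a SAM file for that kind of single-read test is sketched below. It assumes single-end output in which unmapped reads (FLAG bit 0x4) keep their original sequence and qualities in columns 10 and 11; the function name and the demo records are hypothetical.

```python
def unaligned_to_fastq(sam_lines, limit=2):
    """Return up to `limit` unmapped reads as FASTQ text."""
    records = []
    for line in sam_lines:
        if line.startswith("@"):              # skip header lines
            continue
        f = line.rstrip("\n").split("\t")
        if int(f[1]) & 0x4:                   # 0x4 = read unmapped
            # name, sequence, separator, qualities
            records.append(f"@{f[0]}\n{f[9]}\n+\n{f[10]}\n")
            if len(records) >= limit:
                break
    return "".join(records)

# Hypothetical records, for illustration only.
demo = [
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCA\tHHHH",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTAG\tGGGG",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tCCCC\tFFFF",
]
fastq = unaligned_to_fastq(demo)
print(fastq, end="")
```

The resulting FASTQ text can be written to a small file and passed back through bwa aln with different parameters, or submitted to BLAT for comparison.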

              As a more general note, I'm new to next-gen sequencing, so I'd just like to point out something I found out. When I was looking at the SAM file for this alignment, the CIGAR string was 27M, and that looked like a mistake to me because I knew there were mismatches in the alignment. So I looked up the documentation and found out that "M" can be a sequence match or a mismatch. It wasn't intuitive to me, so I just thought I'd point it out.
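
Because "M" covers both matches and mismatches, the mismatch count has to come from elsewhere, for example the NM or MD optional tags when the aligner writes them. A minimal sketch of reading mismatches out of an MD string follows; the helper names and the MD value are hypothetical, chosen to be consistent with a 27M alignment carrying 5 mismatches.

```python
import re

def cigar_aligned_length(cigar):
    """Number of read bases aligned to the reference (M, = and X operations)."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "M=X")

def md_mismatches(md):
    """Count mismatched reference bases in an MD string.

    Tokens are match-run lengths, deletions (^ followed by bases), or a
    single mismatched base; only the last kind counts as a mismatch.
    """
    tokens = re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md)
    return sum(1 for tok in tokens if tok[0].isalpha())

print(cigar_aligned_length("27M"))   # 27
print(md_mismatches("2A3C4G5T6A2"))  # 5
```

Here the CIGAR alone says only that 27 bases are aligned; the MD string is what shows that 5 of them differ from the reference.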

              Cheers,

              Dave
