Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How BWA handles mismatches?

    Hi everyone,

    I have 5 million raw reads (76 bp read length) per sample from Illumina platform. Now I am using BWA to align these reads to reference human genome.

    After build up the index for the human genome, I used the following bwa commands to align short reads to human genome:

    ~/…/BWA/bwa-0.5.5/bwa aln -t 30 -M 7 hs_ref.fa reads/index3.fq > 020310_bwa_m7/aln_index3.sai

    ~/…/BWA/bwa-0.5.5/bwa samse hs_ref.fa 020310_bwa_m7/aln_index3.sai reads/index3.fq > 020310_bwa_m7/index3bwa.sam

    I used parameter –M 7 in order to allow 7 mismatches in alignment, but it seemed not work. I had the same results as one when I used default –M.

    If anyone could tell me how BWA handles the mismatches and allow higher number of mismatches. I did not get it from the manual.

    Many thanks beforehand!

  • #2
    Hi all,

    Although nobody’s replied my post yet, I like to share some testing results of using different parameters of BWA, maybe this could be helpful for somebody or somebody could help me with these inputs.

    The purpose of my testing is to allow more mismatches to see if I could have more alignments (particularly alignments with repeats) in human reference genome. I modified parameters with 6 different combinations in BWA, surprised to me that I had very similar results: 49% unique alignments, ~4% multiple alignments, and about 47% reads failed to align.

    The combination I used for tests are as below:
    -M 1
    -k 6
    -k6 -l32 -m1
    -n6 -l32 -m1
    -l32 –k20 –m1 (for this test, I liked to go extreme on –k to see what happened, however, it turned out with nothing changed)

    I took a look at the unaligned reads. Some could be aligned by BLAT although some were not. Some of ones that could be aligned by BLAT have repeat markers. It seems I do lost some true alignments. I am wondering why I could not have these true alignments using BWA… Any help would be appreciated if you have a clue!
    Last edited by jlmlj; 02-08-2010, 09:15 AM.

    Comment


    • #3
      try

      bwa aln -n 7 -l 1000000

      This will be very slow.

      Comment


      • #4
        Originally posted by lh3 View Post
        try

        bwa aln -n 7 -l 1000000

        This will be very slow.
        Thank you so much, very excided to get the feedback from the author of this beautiful software! I am going to try it now.

        I know -n is the max number of differences (mismatches + gaps) for the whole read length, and -l is to take the first INT as seed. However, why you set INT for -l so large, like "1000000"? Thanks in advance for the explanation!

        updates:
        I have run your parameters for 20mins, it seems the progress is very very slow: it's been staying at the process of the first step:
        [bwa_aln_core] calculate SA coordinate... (I only have 1 line for the progress)
        And it's used up all 30 nodes on our cluster. So I am thinking if it is possbile to decrease a bit the number for -l...
        Thanks!
        Last edited by jlmlj; 02-05-2010, 02:15 PM.

        Comment


        • #5
          -l 10000 effectively disables seeding. You may try "aln -n 5". But for reads with low quality, bwa may be very slow. Its algorithm is not designed for this case.

          Comment


          • #6
            Originally posted by lh3 View Post
            -l 10000 effectively disables seeding. You may try "aln -n 5". But for reads with low quality, bwa may be very slow. Its algorithm is not designed for this case.
            Hi lh3,

            Thank you very much for the reply! So in this test, I disable the seed, BWA allows 7 mismatches for the total 75 read length, even for those low-quality bases, am I correct?

            The test has done, it took ~49hrs with 30-node cluster. However I still have results very similar to what I had in previous tests, which means I have 48% reads failed to align to anything in the human reference genome. (I counted "XT:A:U" as unique matches, and "XT:A:R" as repeat matches in the output SAM files).

            The results confuse me a lot: we should have much more repeat matches in the human genome. I am trying to figure out what unaligned reads are? It would be appreciated very much for any suggetion!
            Last edited by jlmlj; 02-08-2010, 09:17 AM.

            Comment


            • #7
              Dear jlmlj,

              I used the parameters (bwa aln -n 7 -l 1000000) and I was able to align a read that had 5 mismatches to the reference. Running bwa on the default settings didn't report this alignment. So perhaps you can try taking one or two individual unaligned reads and do your tests again? Just a suggestion, if you haven't already done this.

              As a more general note, I'm new to next-gen sequencing so I'd just like to point out something I found out. When I was looking at the sam file for this alignment, the CIGAR string was 27M and that looked like a mistake to me because I knew there were mismatches in the alignment. So I looked up the documentation, and found out that the "M" can be a sequence match or mismatch. It wasn't intuitive to me, so just thought I'd point it out.

              Cheers,

              Dave

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Advancing Precision Medicine for Rare Diseases in Children
                by seqadmin




                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                12-16-2024, 07:57 AM
              • seqadmin
                Recent Advances in Sequencing Technologies
                by seqadmin



                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                Long-Read Sequencing
                Long-read sequencing has seen remarkable advancements,...
                12-02-2024, 01:49 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 12-17-2024, 10:28 AM
              0 responses
              33 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-13-2024, 08:24 AM
              0 responses
              49 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-12-2024, 07:41 AM
              0 responses
              34 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-11-2024, 07:45 AM
              0 responses
              46 views
              0 likes
              Last Post seqadmin  
              Working...
              X