Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • genelab
    Member
    • Nov 2009
    • 27

    How can i set the "mismatch" parameter in bwa

    Hello everyone
    I want to set the number of mismatch not large than 2, when I map my short reads (75bp) to reference genome sequence with bwa, how should i set in "bwa aln".

    thanks!
  • sowmyai
    Member
    • Jan 2010
    • 27

    #2
    I believe you use bwa aln -n 2

    Comment

    • genelab
      Member
      • Nov 2009
      • 27

      #3
      thank sowmyai for the reply!

      Comment

      • dukevn
        Member
        • Apr 2009
        • 50

        #4
        Originally posted by sowmyai View Post
        I believe you use bwa aln -n 2

        http://bio-bwa.sourceforge.net/bwa.shtml
        How can it be -n 2? In the manual:
        -n NUM Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
        and
        -M INT Mismatch penalty. BWA will not search for suboptimal hits with a score lower than (bestScore-misMsc). [3]
        so should it be bwa aln -M 2???

        D.

        Comment

        • Chipper
          Senior Member
          • Mar 2008
          • 323

          #5
          aln -v 2 ...?

          Comment

          • dukevn
            Member
            • Apr 2009
            • 50

            #6
            Originally posted by Chipper View Post
            aln -v 2 ...?
            Where do you get that -v 2? I dont see -v option in the list
            Code:
            bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > <out.sai>

            Comment

            • Fedde
              Junior Member
              • Sep 2010
              • 5

              #7
              Originally posted by dukevn View Post
              Where do you get that -v 2? I dont see -v option in the list
              Maybe he was confused between BWA and Bowtie; Bowtie uses -v to set the maximum number of mismatches.
              Last edited by Fedde; 11-19-2010, 05:24 AM. Reason: Fixing an incorrect statement

              Comment

              • kris2000
                Junior Member
                • Sep 2010
                • 2

                #8
                Surely its -k 2 though this is set as standard.

                -k INT maximum differences in the seed [2]

                Comment

                • Bruins
                  Member
                  • Feb 2010
                  • 78

                  #9
                  it is -k. Set the seed length using -l (or leave to default)

                  Comment

                  • Fedde
                    Junior Member
                    • Sep 2010
                    • 5

                    #10
                    Originally posted by kris2000 View Post
                    Surely its -k 2 though this is set as standard.

                    -k INT maximum differences in the seed [2]
                    Originally posted by Bruins View Post
                    it is -k. Set the seed length using -l (or leave to default)
                    I think this is not fully correct, using -k you usually only check the first n bases of a read, where n is set with -l. The following text is from the man page of bwa version 0.5.8:

                    Code:
                                  -l INT    Take the first INT subsequence  as  seed.  If  INT  is
                                            larger  than  the query sequence, seeding will be dis‐
                                            abled. For long reads, this option is typically ranged
                                            from 25 to 35 for `-k 2'. [inf]
                    
                                  -k INT    Maximum edit distance in the seed [2]
                    If I understand it correctly, this means that the seed length is infinity by default, which is always larger than the read length, so seeding is disabled and -k does not have any effect.

                    EDIT: I did not understand it correctly there: see post #13 on this thread.

                    This is from the same man page, a little higher:

                    Code:
                                  -n NUM    Maximum edit distance if the  value  is  INT,  or  the
                                            fraction  of  missing alignments given 2% uniform base
                                            error rate if FLOAT. In the latter case,  the  maximum
                                            edit  distance  is  automatically chosen for different
                                            read lengths. [0.04]
                    Edit distance is the number of operations required to change one of the strings (sequences) to the other one, right? So if you would use an integer as the argument to -n, you would set that as the maximum number of mismatches/gaps, if I understand it correctly after reading the manual and parts of the article (no relevant parts of the source).

                    In the description of the option -e, which I think is also relevant, it says `k-difference mode':

                    Code:
                                  -e INT    Maximum number of gap extensions, -1 for  k-difference
                                            mode (disallowing long gaps) [-1]
                    I think this refers to the argument to -n, because I see a variable called k being used on page 4 (or 1757) of the article:
                    2.7.3 Determining the allowed maximum number of differences Given a read of length m, BWA only tolerates a hit with at most k differences (mismatches or gaps), where k is chosen such that <4% of m-long reads with 2% uniform base error rate may contain differences more than k. With this configuration, for 15–37 bp reads, k equals 2; for 38–63 bp, k = 3; for 64–92 bp, k = 4; for 93–123 bp, k = 5; and for 124–156 bp reads, k = 6.
                    Last edited by Fedde; 11-19-2010, 05:49 AM. Reason: Reformulating the first paragraph for explaining more clear what I mean

                    Comment

                    • Fedde
                      Junior Member
                      • Sep 2010
                      • 5

                      #11
                      Originally posted by Fedde View Post
                      The following text is from the man page of bwa version 0.5.8:

                      Code:
                                    -l INT    Take the first INT subsequence  as  seed.  If  INT  is
                                              larger  than  the query sequence, seeding will be dis‐
                                              abled. For long reads, this option is typically ranged
                                              from 25 to 35 for `-k 2'. [inf]
                      
                                    -k INT    Maximum edit distance in the seed [2]
                      If I understand it correctly, this means that the seed length is infinity by default, which is always larger than the read length, so seeding is disabled and -k does not have any effect.
                      I was probably wrong, the value -k seems to have an effect even if -l is left default, although I have not looked into it very extensively.

                      Comment

                      • dukevn
                        Member
                        • Apr 2009
                        • 50

                        #12
                        I am getting confused again. So what should it be: aln -k 2 or aln -n 2?

                        Comment

                        • Fedde
                          Junior Member
                          • Sep 2010
                          • 5

                          #13
                          In BWA version 0.5.8c (r1536), the function gap_opt_t *gap_init_opt() (from bwtaln.c) contains this line:
                          Code:
                                  o->seed_len = 32; o->max_seed_diff = 2;
                          When initialising the parameters. The default seed length (-l) seems to be 32, and not something described by `inf' after all. The usage message, which actually reports the current value, also says 32:
                          Code:
                          $ bwa aln 2>&1 | grep seed
                                   -l INT    seed length [32]
                                   -k INT    maximum differences in the seed [2]
                          Also, there seems to be an off-by-one error in the seed length. When I set it to 6, a mismatch at position 6 is not counted as part of the seed, and when I set it to 7, it is.

                          In order to find reads with a maximum edit distance of 2 (both counting mismatches and gaps), without using a seed (filtering for the number of mismatches in the first n bases) use these command-line options:
                          Code:
                          $ bwa aln -n 2 -l 0
                          I think filtering only for mismatches, and not looking at the edit distance, might be unsupported.

                          Keeping the default value for -n should cause it to choose a maximum edit distance to miss only 4% of the alignments, assuming 2% uniform base error rate, and leaving -l and -k at their default values should cause it to find only alignments with no more than two mismatches in the first 31 bases, if I am not mistaken.
                          Last edited by Fedde; 11-19-2010, 05:31 AM. Reason: Making the post a little more clear

                          Comment

                          • dukevn
                            Member
                            • Apr 2009
                            • 50

                            #14
                            Originally posted by Fedde View Post
                            In BWA version 0.5.8c (r1536), the function gap_opt_t *gap_init_opt() (from bwtaln.c) contains this line:
                            Code:
                                    o->seed_len = 32; o->max_seed_diff = 2;
                            When initialising the parameters. The default seed length (-l) seems to be 32, and not something described by `inf' after all. The usage message, which actually reports the current value, also says 32:
                            Code:
                            $ bwa aln 2>&1 | grep seed
                                     -l INT    seed length [32]
                                     -k INT    maximum differences in the seed [2]
                            Also, there seems to be an off-by-one error in the seed length. When I set it to 6, a mismatch at position 6 is not counted as part of the seed, and when I set it to 7, it is.

                            In order to find reads with a maximum edit distance of 2 (both counting mismatches and gaps), without using a seed (filtering for the number of mismatches in the first n bases) use these command-line options:
                            Code:
                            $ bwa aln -n 2 -l 0
                            I think filtering only for mismatches, and not looking at the edit distance, might be unsupported.

                            Keeping the default value for -n should cause it to choose a maximum edit distance to miss only 4% of the alignments, assuming 2% uniform base error rate, and leaving -l and -k at their default values should cause it to find only alignments with no more than two mismatches in the first 31 bases, if I am not mistaken.
                            Great! Thanks for your detail answer. Unfortunately the author or some other experienced users that know the answer do not join this post to give us the confirmation. For now, I use either default or -n 2 all the time, but if I had time, I would run some tests to deeply check that.

                            Thanks,

                            D.

                            Comment

                            • Yilong Li
                              Member
                              • Dec 2010
                              • 41

                              #15
                              Does anybody know if BWA adjust the maximum allowable edit distance based on each read in the .fastq file?

                              As a result of adapter sequencing removal prior to alignment, we are getting a modified .fastq file that has reads of different lengths. In this case, will BWA apply a different maximum edit distance for a 36nt read than for a 86nt read?

                              BWA does print out some edit distances given different read lengths, but I wonder if BWA is selecting the edit distances dynamically or i.e. assuming that all reads are of equal length as the first read in the .fastq file.

                              Cheers!

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                                Here are nine questions we think about, in roughly the order they matter, before...
                                Today, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 06:09 AM
                              0 responses
                              16 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              36 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              42 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              49 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...