Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How can i set the "mismatch" parameter in bwa

    Hello everyone
    I want to set the number of mismatch not large than 2, when I map my short reads (75bp) to reference genome sequence with bwa, how should i set in "bwa aln".

    thanks!

  • #2
    I believe you use bwa aln -n 2

    Comment


    • #3
      thank sowmyai for the reply!

      Comment


      • #4
        Originally posted by sowmyai View Post
        I believe you use bwa aln -n 2

        http://bio-bwa.sourceforge.net/bwa.shtml
        How can it be -n 2? In the manual:
        -n NUM Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
        and
        -M INT Mismatch penalty. BWA will not search for suboptimal hits with a score lower than (bestScore-misMsc). [3]
        so should it be bwa aln -M 2???

        D.

        Comment


        • #5
          aln -v 2 ...?

          Comment


          • #6
            Originally posted by Chipper View Post
            aln -v 2 ...?
            Where do you get that -v 2? I dont see -v option in the list
            Code:
            bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > <out.sai>

            Comment


            • #7
              Originally posted by dukevn View Post
              Where do you get that -v 2? I dont see -v option in the list
              Maybe he was confused between BWA and Bowtie; Bowtie uses -v to set the maximum number of mismatches.
              Last edited by Fedde; 11-19-2010, 05:24 AM. Reason: Fixing an incorrect statement

              Comment


              • #8
                Surely its -k 2 though this is set as standard.

                -k INT maximum differences in the seed [2]

                Comment


                • #9
                  it is -k. Set the seed length using -l (or leave to default)

                  Comment


                  • #10
                    Originally posted by kris2000 View Post
                    Surely its -k 2 though this is set as standard.

                    -k INT maximum differences in the seed [2]
                    Originally posted by Bruins View Post
                    it is -k. Set the seed length using -l (or leave to default)
                    I think this is not fully correct, using -k you usually only check the first n bases of a read, where n is set with -l. The following text is from the man page of bwa version 0.5.8:

                    Code:
                                  -l INT    Take the first INT subsequence  as  seed.  If  INT  is
                                            larger  than  the query sequence, seeding will be dis‐
                                            abled. For long reads, this option is typically ranged
                                            from 25 to 35 for `-k 2'. [inf]
                    
                                  -k INT    Maximum edit distance in the seed [2]
                    If I understand it correctly, this means that the seed length is infinity by default, which is always larger than the read length, so seeding is disabled and -k does not have any effect.

                    EDIT: I did not understand it correctly there: see post #13 on this thread.

                    This is from the same man page, a little higher:

                    Code:
                                  -n NUM    Maximum edit distance if the  value  is  INT,  or  the
                                            fraction  of  missing alignments given 2% uniform base
                                            error rate if FLOAT. In the latter case,  the  maximum
                                            edit  distance  is  automatically chosen for different
                                            read lengths. [0.04]
                    Edit distance is the number of operations required to change one of the strings (sequences) to the other one, right? So if you would use an integer as the argument to -n, you would set that as the maximum number of mismatches/gaps, if I understand it correctly after reading the manual and parts of the article (no relevant parts of the source).

                    In the description of the option -e, which I think is also relevant, it says `k-difference mode':

                    Code:
                                  -e INT    Maximum number of gap extensions, -1 for  k-difference
                                            mode (disallowing long gaps) [-1]
                    I think this refers to the argument to -n, because I see a variable called k being used on page 4 (or 1757) of the article:
                    2.7.3 Determining the allowed maximum number of differences Given a read of length m, BWA only tolerates a hit with at most k differences (mismatches or gaps), where k is chosen such that <4% of m-long reads with 2% uniform base error rate may contain differences more than k. With this configuration, for 15–37 bp reads, k equals 2; for 38–63 bp, k = 3; for 64–92 bp, k = 4; for 93–123 bp, k = 5; and for 124–156 bp reads, k = 6.
                    Last edited by Fedde; 11-19-2010, 05:49 AM. Reason: Reformulating the first paragraph for explaining more clear what I mean

                    Comment


                    • #11
                      Originally posted by Fedde View Post
                      The following text is from the man page of bwa version 0.5.8:

                      Code:
                                    -l INT    Take the first INT subsequence  as  seed.  If  INT  is
                                              larger  than  the query sequence, seeding will be dis‐
                                              abled. For long reads, this option is typically ranged
                                              from 25 to 35 for `-k 2'. [inf]
                      
                                    -k INT    Maximum edit distance in the seed [2]
                      If I understand it correctly, this means that the seed length is infinity by default, which is always larger than the read length, so seeding is disabled and -k does not have any effect.
                      I was probably wrong, the value -k seems to have an effect even if -l is left default, although I have not looked into it very extensively.

                      Comment


                      • #12
                        I am getting confused again. So what should it be: aln -k 2 or aln -n 2?

                        Comment


                        • #13
                          In BWA version 0.5.8c (r1536), the function gap_opt_t *gap_init_opt() (from bwtaln.c) contains this line:
                          Code:
                                  o->seed_len = 32; o->max_seed_diff = 2;
                          When initialising the parameters. The default seed length (-l) seems to be 32, and not something described by `inf' after all. The usage message, which actually reports the current value, also says 32:
                          Code:
                          $ bwa aln 2>&1 | grep seed
                                   -l INT    seed length [32]
                                   -k INT    maximum differences in the seed [2]
                          Also, there seems to be an off-by-one error in the seed length. When I set it to 6, a mismatch at position 6 is not counted as part of the seed, and when I set it to 7, it is.

                          In order to find reads with a maximum edit distance of 2 (both counting mismatches and gaps), without using a seed (filtering for the number of mismatches in the first n bases) use these command-line options:
                          Code:
                          $ bwa aln -n 2 -l 0
                          I think filtering only for mismatches, and not looking at the edit distance, might be unsupported.

                          Keeping the default value for -n should cause it to choose a maximum edit distance to miss only 4% of the alignments, assuming 2% uniform base error rate, and leaving -l and -k at their default values should cause it to find only alignments with no more than two mismatches in the first 31 bases, if I am not mistaken.
                          Last edited by Fedde; 11-19-2010, 05:31 AM. Reason: Making the post a little more clear

                          Comment


                          • #14
                            Originally posted by Fedde View Post
                            In BWA version 0.5.8c (r1536), the function gap_opt_t *gap_init_opt() (from bwtaln.c) contains this line:
                            Code:
                                    o->seed_len = 32; o->max_seed_diff = 2;
                            When initialising the parameters. The default seed length (-l) seems to be 32, and not something described by `inf' after all. The usage message, which actually reports the current value, also says 32:
                            Code:
                            $ bwa aln 2>&1 | grep seed
                                     -l INT    seed length [32]
                                     -k INT    maximum differences in the seed [2]
                            Also, there seems to be an off-by-one error in the seed length. When I set it to 6, a mismatch at position 6 is not counted as part of the seed, and when I set it to 7, it is.

                            In order to find reads with a maximum edit distance of 2 (both counting mismatches and gaps), without using a seed (filtering for the number of mismatches in the first n bases) use these command-line options:
                            Code:
                            $ bwa aln -n 2 -l 0
                            I think filtering only for mismatches, and not looking at the edit distance, might be unsupported.

                            Keeping the default value for -n should cause it to choose a maximum edit distance to miss only 4% of the alignments, assuming 2% uniform base error rate, and leaving -l and -k at their default values should cause it to find only alignments with no more than two mismatches in the first 31 bases, if I am not mistaken.
                            Great! Thanks for your detail answer. Unfortunately the author or some other experienced users that know the answer do not join this post to give us the confirmation. For now, I use either default or -n 2 all the time, but if I had time, I would run some tests to deeply check that.

                            Thanks,

                            D.

                            Comment


                            • #15
                              Does anybody know if BWA adjust the maximum allowable edit distance based on each read in the .fastq file?

                              As a result of adapter sequencing removal prior to alignment, we are getting a modified .fastq file that has reads of different lengths. In this case, will BWA apply a different maximum edit distance for a 36nt read than for a 86nt read?

                              BWA does print out some edit distances given different read lengths, but I wonder if BWA is selecting the edit distances dynamically or i.e. assuming that all reads are of equal length as the first read in the .fastq file.

                              Cheers!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X