SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
The position file formats ".clocs" and "_pos.txt"? Ist there any difference? elgor Illumina/Solexa 0 06-27-2011 07:55 AM
How to set "CISTEMATIC_ROOT" of CISTEMATIC tujchl RNA Sequencing 2 03-22-2011 02:50 AM
Helicos: "dark nucleotides" and the coverage parameter Irina Pulyakhina Bioinformatics 0 09-21-2010 03:31 AM
"Systems biology and administration" & "Genome generation: no engineering allowed" seb567 Bioinformatics 0 05-25-2010 12:19 PM
SEQanswers second "publication": "How to map billions of short reads onto genomes" ECO Literature Watch 0 06-29-2009 11:49 PM

Reply
 
Thread Tools
Old 07-13-2010, 10:55 PM   #1
genelab
Member
 
Location: honston

Join Date: Nov 2009
Posts: 27
Default How can i set the "mismatch" parameter in bwa

Hello everyone
I want to set the number of mismatch not large than 2, when I map my short reads (75bp) to reference genome sequence with bwa, how should i set in "bwa aln".

thanks!
genelab is offline   Reply With Quote
Old 07-15-2010, 06:13 AM   #2
sowmyai
Member
 
Location: America

Join Date: Jan 2010
Posts: 27
Default

I believe you use bwa aln -n 2

http://bio-bwa.sourceforge.net/bwa.shtml
sowmyai is offline   Reply With Quote
Old 07-16-2010, 11:54 PM   #3
genelab
Member
 
Location: honston

Join Date: Nov 2009
Posts: 27
Default

thank sowmyai for the reply!
genelab is offline   Reply With Quote
Old 07-27-2010, 11:18 AM   #4
dukevn
Member
 
Location: RI

Join Date: Apr 2009
Posts: 50
Default

Quote:
Originally Posted by sowmyai View Post
I believe you use bwa aln -n 2

http://bio-bwa.sourceforge.net/bwa.shtml
How can it be -n 2? In the manual:
Quote:
-n NUM Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
and
Quote:
-M INT Mismatch penalty. BWA will not search for suboptimal hits with a score lower than (bestScore-misMsc). [3]
so should it be bwa aln -M 2???

D.
dukevn is offline   Reply With Quote
Old 07-27-2010, 11:43 AM   #5
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

aln -v 2 ...?
Chipper is offline   Reply With Quote
Old 07-27-2010, 11:54 AM   #6
dukevn
Member
 
Location: RI

Join Date: Apr 2009
Posts: 50
Default

Quote:
Originally Posted by Chipper View Post
aln -v 2 ...?
Where do you get that -v 2? I dont see -v option in the list
Code:
bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > <out.sai>
dukevn is offline   Reply With Quote
Old 09-17-2010, 04:00 AM   #7
Fedde
Junior Member
 
Location: The Netherlands

Join Date: Sep 2010
Posts: 5
Default

Quote:
Originally Posted by dukevn View Post
Where do you get that -v 2? I dont see -v option in the list
Maybe he was confused between BWA and Bowtie; Bowtie uses -v to set the maximum number of mismatches.

Last edited by Fedde; 11-19-2010 at 04:24 AM. Reason: Fixing an incorrect statement
Fedde is offline   Reply With Quote
Old 09-28-2010, 07:38 AM   #8
kris2000
Junior Member
 
Location: london

Join Date: Sep 2010
Posts: 2
Default

Surely its -k 2 though this is set as standard.

-k INT maximum differences in the seed [2]
kris2000 is offline   Reply With Quote
Old 09-29-2010, 12:04 AM   #9
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

it is -k. Set the seed length using -l (or leave to default)
Bruins is offline   Reply With Quote
Old 09-30-2010, 11:19 PM   #10
Fedde
Junior Member
 
Location: The Netherlands

Join Date: Sep 2010
Posts: 5
Default

Quote:
Originally Posted by kris2000 View Post
Surely its -k 2 though this is set as standard.

-k INT maximum differences in the seed [2]
Quote:
Originally Posted by Bruins View Post
it is -k. Set the seed length using -l (or leave to default)
I think this is not fully correct, using -k you usually only check the first n bases of a read, where n is set with -l. The following text is from the man page of bwa version 0.5.8:

Code:
              -l INT    Take the first INT subsequence  as  seed.  If  INT  is
                        larger  than  the query sequence, seeding will be dis‐
                        abled. For long reads, this option is typically ranged
                        from 25 to 35 for `-k 2'. [inf]

              -k INT    Maximum edit distance in the seed [2]
If I understand it correctly, this means that the seed length is infinity by default, which is always larger than the read length, so seeding is disabled and -k does not have any effect.

EDIT: I did not understand it correctly there: see post #13 on this thread.

This is from the same man page, a little higher:

Code:
              -n NUM    Maximum edit distance if the  value  is  INT,  or  the
                        fraction  of  missing alignments given 2% uniform base
                        error rate if FLOAT. In the latter case,  the  maximum
                        edit  distance  is  automatically chosen for different
                        read lengths. [0.04]
Edit distance is the number of operations required to change one of the strings (sequences) to the other one, right? So if you would use an integer as the argument to -n, you would set that as the maximum number of mismatches/gaps, if I understand it correctly after reading the manual and parts of the article (no relevant parts of the source).

In the description of the option -e, which I think is also relevant, it says `k-difference mode':

Code:
              -e INT    Maximum number of gap extensions, -1 for  k-difference
                        mode (disallowing long gaps) [-1]
I think this refers to the argument to -n, because I see a variable called k being used on page 4 (or 1757) of the article:
Quote:
2.7.3 Determining the allowed maximum number of differences Given a read of length m, BWA only tolerates a hit with at most k differences (mismatches or gaps), where k is chosen such that <4% of m-long reads with 2% uniform base error rate may contain differences more than k. With this configuration, for 15–37 bp reads, k equals 2; for 38–63 bp, k = 3; for 64–92 bp, k = 4; for 93–123 bp, k = 5; and for 124–156 bp reads, k = 6.

Last edited by Fedde; 11-19-2010 at 04:49 AM. Reason: Reformulating the first paragraph for explaining more clear what I mean
Fedde is offline   Reply With Quote
Old 11-18-2010, 05:49 AM   #11
Fedde
Junior Member
 
Location: The Netherlands

Join Date: Sep 2010
Posts: 5
Default

Quote:
Originally Posted by Fedde View Post
The following text is from the man page of bwa version 0.5.8:

Code:
              -l INT    Take the first INT subsequence  as  seed.  If  INT  is
                        larger  than  the query sequence, seeding will be dis‐
                        abled. For long reads, this option is typically ranged
                        from 25 to 35 for `-k 2'. [inf]

              -k INT    Maximum edit distance in the seed [2]
If I understand it correctly, this means that the seed length is infinity by default, which is always larger than the read length, so seeding is disabled and -k does not have any effect.
I was probably wrong, the value -k seems to have an effect even if -l is left default, although I have not looked into it very extensively.
Fedde is offline   Reply With Quote
Old 11-18-2010, 06:02 AM   #12
dukevn
Member
 
Location: RI

Join Date: Apr 2009
Posts: 50
Default

I am getting confused again. So what should it be: aln -k 2 or aln -n 2?
dukevn is offline   Reply With Quote
Old 11-19-2010, 03:15 AM   #13
Fedde
Junior Member
 
Location: The Netherlands

Join Date: Sep 2010
Posts: 5
Default

In BWA version 0.5.8c (r1536), the function gap_opt_t *gap_init_opt() (from bwtaln.c) contains this line:
Code:
        o->seed_len = 32; o->max_seed_diff = 2;
When initialising the parameters. The default seed length (-l) seems to be 32, and not something described by `inf' after all. The usage message, which actually reports the current value, also says 32:
Code:
$ bwa aln 2>&1 | grep seed
         -l INT    seed length [32]
         -k INT    maximum differences in the seed [2]
Also, there seems to be an off-by-one error in the seed length. When I set it to 6, a mismatch at position 6 is not counted as part of the seed, and when I set it to 7, it is.

In order to find reads with a maximum edit distance of 2 (both counting mismatches and gaps), without using a seed (filtering for the number of mismatches in the first n bases) use these command-line options:
Code:
$ bwa aln -n 2 -l 0
I think filtering only for mismatches, and not looking at the edit distance, might be unsupported.

Keeping the default value for -n should cause it to choose a maximum edit distance to miss only 4% of the alignments, assuming 2% uniform base error rate, and leaving -l and -k at their default values should cause it to find only alignments with no more than two mismatches in the first 31 bases, if I am not mistaken.

Last edited by Fedde; 11-19-2010 at 04:31 AM. Reason: Making the post a little more clear
Fedde is offline   Reply With Quote
Old 11-20-2010, 05:14 AM   #14
dukevn
Member
 
Location: RI

Join Date: Apr 2009
Posts: 50
Default

Quote:
Originally Posted by Fedde View Post
In BWA version 0.5.8c (r1536), the function gap_opt_t *gap_init_opt() (from bwtaln.c) contains this line:
Code:
        o->seed_len = 32; o->max_seed_diff = 2;
When initialising the parameters. The default seed length (-l) seems to be 32, and not something described by `inf' after all. The usage message, which actually reports the current value, also says 32:
Code:
$ bwa aln 2>&1 | grep seed
         -l INT    seed length [32]
         -k INT    maximum differences in the seed [2]
Also, there seems to be an off-by-one error in the seed length. When I set it to 6, a mismatch at position 6 is not counted as part of the seed, and when I set it to 7, it is.

In order to find reads with a maximum edit distance of 2 (both counting mismatches and gaps), without using a seed (filtering for the number of mismatches in the first n bases) use these command-line options:
Code:
$ bwa aln -n 2 -l 0
I think filtering only for mismatches, and not looking at the edit distance, might be unsupported.

Keeping the default value for -n should cause it to choose a maximum edit distance to miss only 4% of the alignments, assuming 2% uniform base error rate, and leaving -l and -k at their default values should cause it to find only alignments with no more than two mismatches in the first 31 bases, if I am not mistaken.
Great! Thanks for your detail answer. Unfortunately the author or some other experienced users that know the answer do not join this post to give us the confirmation. For now, I use either default or -n 2 all the time, but if I had time, I would run some tests to deeply check that.

Thanks,

D.
dukevn is offline   Reply With Quote
Old 01-12-2011, 02:08 PM   #15
Yilong Li
Member
 
Location: WTSI

Join Date: Dec 2010
Posts: 41
Default

Does anybody know if BWA adjust the maximum allowable edit distance based on each read in the .fastq file?

As a result of adapter sequencing removal prior to alignment, we are getting a modified .fastq file that has reads of different lengths. In this case, will BWA apply a different maximum edit distance for a 36nt read than for a 86nt read?

BWA does print out some edit distances given different read lengths, but I wonder if BWA is selecting the edit distances dynamically or i.e. assuming that all reads are of equal length as the first read in the .fastq file.

Cheers!
Yilong Li is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:16 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO