I have an RNA data set with a read length of 76 bp. I want to allow more mismatches when aligning with BWA. How many mismatches does BWA allow with the default settings, and which parameter(s) should I change if I want to allow, e.g., twice as many mismatches? I have been playing around with aln -n, -l and -M, without any success.
-
Are you aligning against a transcript database? If not, you might consider using a splice-aware aligner like TopHat or STAR.
Comment
-
If you are certain that BWA is your only option, the parameters are pretty clear:
Options:
  -n NUM  max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
  -o INT  maximum number or fraction of gap opens [1]
  -e INT  maximum number of gap extensions, -1 for disabling long gaps [-1]
  -i INT  do not put an indel within INT bp towards the ends [5]
  -d INT  maximum occurrences for extending a long deletion [10]
  -l INT  seed length [32]
  -k INT  maximum differences in the seed [2]
  -M INT  mismatch penalty [3]
  -O INT  gap open penalty [11]
  -E INT  gap extension penalty [4]
  -L      log-scaled gap penalty for long deletions
As far as I understand, allowing more mismatches (-n) cannot result in fewer reads being aligned.
Comment
-
Thanks, folks. Yes, I am almost 100 percent sure that BWA is my only option. However, I am really a newbie to BWA, so I'm not sure that I understand your post. Most of the parameter settings that you list are defaults, right?
E.g. -n is 0.04 by default, and I thought this was one of the parameters I should change to let BWA align with more mismatches? Sorry, but can you explain again which parameters are defaults and which parameters I should change?
Comment
-
Defaults are given in brackets [].
For starters, I would disable gapped alignment (-o 0), keep the seed at its default length with two mismatches (-l 32, -k 2), and try several values for the overall number of mismatches (e.g. -n 3 to 6). A higher -n should give you more aligned reads.
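The sweep suggested above can be written out as a small script. This is only a sketch: the index prefix `ref.fa`, the FASTQ name `reads.fq`, the output names and the helper function are placeholders of mine, not from this thread. The flags are the ones in the `bwa aln` help quoted earlier. The script only prints the command lines; it does not run BWA.

```python
import shlex


def bwa_aln_command(prefix, fastq, max_diff, seed_len=32, seed_diff=2, gap_opens=0):
    """Build a `bwa aln` argument list (hypothetical helper, not part of BWA)."""
    return [
        "bwa", "aln",
        "-o", str(int(gap_opens)),   # 0 gap opens: no gapped alignment
        "-l", str(int(seed_len)),    # seed length
        "-k", str(int(seed_diff)),   # differences allowed inside the seed
        # Keep -n an integer: per the help text, a float is read as a
        # missing probability rather than a mismatch count.
        "-n", str(int(max_diff)),
        prefix, fastq,
    ]


# One command per candidate -n value (3 to 6, as suggested above).
commands = {n: bwa_aln_command("ref.fa", "reads.fq", n) for n in range(3, 7)}

for n, cmd in commands.items():
    print(shlex.join(cmd), ">", f"reads.n{n}.sai")
```

Comparing the number of mapped reads across the four outputs should show whether a higher -n actually helps for a given data set.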
Comment
-
OK, thanks. I will try the guidelines you have given me.
So I should concentrate on changing -n (the one that is set to 0.04 by default)? I will try setting it between 3 and 6. How should this parameter be set if I want to allow, e.g., twice as many mismatches per read as the default?
I have read somewhere that it is also a good idea to disable seeding by setting -l to a large value (10000) when allowing more mismatches, but I don't know if I should do this.
Comment
-
Hi Karenj,
I ran some tests.
First, if you don't set any parameters, the default value of -n depends on the read length. BWA prints it at the beginning of its output:
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
My data is 83 bp, so the default is n = 4. If I run with -n 8 or -n 16, I see more reads mapped.
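Those thresholds can be reproduced. As far as I can tell from the BWA source (`bwa_cal_maxdiff` in `bwtaln.c`), a float -n is the tolerated fraction of missed alignments, assuming a uniform 2% base error rate and a Poisson model for the number of differences per read. The Python below is my own re-implementation of that idea for illustration, not BWA code, and the names are mine.

```python
import math


def max_diff(read_len, miss_prob=0.04, err_rate=0.02):
    """Smallest k with P(more than k differences) < miss_prob,
    where differences ~ Poisson(read_len * err_rate)."""
    lam = read_len * err_rate
    term = math.exp(-lam)   # P(0 differences)
    total = term
    for k in range(1, 1000):
        term *= lam / k     # P(exactly k differences)
        total += term
        if 1.0 - total < miss_prob:
            return k
    return 2  # fallback if the bound is never met


# Print the read lengths at which the default limit changes,
# in the same style as the [bwa_aln] lines above.
prev = None
for length in range(17, 251):
    k = max_diff(length)
    if k != prev:
        print(f"{length}bp reads: max_diff = {k}")
    prev = k
```

For 76 bp reads this gives 4, so "twice the default" would be -n 8.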
Changing the seed length with -l does not seem to help: it ran about 100 times slower and mapped fewer reads. -k changes the number of mismatches allowed within the seed; setting it to a large number did not help either.
There are many more parameters you can change, e.g. -o, -e, -i, -d, -M, -O, -E, but you need to understand what each one does.
The point of BWA is to align low-error reads very fast. If you adjust any of the parameters listed above, it might align some hard reads, but the run time becomes significantly longer. You might be better off using BWA for a first round and then another tool to align the unmapped reads (as many re-aligners do).
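The two-pass idea can be sketched as follows, assuming the first BWA pass has been converted to SAM text (e.g. with `bwa samse`). In the SAM format, bit 0x4 of the FLAG column marks an unmapped read. The function names and the toy records are made up for illustration; in practice a tool such as samtools would normally do this step.

```python
def split_unmapped(sam_lines):
    """Return (mapped, unmapped) SAM records based on FLAG bit 0x4."""
    mapped, unmapped = [], []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        (unmapped if flag & 0x4 else mapped).append(line.rstrip("\n"))
    return mapped, unmapped


def to_fastq(record):
    """Turn one SAM record into a 4-line FASTQ entry for a second aligner.
    Assumes the read is stored in its original orientation (FLAG 0x10 not set)."""
    fields = record.split("\t")
    return f"@{fields[0]}\n{fields[9]}\n+\n{fields[10]}\n"


# Toy input: one header line, one mapped read, one unmapped read.
records = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
mapped, unmapped = split_unmapped(records)
print(len(mapped), "mapped,", len(unmapped), "unmapped")
print(to_fastq(unmapped[0]), end="")
```

The FASTQ written from the unmapped records would then be the input for the slower, more permissive aligner.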
Comment