-
Perhaps that's how it's supposed to work... but repeats, artifacts, primers, N's, etc. seem to slow things down for me. A lot of N's - I would agree. But imagine a few... these get matched as "A or T or G or C" causing lots of potential matches, and requiring each match to be more rigorously evaluated. If you trim them from the ends, it won't hurt your alignment at all (bwa may do this in newer versions?), but you can't trim them from the middle, and they necessitate exhaustive searching of the possibilities.
Last edited by earonesty; 04-16-2011, 11:36 AM.
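For anyone who wants to try the end-trimming mentioned above, here is a minimal sketch. The function name `trim_n_ends` is made up for illustration and is not part of bwa or any other tool; it only strips N's from the two ends of a read and leaves interior N's alone.

```python
def trim_n_ends(seq, qual):
    """Strip leading and trailing N's from a read (hypothetical helper).

    The quality string is trimmed by the same amount so that the two
    stay the same length. N's in the middle of the read are left alone.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    start, end = 0, len(seq)
    # Advance past N's at the 5' end.
    while start < end and seq[start] in "Nn":
        start += 1
    # Back up over N's at the 3' end.
    while end > start and seq[end - 1] in "Nn":
        end -= 1
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    print(trim_n_ends("NNACGTNACGNN", "IIIIIIIIIIII"))
    # ('ACGTNACG', 'IIIIIIII')
```

A read made up entirely of N's comes back empty, so a caller would probably want to drop such reads before aligning.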
-
Originally posted by earonesty:
A lot of N's - I would agree. But imagine a few... these get matched as "A or T or G or C" causing lots of potential matches, and requiring each match to be more rigorously evaluated.
They do not get matched as "A or T or G or C". An N in an input sequence is automatically treated as a mismatch.
An N in the reference sequence is randomly assigned A/T/G/C.
This is from the BWA paper, and I also verified it in the code.
-
Liang Ping and Darren Peters have pointed out that the multithreading in bwa aln is not optimal. It is fine for ~10 threads, but when you push to ~20 threads, a lot of CPU time will be wasted on synchronization. They have provided a patch which has been applied to:
-
Originally posted by dp05yk:
They do not get matched as "A or T or G or C". An N in an input sequence is automatically treated as a mismatch.
An N in the reference sequence is randomly assigned A/T/G/C.
This is from the BWA paper, as well, I verified this in the code.
Unlike a nucleotide mismatch, an "N" is not a "mismatch for some and a non-mismatch for others"... it's "equally a mismatch for all possible reference bases" ... which causes "more possible equal matches".
Here's a concrete example:
You can see how "ANGGCTGC" can match many more locations with equal likelihood as opposed to "ATGGCTGC" ... which will match, on average, around 4 times fewer locations... only those with T's in the second position. The first sequence would be a poor, but still passing match for many more locations. As long as there are not enough N's to discredit a whole alignment, the aligner has to consider more possible locations as potential (albeit poor) matches.
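The counting argument can be checked with a small brute-force sketch. This is not how BWA searches (it uses a BWT index, not a linear scan); the helper names `mismatches` and `best_hits` and the toy reference are assumptions for illustration only. An N in the read is always scored as a mismatch, as described earlier in the thread.

```python
def mismatches(read, window):
    """Count mismatches; an N in the read always counts as one."""
    return sum(1 for r, g in zip(read, window) if r == "N" or r != g)


def best_hits(read, reference):
    """Return (fewest mismatches, number of positions tied at that score)."""
    best, count = None, 0
    for start in range(len(reference) - len(read) + 1):
        m = mismatches(read, reference[start:start + len(read)])
        if best is None or m < best:
            best, count = m, 1
        elif m == best:
            count += 1
    return best, count


if __name__ == "__main__":
    # Toy reference: four 8-mers that differ only at the second base.
    reference = "ATGGCTGC" + "AAGGCTGC" + "ACGGCTGC" + "AGGGCTGC"
    print(best_hits("ATGGCTGC", reference))  # (0, 1): one exact hit
    print(best_hits("ANGGCTGC", reference))  # (1, 4): four tied 1-mismatch hits
```

With the N, all four locations tie at one mismatch, while the T picks out a single exact hit. Whether those extra ties actually cost the aligner time is a separate question.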
Or maybe you're right... maybe it's not the N's... maybe it's just a high error rate, or poor alignment quality, or something else. But whenever I try aligning an "uncleaned" fastq (adapters, skewing, or especially poor-quality tails), it's always a lot slower (and a waste of time aligning stuff no one wants to see).
Perhaps (mostly to assure myself I'm not crazy) I'll cobble together an example that's easy to test if I have some time tomorrow.
Last edited by earonesty; 04-16-2011, 05:30 PM.
-
Using wgsim, it seems I was definitely wrong about N's...
Originally posted by earonesty:
You can see how "ANGGCTGC" can match many more locations with equal likelihood as opposed to "ATGGCTGC" ... which will match, on average, around 4 times fewer locations... only those with T's in the second position. The first sequence would be a poor, but still passing match for many more locations. As long as there are not enough N's to discredit a whole alignment, the aligner has to consider more possible locations as potential (albeit poor) matches.
When running with a seed length of 32:
1. Extra N's don't change things much. Too many extra N's speed things up, as sequences fail to align.
2. Increasing the error rate from .02 to .04 drops performance by 10% and 20% for the 2 seed lengths, but going much higher causes performance to increase again, as sequences fail to align.
So the error rate is probably the issue I see when cleaning up data, not N's. N's are just an artifact within the lower-quality tiles that I thought were causing issues, but weren't. I have definitely seen bwa run terribly slowly on data that wasn't cleaned up; 10-20% in simulation is nowhere near the effect I've seen. I can't seem to reproduce it, though.
Last edited by earonesty; 04-18-2011, 07:37 AM.