I don't think anyone would dispute that aligning with 2 color errors is easy and should be fast. Nevertheless, ABI data is inherently noisy and one wishes to align with 6 or more mismatches. Many aligners scale exponentially in time with the number of mismatches, and run very slowly for larger (multi-gigabase) genomes.
-
"should be fast" but takes anywhere from 6 to 66 hours on the other alignment systems.
So many people say "our aligner is fast" but it's tough to get numbers....
The ones that used to post numbers took them off their web sites after ISAS became publicly available (before February 2009, only Applied Biosystems had access to ISAS, which they had bought for their own use).
Note that ISAS has a VA (valid adjacent) mode, which means that when you specify 2 mismatches it will search for up to 4 color substitutions and then filter the matches so that only valid-adjacent combinations are accepted: 2 legitimate SNPs (4 color substitutions), 1 legitimate SNP plus 1 sequencing error (3 color substitutions), or 2 sequencing errors (2 color substitutions). When you specify 4 substitutions in VA mode, it searches for 6 substitutions and filters out any that are invalid under the color-code rules. Chapter 5 of the user's guide is a good tutorial on the "valid adjacent" rules for SOLiD alignment.
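To make the "valid adjacent" idea concrete, here is a small toy script (my own illustration of the standard SOLiD two-base encoding, not ISAS code; the example sequence is made up): a real SNP changes exactly two adjacent colors, and both change by the same amount, while an isolated color change is a sequencing error that scrambles every downstream base when the read is decoded.
Code:
#!/usr/bin/perl
# Toy illustration of SOLiD two-base (colorspace) encoding and the
# "valid adjacent" idea. Not ISAS code; just the standard encoding.
use strict;
use warnings;

my %code = (A => 0, C => 1, G => 2, T => 3);
my @base = ('A', 'C', 'G', 'T');

# color_i is the 2-bit XOR of adjacent bases
sub encode {
    my ($seq) = @_;
    my @b = split //, $seq;
    return map { $code{ $b[$_ - 1] } ^ $code{ $b[$_] } } 1 .. $#b;
}

# decoding needs the first base, then base_i = base_{i-1} XOR color_i
sub decode {
    my ($first, @colors) = @_;
    my $cur = $code{$first};
    my $seq = $first;
    for my $c (@colors) {
        $cur ^= $c;
        $seq .= $base[$cur];
    }
    return $seq;
}

my $ref = 'ACGGTACCA';
my @ref_colors = encode($ref);

# A SNP (T -> G at position 4) flips exactly the two colors flanking that base,
# and both flip by the same XOR value.
my $snp = $ref;
substr($snp, 4, 1) = 'G';
my @snp_colors = encode($snp);
print "ref colors: @ref_colors\n";   # 1 3 0 1 3 1 0 1
print "SNP colors: @snp_colors\n";   # 1 3 0 0 2 1 0 1  (two adjacent changes)

# A single isolated color error, by contrast, garbles the rest of the decoded read.
my @err_colors = @ref_colors;
$err_colors[3] ^= 1;
print "reference:            $ref\n";
print "decoded with 1 error: ", decode('A', @err_colors), "\n";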
Comment
-
There really isn't any validity to "valid adjacent" rules. If you look at MAQ, SHRiMP, and BFAST (admittedly my own), they all implement some type of dynamic programming or shortest-path solution that considers any number of color errors and SNPs (and sometimes, in the case of BFAST and SHRiMP, indels). The rules are just special cases of these general algorithms, and they don't consider all cases, thus producing pathological errors that miss relevant biological variants.
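For the curious, here is a toy, gapless sketch of the kind of dynamic program I mean (my own throwaway illustration, not actual BFAST, SHRiMP, or MAQ code; the penalties and example read are made up): it walks along the reference keeping one score per possible decoded base, charging separate penalties for color errors and for SNPs, so any mix of the two is considered without a fixed "valid adjacent" rule.
Code:
#!/usr/bin/perl
# Toy gapless colorspace alignment scorer. Not production code from any aligner.
use strict;
use warnings;
use List::Util qw(min);

my %code = (A => 0, C => 1, G => 2, T => 3);
my ($COLOR_ERR, $SNP) = (3, 2);    # toy penalties
my $INF = 1e9;

# $ref is the reference segment the read is laid over; $primer is the read's
# known first base; @colors are the observed color calls.
sub colorspace_score {
    my ($ref, $primer, @colors) = @_;
    my @ref_b = map { $code{$_} } split //, $ref;

    # state = which base the read truly has at the current reference position
    my @score = ($INF) x 4;
    $score[ $code{$primer} ] = ($code{$primer} == $ref_b[0]) ? 0 : $SNP;

    for my $i (0 .. $#colors) {
        my @next = ($INF) x 4;
        for my $prev (0 .. 3) {
            next if $score[$prev] >= $INF;
            for my $cur (0 .. 3) {
                my $s = $score[$prev];
                $s += $COLOR_ERR if ($prev ^ $cur) != $colors[$i]; # observed color disagrees
                $s += $SNP       if $cur != $ref_b[$i + 1];        # decoded base disagrees with reference
                $next[$cur] = $s if $s < $next[$cur];
            }
        }
        @score = @next;
    }
    return min(@score);
}

# A perfect read scores 0; flip any one color or base and the score reflects
# the cheapest consistent explanation (color error vs. SNP).
print colorspace_score('ACGGTACCA', 'A', 1, 3, 0, 1, 3, 1, 0, 1), "\n";
Real implementations layer more on top of this (base qualities, indel states), but the point stands: the general algorithm subsumes the special-case rules.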
While I agree it is great that aligner X can state that it aligns Y reads in Z minutes/hours/days under the easiest of requirements (a few mismatches or color errors), that is not a valid comparison, since:
1. How does your algorithm scale to more mismatches and color errors?
2. How does your algorithm scale to longer reads (50bp and 75bp mate-pair ends)? These data exist!
3. How do you handle insertions and deletions in the alignment of one end of a read, for example to find 1-10bp indels?
These are the requirements I make sure to meet.
For example, if the (color) error rate is ~10%, then for longer reads you will need more error tolerance (75bp data would require tolerance for ~7.5 color errors on average). You throw out 60-80% or more of the data if you ignore these cases.
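As a rough back-of-the-envelope check (a toy model assuming independent color errors, not a measurement from any particular aligner), here is what a fixed mismatch tolerance does to 75-color reads at a ~10% per-color error rate:
Code:
#!/usr/bin/perl
# Fraction of 75-color reads surviving a fixed error tolerance,
# assuming independent errors: Binomial(75, 0.1). Toy model only.
use strict;
use warnings;

my ($n, $p) = (75, 0.10);

sub choose {
    my ($n, $k) = @_;
    my $r = 1;
    for my $i (1 .. $k) {
        $r *= ($n - $i + 1) / $i;
    }
    return $r;
}

# P(number of color errors <= $k)
sub p_at_most {
    my ($n, $p, $k) = @_;
    my $total = 0;
    for my $i (0 .. $k) {
        $total += choose($n, $i) * $p**$i * (1 - $p)**($n - $i);
    }
    return $total;
}

for my $k (2, 4, 6, 8, 10) {
    printf "tolerance %2d color errors: %5.1f%% of 75bp reads retained\n",
        $k, 100 * p_at_most($n, $p, $k);
}
Under this simple model a tolerance of 2 keeps only about 2% of the reads, a tolerance of 6 keeps roughly a third, and even a tolerance of 8 keeps only about two-thirds.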
Comment
-
Those are good questions.
We had to add some special algorithms for SOLiD:
We allow more mismatches towards the end of the sequence, since SOLiD reads get quite noisy towards the end, and our speed goes up when you increase the number of mismatches.
The VA rule statistically works well. It's the whole premise behind the SOLiD machine.
The tutorial explains it clearly, so even non-mathematicians can understand it well.
Of course one can always overcomplicate things, usually degrading results.
I'll try to find time to come back tomorrow to see if you can post some actual numbers.
Until then, we will have to continue to accept "100M 25mers 2 mismatches 3G human on 1 machine in 30 minutes" as the world record, I suppose.
Comment
-
Originally posted by BioWizard: The VA rule statistically works well. It's the whole premise behind the SOLiD machine.
The tutorial explains it clearly, so even non-mathematicians can understand it well.
Of course one can always overcomplicate things, usually degrading results.
Originally posted by BioWizard: Until then, we will have to continue to accept "100M 25mers 2 mismatches 3G human on 1 machine in 30 minutes" as the world record, I suppose.
Code:
#!/usr/bin/perl
while (defined(my $a = <STDIN>)) {
    chomp($a);
    print STDOUT "$a=>hg18 chr1 pos1\n";
}
My main complaint, and this is directed to the alignment community in general, is that proper context is most often not given. There are many subtleties involved in alignment that require serious thought when comparing alignment algorithms, none of which are generally presented (especially in "X reads in Y minutes" statements).
Comment
-
Originally posted by BioWizard: Those are good questions.
We had to add some special algorithms for SOLiD:
We allow more mismatches towards the end of the sequence, since SOLiD reads get quite noisy towards the end, and our speed goes up when you increase the number of mismatches.
The VA rule statistically works well. It's the whole premise behind the SOLiD machine.
The tutorial explains it clearly, so even non-mathematicians can understand it well.
Of course one can always overcomplicate things, usually degrading results.
I'll try to find time to come back tomorrow to see if you can post some actual numbers.
Until then, we will have to continue to accept "100M 25mers 2 mismatches 3G human on 1 machine in 30 minutes" as the world record, I suppose.
I ask that you please take care with your tone towards other members, particularly those who write free tools for the community.
I understand your desire to promote your software, and I encourage you to do so with data and evidence that can be independently replicated by the community. SEQanswers will not tolerate "challenges" or lavish marketing claims from commercial users who do not make their software freely available for reciprocal testing.
Comment
-
Your free software is wonderful, an amazing piece of work.... but
I'm still asking my naive question:
How many hours for 100 million 25mers in COLORSPACE with 2 mismatches, on one computer, against a 3G human reference?
Sorry if this is considered "not nice tone"... I thought this forum was meant for exchanging honest information among scientists who are all thirsty for the truth.
Anyway, this is the last time I will ask, since I never knew such a simple question would not be permitted in a scientific community.
Also, I had posted results on some data files that another scientist on the Bioinformatics forum was kind enough to refer us all to (he said it was "1000 Genomes" data), so everyone can verify. The calling rate was great - the data was much cleaner than what we're used to running. Our "sensitivity" is as high as it can be (without introducing bogus matches), judging from all the empirical tests done (and that is the only test I trust).
We do live demos for the public at trade shows (everyone is invited to San Diego to see for themselves). It's true that we cannot give away free products, as we don't have the taxpayer paying our salaries every month... at least not yet - anyone know how to get some of Obama's bailout package?
But seriously, ECO is the owner of this web site, it seems - if you don't want us here providing the scientific community fresh, relevant information, if you want them to only know about the free software that doesn't make "lavish claims" (how about "bowtie ULTRA fast ULTRA compact"? No, that's not lavish, it's free)...
we'll be happy to leave.
Comment
-
Hi
I recently did a single-CPU matching analysis using corona_lite v4.
My machine is a dual-core Intel Core 2 Duo E6550 @ 2.33 GHz
8 GB RAM + 1.7 GB swap space
openSUSE 11.0, 64-bit
My F3 and R3 .csfasta files had ~18.7 and ~18.5 million reads, respectively. We were looking for 4 errors, allowing for 200 bases between mates, comparing to a single 100 million base pair chromosome section, and permitting indel detection.
The run was bound to a single CPU from start to finish and the RAM was actively being used - very little swapping was seen on the machine. Overall it took 70 hours. I've seen similar performance times from MAQ; I haven't tried this with BLAT or Bowtie yet.
The moral of the story is: use an appropriate machine for next-gen data analysis.
Comment
-
Originally posted by BioWizard: I'm still asking my naive question:
How many hours for 100 million 25mers in COLORSPACE with 2 mismatches, on one computer, against a 3G human reference?
Internally, we generate only 50mer color space reads. I suppose I could cut off the last 25 colors (though this wouldn't reflect true error rates). I would be happy to test your software and compare it to BFAST as well as other aligners (I have a pipeline for most of them: BFAST, MAQ, BWA, Bowtie, SOAP, etc.). Send me a link to some data. How do you measure the quality of your alignments? How does your software's sensitivity scale to 50mers? Will you be able to support 6 or more color errors in 50mers, and 8 or more color errors in 75mers (~10% error rate)? How do you deal with indels, and what is your power to find indels of various lengths (1-25bp)? How do you judge the quality of your alignments (hopefully not % of reads mapped)? What "rules" or algorithm do you use to find the color errors and detect variants in each alignment? The local alignment step is the most expensive step of color space alignment, since it involves a larger search space (in the Illumina world, the most frequent error mode is mismatches or SNPs). This step can be improperly performed using various short-cuts (rule-based approaches), which dramatically reduce alignment quality but also reduce computational cost.
Comment
-
Our output is the same as that of ABI's alignment software, so we don't have to worry about "is it too sensitive" or "not sensitive enough". We let ABI scientists spend years (and millions of their R&D dollars) worrying about that. We only worry about making it the fastest in the world (by orders of magnitude).
I have devoted some time in an attempt to shed some light for the sequencing community; I'm afraid I am getting so overloaded with customers (priority one) that I cannot afford much "pro bono" work, so regretfully I will be visiting this web site less often (when the boss isn't around), as much of a pleasure as it has been. I have also stopped refereeing papers (man, you can spend whole weeks on some papers if you want to do an honest job, and all you get in the end is your boss complaining that you didn't get any real work done... plus a modest "thank you" from the editors, followed by another paper to review).
Cheers,
BioWizard
Comment