
  • #16
    I don't think anyone would dispute that aligning with 2 color errors is not difficult, and should be fast. Nevertheless, ABI data is inherently noisy, and one often wishes to align with 6 or more mismatches. Many aligners scale exponentially in time with the number of mismatches, and run very slowly on larger (multi-gigabase) genomes.
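
    To put rough numbers on that scaling, here is a back-of-the-envelope sketch (my own illustration of an exhaustive pattern-enumeration strategy, not any particular aligner's internals):

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    # An exhaustive hash-lookup strategy must consider every way to place
    # k mismatches in an n-mer: C(n,k) * 3^k variant patterns. This is why
    # an index that is fast at k=2 can fall over at k=6.
    sub patterns {
        my ($n, $k) = @_;
        my $c = 1;
        $c = $c * ($n - $_ + 1) / $_ for 1 .. $k;   # C(n,k)
        return $c * 3 ** $k;                        # 3 substitutions per site
    }

    printf "25mer, 2 mismatches: %g patterns\n", patterns(25, 2);   # 2,700
    printf "25mer, 6 mismatches: %g patterns\n", patterns(25, 6);   # ~1.3e8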

    Comment


    • #17
      "should be fast" but takes anywhere from 6 to 66 hours on the other alignment systems.
      So many people say "our aligner is fast" but its tough to get numbers....
      The ones that used to post numbers took the numbers off their web site after ISAS became publicly available (before February 2009 only Applied Biosystems had access to ISAS which they bought for their own use).

      Note that ISAS has a VA (valid adjacent) mode, which means that when you specify 2 mismatches it will allow 4, then filter the matches so only 2 valid adjacent mismatches are accepted: 2 legitimate SNPs (4 color-code substitutions), 1 legitimate SNP plus 1 sequencing error (3 color-space substitutions), or 2 sequencing errors (2 color-space substitutions). When specifying 4 substitutions in VA mode, it searches for 6 substitutions and filters out any that are invalid under the color-code rules. Chapter 5 of the user's guide is a good tutorial on "valid adjacent" rules for SOLiD alignment. (A toy check of the underlying invariant is sketched below.)
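
      To see why "valid adjacent" works at all, the invariant can be checked in a few lines (a toy illustration of two-base encoding, not ISAS code):

      Code:
      #!/usr/bin/perl
      use strict;
      use warnings;

      # Each color is the XOR of the 2-bit codes of two adjacent bases, so
      # a single SNP flips exactly two adjacent colors while preserving
      # their XOR. That is the "valid adjacent" test in its simplest form.
      sub is_valid_adjacent {
          my ($ref1, $ref2, $read1, $read2) = @_;           # colors, 0..3
          return 0 if $ref1 == $read1 || $ref2 == $read2;   # both must differ
          return ($ref1 ^ $ref2) == ($read1 ^ $read2);
      }

      # Reference colors (0,2) vs read colors (3,1): XOR is preserved, so
      # this adjacent pair is consistent with one real SNP, not two errors.
      print is_valid_adjacent(0, 2, 3, 1) ? "valid SNP\n" : "not valid\n";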

      Comment


      • #18
        There really isn't any validity to "valid adjacent" rules. If you look at MAQ, SHRiMP, and BFAST (admittedly my own), they all implement some type of dynamic programming or shortest-path solution that considers any number of color errors and SNPs (and sometimes, in the case of BFAST and SHRiMP, indels). The rules are just special cases of these general algorithms, and since they don't consider all cases, they produce pathological errors that miss relevant biological variants.
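
        To make that concrete, here is a stripped-down sketch of the kind of dynamic program I mean (substitutions only, unit costs, no indels or quality weighting; the real aligners are far more elaborate):

        Code:
        #!/usr/bin/perl
        use strict;
        use warnings;

        # State = (read position, assumed base). Each step may pay for a
        # color error, a SNP, both, or neither, so every mix of the two is
        # explored; "valid adjacent" patterns fall out as special cases.
        my @base = qw(A C G T);   # 2-bit codes 0..3; color = XOR of codes

        sub min_errors {
            my ($ref, $colors, $primer) = @_;   # ref bases, colors arrayref,
            my @r   = split //, $ref;           # and the known primer base
            my $INF = 1_000_000;
            my @dp  = map { $base[$_] eq $primer ? 0 : $INF } 0 .. 3;
            for my $i (0 .. $#$colors) {
                my @next = ($INF) x 4;
                for my $b (0 .. 3) {
                    next if $dp[$b] >= $INF;
                    for my $b2 (0 .. 3) {
                        my $cost = $dp[$b]
                            + (($b ^ $b2) == $colors->[$i] ? 0 : 1)  # color error?
                            + ($base[$b2] eq $r[$i + 1]    ? 0 : 1); # SNP vs ref?
                        $next[$b2] = $cost if $cost < $next[$b2];
                    }
                }
                @dp = @next;
            }
            my ($best) = sort { $a <=> $b } @dp;
            return $best;   # minimum (color errors + SNPs) explaining the read
        }

        print min_errors("TACGG", [3, 1, 3, 0], "T"), "\n";   # 0: a perfect read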

        While I agree it is great that aligner X can state that it can align Y reads in Z minutes/hours/days under the easiest of requirements (a few mismatches or color errors), it is not a valid comparison, since:
        1. How does your algorithm scale to more mismatches and color errors?
        2. How does your algorithm scale to longer reads (50bp and 75bp mate-end reads)? These data exist!
        3. How do you handle insertions and deletions in the alignment of one end of a read, for example to find 1-10bp indels?
        These are the requirements I make sure to meet.

        For example, if the (color) error rate is ~10%, then for longer reads you will need more tolerance to error (75bp data would require tolerating ~7.5 color errors). You throw out 60-80% or more of the data if you ignore these cases.
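
        To put numbers on that, here's a quick binomial sanity check (a back-of-the-envelope sketch assuming independent errors at a flat 10% rate):

        Code:
        #!/usr/bin/perl
        use strict;
        use warnings;

        # P(read has <= k errors) for a length-n read with per-color error
        # rate p, computed via the binomial pmf recurrence.
        sub binom_cdf {
            my ($n, $p, $k) = @_;
            my $term = (1 - $p) ** $n;   # P(X = 0)
            my $cdf  = 0;
            for my $i (0 .. $k) {
                $cdf  += $term;
                $term *= ($n - $i) / ($i + 1) * $p / (1 - $p);  # P(i)->P(i+1)
            }
            return $cdf;
        }

        # 75bp color reads at ~10% error: a 2-error cap keeps only ~1.6% of
        # reads, and even a 7-error cap keeps only about half.
        printf "<=2 errors: %.1f%%\n", 100 * binom_cdf(75, 0.10, 2);
        printf "<=7 errors: %.1f%%\n", 100 * binom_cdf(75, 0.10, 7);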

        Comment


        • #19
          Those are good questions.
          We had to add some special algorithms for SOLiD: we allow higher mismatch counts towards the end of the sequence, since SOLiD reads get quite noisy towards the end, and our speed goes up when you increase the number of mismatches.
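
          Roughly speaking, a position-tiered cap looks something like this (a simplified illustration, not our production code; the split point and caps here are arbitrary):

          Code:
          #!/usr/bin/perl
          use strict;
          use warnings;

          # Illustrative tiered policy: a strict cap on the (cleaner) 5' half
          # of the read and a looser cap on the (noisier) 3' half.
          sub passes_tiered_cap {
              my ($mm_positions, $len, $head_cap, $tail_cap) = @_;
              my $split = int($len / 2);           # assumed split point
              my ($head, $tail) = (0, 0);
              for my $pos (@$mm_positions) {
                  if ($pos < $split) { $head++ } else { $tail++ }
              }
              return $head <= $head_cap && $tail <= $tail_cap;
          }

          # A 50mer with mismatches at 3, 30, 41, 47: 1 early, 3 late, so it
          # is accepted under a (2 early, 4 late) policy.
          print passes_tiered_cap([3, 30, 41, 47], 50, 2, 4) ? "keep\n" : "drop\n";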

          The VA rule works well statistically. It's the whole premise behind the SOLiD machine.
          The tutorial explains it clearly, so even non-mathematicians can understand it well.
          Of course, one can always overcomplicate things, usually degrading results.

          I'll try to find time to come back tomorrow to see if you can post some actual numbers.
          Until then, we will have to continue to accept "100M 25mers, 2 mismatches, 3G human reference, on 1 machine in 30 minutes" as the world record, I suppose.

          Comment


          • #20
            Originally posted by BioWizard View Post
            The VA rule works well statistically. It's the whole premise behind the SOLiD machine.
            The tutorial explains it clearly, so even non-mathematicians can understand it well.
            Of course, one can always overcomplicate things, usually degrading results.
            Again, the VA rules work well under certain conditions, namely low error rates, which is not necessarily the case for ABI SOLiD data; therefore you miss out on a lot of the data, and in certain cases falsely align. Certainly the VA rule is NOT the premise behind the machine; rather, the premise is a simple XOR encoding scheme (crypto or coding theory, anyone?), which has been well studied by computer scientists and mathematicians alike. The VA rules comprise a heuristic, not a complete solution; in other words, they are only the first step in truly understanding the power of two-base encoding.

            Originally posted by BioWizard View Post
            Until then, we will have to continue to accept "100M 25mers, 2 mismatches, 3G human reference, on 1 machine in 30 minutes" as the world record, I suppose.
            Here's an algorithm that should hold the world record. Call it for every read you have:
            Code:
            #!/usr/bin/perl
            use strict;
            use warnings;

            # "Align" every read to the same place -- instantly.
            while (defined(my $a = <STDIN>)) {
                chomp($a);
                print STDOUT "$a=>hg18 chr1 pos1\n";
            }
            In the example above, I aligned all 100M reads to chr1 position 1, but none of them were correct. What this highlights is: how many of the 100M reads do you actually align, and align correctly (sensitivity and specificity)? And how many of them *could* you have aligned if you tried searching for more mismatches, color errors, and indels (ah, indels, always neglected)?
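
            For simulated reads where the truth is known, the accounting I mean takes only a few lines (a sketch; the tab-separated input format is made up for illustration):

            Code:
            #!/usr/bin/perl
            use strict;
            use warnings;

            # Input (illustrative format): read_id, true_chr, true_pos,
            # aligned_chr, aligned_pos per line, with "*" in the aligned
            # columns for an unmapped read.
            my ($total, $mapped, $correct) = (0, 0, 0);
            while (my $line = <STDIN>) {
                chomp $line;
                my ($id, $tchr, $tpos, $achr, $apos) = split /\t/, $line;
                $total++;
                next if $achr eq '*';
                $mapped++;
                $correct++ if $achr eq $tchr && abs($apos - $tpos) <= 5;  # small slop
            }
            die "no reads\n" unless $total && $mapped;
            printf "sensitivity (correct/total):  %.3f\n", $correct / $total;
            printf "precision   (correct/mapped): %.3f\n", $correct / $mapped;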

            My main complaint, and this is directed at the alignment community in general, is that proper context is most often not given. There are many subtleties involved in alignment that require serious thought when comparing alignment algorithms, and none of them are generally presented (especially with "X reads in Y minutes" statements).

            Comment


            • #21
              Originally posted by BioWizard View Post
              Those are good questions.
              We had to add some special algorithms for SOLiD: we allow higher mismatch counts towards the end of the sequence, since SOLiD reads get quite noisy towards the end, and our speed goes up when you increase the number of mismatches.

              The VA rule works well statistically. It's the whole premise behind the SOLiD machine. The tutorial explains it clearly, so even non-mathematicians can understand it well. Of course, one can always overcomplicate things, usually degrading results.

              I'll try to find time to come back tomorrow to see if you can post some actual numbers. Until then, we will have to continue to accept "100M 25mers, 2 mismatches, 3G human reference, on 1 machine in 30 minutes" as the world record, I suppose.
              BioWizard,

              I ask that you please take care with your tone towards other members, particularly those who write free tools for the community.

              I understand your desire to promote your software, and I encourage you to do so with data and evidence that can be independently replicated by the community. SEQanswers will not tolerate "challenges" or lavish marketing claims from commercial users who do not make their software freely available for reciprocal testing.

              Comment


              • #22
                Your free software is wonderful, an amazing piece of work... but
                I'm still asking my naive question:
                How many hours for 100 million 25mers in COLORSPACE, at 2 mismatches, on one computer, against a 3G human reference?

                Sorry if this is considered "not a nice tone"... I thought this forum was meant for exchanging honest information among scientists who are all thirsty for the truth.

                Anyway, this is the last time I will ask, since I never knew such a simple question would not be permitted in a scientific community.

                Also, I had posted results on some data files that another scientist on the Bioinformatics forum was kind enough to refer us all to (he said it was "1000 Genomes" data), so everyone can verify: the calling rate was great, and the data was much cleaner than what we're used to running. Our "sensitivity" is as high as it can be (without introducing bogus matches), based on all the empirical tests done (and that is the only kind of test I trust).

                We do live demos for the public at trade shows (everyone is invited to San Diego to see for themselves). It's true that we cannot give away free products, as we don't have the taxpayer paying our salaries every month... at least not yet; anyone know how to get some of Obama's bailout package?
                But seriously, ECO is the owner of this web site, it seems. If you don't want us here providing the scientific community with fresh, relevant information, if you want them to only know about the free software that doesn't make "lavish claims" (how about "bowtie: ULTRA fast, ULTRA compact"? no, that's not lavish, it's free)...
                we'll be happy to leave.

                Comment


                • #23
                  Hi
                  I recently did a single-CPU matching analysis using corona_lite v4.
                  My machine: a dual-core Intel Core 2 Duo E6550 @ 2.33GHz, 8 GB RAM + 1.7 GB swap space, running 64-bit openSUSE 11.0.

                  My F3 and R3 .csfasta files had ~18.7 and ~18.5 million reads, respectively. We were looking for 4 errors, allowing for 200 bases between mates, comparing to a single 100-million-base-pair chromosome section, and permitting indel detection.

                  The run was bound to a single CPU from start to finish and the RAM was actively used; very little swapping was seen on the machine. Overall it took 70 hours. I've seen similar performance from MAQ; I haven't tried this with BLAT or Bowtie yet.
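
                  Putting my run next to the earlier claim is simple arithmetic (back-of-the-envelope; not apples-to-apples, since my run searched 4 errors plus indels):

                  Code:
                  #!/usr/bin/perl
                  use strict;
                  use warnings;

                  # Throughput comparison using the figures quoted in this thread.
                  my $reads_here  = 18.7e6 + 18.5e6;   # F3 + R3 reads in this run
                  my $hours_here  = 70;
                  my $reads_claim = 100e6;             # "100M 25mers"
                  my $hours_claim = 0.5;               # "in 30 minutes"
                  printf "this run: %.2g reads/hour\n", $reads_here  / $hours_here;   # ~5.3e5
                  printf "claimed : %.2g reads/hour\n", $reads_claim / $hours_claim;  # 2e8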

                  Moral of the story: use an appropriate machine for next-gen data analysis.
                  Last edited by MadraghRua; 03-11-2009, 03:25 PM. Reason: Final comment added

                  Comment


                  • #24
                    Originally posted by BioWizard View Post
                    I'm still asking my naive question:
                    How many hours for 100 million 25mers in COLORSPACE, at 2 mismatches, on one computer, against a 3G human reference?
                    You have not defined your question well enough, and to imply that it is not welcome is insincere and downright rude. I would ask that you define your problem and how the resulting solution (the alignments) will be judged (see the discussion below). If it is based on timing, or reads mapped, or a combination of the two, my previous perl program is the clear winner (though it gets the alignment right with frequency ~1/10^9). So you would need to produce a dataset that measures sensitivity (2 mismatches) as well as accuracy (various false-positive rates).

                    Internally, we generate only 50mer color-space reads. I suppose I could cut off the last 25 (though this wouldn't reflect true error rates). I would be happy to test your software and compare it to BFAST as well as other aligners (I have a pipeline for most of them: BFAST, MAQ, BWA, Bowtie, SOAP, etc.). Send me a link to some data.

                    Some questions: How do you measure the quality of your alignments (hopefully not % of reads mapped)? How does your software scale (in sensitivity) to 50mers? Will you be able to support 6 or more color errors in 50mers, and 8 or more in 75mers (~10% error rate)? How do you deal with indels, and what is your power to find indels of various lengths (1-25bp)? What "rules" or algorithm do you use to find the color errors and detect variants in each alignment? The local alignment step is the most expensive step of color-space alignment, since it involves a larger search space (in the Illumina world, the most frequent error mode is mismatches, or SNPs). This step can be improperly performed using various shortcuts (rule-based approaches), which dramatically reduce alignment quality in exchange for lower computational cost.

                    Comment


                    • #25
                      Our output is the same as ABI's alignment software's, so we don't have to worry about "is it too sensitive" or "not sensitive enough". We let ABI's scientists spend years (and millions of their R&D dollars) worrying about that. We only worry about making it the fastest in the world (by orders of magnitude).

                      I have devoted some time to trying to shed light for the sequencing community, but I am afraid I am getting too overloaded with customers (priority one) to afford much "pro bono" work, so, regretfully, I will be visiting this web site more rarely (when the boss isn't around), as much as it has been a pleasure. I have also stopped refereeing papers (man, you can spend whole weeks on some papers if you want to do an honest job, and all you get in the end is your boss complaining that you didn't get any real work done... plus a modest "thank you" from the editors, followed by another paper to review).

                      Cheers,

                      BioWizard

                      Comment
