**BioWizard** · 03-28-2009, 08:01 AM

Hi Andris,

Sorry for not logging in here for a while... we've been overloaded recently. Every customer tells their friends, and so forth... We're moving into larger offices, so I hope we'll be able to hire more people and reduce the load on us (in the short term it's just adding MORE work). Now, to answer your good question:

Is this still for SOLiD, or Illumina ?
I'll assume SOLiD for now:
With 25mers, if you want to detect more than 2 substitutions, you need to go to VA (Valid Adjacent) mode. This will detect up to 4 color changes, and then apply VA rules to allow up to 2 SNPs (4 color code substitutions). This takes almost twice as long as regular mode (2 color code mismatches). For 25mers, it doesn't make sense to do more than 2 mismatches w/o VA because then you artificially cause repeats which are not real repeats; in other words, you lose specificity. It does make sense to use 3 mismatches for longer read lengths.
For example, "50,3" is shorthand for "readlength=50, MaxAllowedMismatches=3". Here are some run times:

25,2 28 minutes
35,3 50 minutes
35,4 44 minutes
50,3 112 minutes
50,4

All the runs above are on a single old computer: 8 core (dual socket quad core) 2.0GHz Xeon with 24GB 667MHz RAM, but it does have a faster than normal hard disk (300MByte/sec). It is MUCH faster on the new Imagenix Genome Cruncher, which will be in production in about 2 weeks.

Also, why do I ask if SOLiD or Illumina ? Illumina has much lower substitution rates for 2 reasons:
1. A legitimate SNP only causes 1 base change (vs. two color code changes)
2. The raw machine error rate is lower, or maybe they are just clever enough to filter out lower quality calls, which it doesn't look like SOLiD is doing (yet).

So you can run with a lower (MaxAllowedSubstitution) / (ReadLength) ratio on Illumina data. Three mismatches for Illumina is probably good for around 65mers or so. If you're interested, we'll run a test.
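To illustrate the point about a SNP causing two color code changes, here is a minimal sketch of SOLiD-style color encoding. It assumes the standard 2-bit base codes (A=0, C=1, G=2, T=3) with each color being the XOR of two adjacent bases. The function names are made up for this example and are not part of ISAS, and the "valid adjacent" check shown is only the usual XOR consistency test, not necessarily the exact VA rules ISAS applies.

```python
# 2-bit base codes; a color is the XOR of two adjacent base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(bases):
    """Return one color (0-3) for each pair of adjacent bases."""
    codes = [BASE_CODE[b] for b in bases]
    return [x ^ y for x, y in zip(codes, codes[1:])]


def is_valid_adjacent(ref_pair, read_pair):
    """True if two adjacent color changes are consistent with one SNP.

    XOR-ing two adjacent colors cancels the middle base, so a single
    base substitution changes both colors but leaves their XOR unchanged.
    """
    both_changed = ref_pair[0] != read_pair[0] and ref_pair[1] != read_pair[1]
    same_xor = (ref_pair[0] ^ ref_pair[1]) == (read_pair[0] ^ read_pair[1])
    return both_changed and same_xor


ref = "ACGTACGT"
snp = "ACGAACGT"  # single substitution T -> A at base index 3

ref_colors = to_colors(ref)
snp_colors = to_colors(snp)

# Indices of the colors that differ between reference and read.
diff = [i for i, (r, s) in enumerate(zip(ref_colors, snp_colors)) if r != s]
print(diff)
```

One base change touches exactly the two colors on either side of it, which is why a mismatch budget counted in colors has to be roughly doubled to allow the same number of SNPs.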

**BioWizard** · 04-02-2009, 08:55 AM

Hi snetmcom,

Sorry if I appear "pretentious and condescending" to you. I'm just trying to state the facts (and stimulate everyone else to state the facts on their end). I know a lot of people are working hard. That's how mankind progresses (from caves to next gen sequencers - it took a lot of hard working people).
What organization are you with ? Are you already producing hundreds of millions (or billions) of sequences ? Do you currently need a cluster (and a whole day) to do alignment ? Would it be progress if you could do it on one computer in under one hour ?
Do you know how much pollution is caused generating electricity for all those clusters that are not needed anymore ? If people just had better software, they wouldn't need "embarrassingly parallel" solutions. It also helps to have 1 good computer instead of 100 weak (performance wise, not electricity wise) ones.

Anyway, I think the entire community would benefit if you share with us your current situation. Thanks.

**And37** · 04-06-2009, 03:36 AM

Originally posted by BioWizard

Sorry for not logging in here for a while... we've been overloaded recently.

Hi,

It's not a big surprise for such software. I was interested in SOLiD data, and you answered my possible sub-questions too, thanks!

**BioWizard** · 05-11-2009, 09:51 AM

We recently posted some common benchmarks for ISAS with the new Imagenix Genome Cruncher computer, side by side with a Dell server. The Genome Cruncher runs ISAS between 2 and 3 times faster. You can see them at:

ISAS - Imagenix Sequence Alignment system

http://www.imagenix.com/genomics/performancebenchmarks.html

ISAS is the fastest genomic short sequencing mapping system available.

We are also organizing a 1 day workshop for ISAS users (or future ISAS users), where we will instruct in installing and running ISAS, for both Illumina and ABI. Participants are encouraged to bring their own data (on DVDs or external USB disks). If you're interested, email

[email protected]

**nilshomer** · 05-11-2009, 11:10 AM

Originally posted by BioWizard

We recently posted some common benchmarks for ISAS with the new Imagenix Genome Cruncher computer, side by side with a Dell server. The Genome Cruncher runs ISAS between 2 and 3 times faster. You can see at:

ISAS - Imagenix Sequence Alignment system

http://www.imagenix.com/genomics/performancebenchmarks.html

ISAS is the fastest genomic short sequencing mapping system available.

We are also organizing a 1 day workshop for ISAS users (or future ISAS users), where we will instruct in installing and running ISAS, for both Illumina and ABI. Participants are encouraged to bring their own data (on DVDs or external USB disks). If you're interested, email

[email protected]

And here we go again. What is your accuracy (% of reads aligned correctly divided by the % of reads aligned)? What is your sensitivity (expected vs. observed)? How do you

For color space, how do you do local alignment? Is it gapped? Otherwise, what heuristic do you use to find indels?

Looks great so far; I am not surprised by the speed but it needs just a little more context before I would switch. If you need help with simulated datasets, let me know!
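For anyone wanting to put numbers on these questions with simulated reads, here is a rough sketch. It takes accuracy as reads aligned correctly divided by reads aligned, as stated above, and assumes sensitivity means the fraction of all simulated reads recovered at their true position (one possible reading of "expected vs. observed"). The function name and the dictionary layout are assumptions for this example only.

```python
def alignment_metrics(truth, reported):
    """Compute (accuracy, sensitivity) for a simulated data set.

    truth:    read name -> true reference position
    reported: read name -> reported position (unaligned reads are absent)
    """
    aligned = [name for name in truth if name in reported]
    correct = [name for name in aligned if reported[name] == truth[name]]
    accuracy = len(correct) / len(aligned) if aligned else 0.0
    sensitivity = len(correct) / len(truth) if truth else 0.0
    return accuracy, sensitivity


truth = {"r1": 100, "r2": 250, "r3": 400, "r4": 900}
reported = {"r1": 100, "r2": 250, "r3": 777}  # r3 misplaced, r4 unaligned

accuracy, sensitivity = alignment_metrics(truth, reported)
print(accuracy, sensitivity)
```

A real comparison would also need a rule for reads with several equally good hits, which this sketch ignores.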

**BioWizard** · 05-11-2009, 06:40 PM

ISAS Benchmarks Posted

Yes, looks like here we go again. I really can't afford to spend too much time, but I'll answer this one more time as simply as I can put it:

If the run mode is L/M, then for every sequence in the input file, ALL sequences in the reference which have a length L and up to M substitutions compared with the input sequences are reported. The only exception, is if the number of repeats exceeds the maximum number of repeats allowed (e.g. 10 or 2 shown in the benchmarks). For example:

"25/2 max. repeats=10"
means Length=25, MaxSubstitutionsAllowed=2; if there are fewer than 10 repeats, then all of them will be reported.

1. There is no issue of "sensitivity". In the above example, if 3 hits
are reported, then it is a mathematical fact that there are no
more out there with 2 or fewer substitutions. Only if 10 hits
are reported (equal to the max. repeats specified) do you
know there might be more hits out there. If you care to know
the rest, you can run with a higher repeat max.

2. There is no "cheating". I am getting more tolerant to your skepticism,
now that I've seen the outputs of several other (free) alignment systems.
These are the kinds of cheating I've seen: (none of which we do)
a. When they find one hit (which maybe they think is a "good one"),
they stop searching, and report a unique hit, even though we
find multiple hits for this sequence, sometimes even with the same
number of mismatches. Then, they say they have a high percentage of
sequences that "aligned".
b. If they see a "difficult" sequence, they ignore it. Some call it "filtering"
We know that by ignoring the 0.5% most difficult sequences, we would
approximately double the speed. Maybe we will add this as an option.
If known to the user, and actively requested, then it is not cheating.
c. Mask out the difficult parts (repeat) of the reference. We think
the user should have the power to decide to ignore high repeats
by lowering the max. repeats allowed, and not be permanently
blinded to what the vendor considers "too many repeats" (which is
all subject to the length and number of mismatches specified anyway).
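The L/M reporting rule with a repeat cap, as described in point 1 above, can be sketched as a brute-force scan. This is only an illustration of the stated semantics (report every position with at most M substitutions, stop once the cap is reached). It says nothing about how ISAS actually searches, and the function name `find_hits` is made up for this example.

```python
def find_hits(read, reference, max_subs, max_repeats):
    """Return (hits, cap_reached) for an ungapped L/M search.

    hits is a list of (position, mismatches) for every window of
    len(read) in reference with at most max_subs substitutions.
    cap_reached is True when max_repeats hits were collected, meaning
    more hits may exist beyond those reported.
    """
    length = len(read)
    hits = []
    for start in range(len(reference) - length + 1):
        window = reference[start:start + length]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_subs:
            hits.append((start, mismatches))
            if len(hits) == max_repeats:
                return hits, True
    return hits, False


reference = "ACGTACGTTTACGA"
all_hits, capped = find_hits("ACGT", reference, max_subs=1, max_repeats=10)
few_hits, few_capped = find_hits("ACGT", reference, max_subs=1, max_repeats=2)
print(all_hits, capped)
print(few_hits, few_capped)
```

With the cap at 10 the three hits are the complete answer. With the cap at 2 the output only says that at least two hits exist.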

ISAS has a function to generate testing data, by randomly selecting sequences from the reference and adding random "sequencing errors" (or "SNPs" if you're an optimist) up to the maximum specified. For testing, we run a billion sequences after any non-trivial source code change and compilation. Each sequence is marked with its original location. If any sequence were not found, it would have indicated a bug somewhere. I assume everyone does this.
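A small-scale sketch of this kind of self-test, for readers who want to try it against their own aligner: sample reads from a reference, add up to M random substitutions, remember each origin, then check that the origin is among the reported hits. The brute-force `hit_positions` function stands in for the aligner under test. All names, sizes, and the random seed are assumptions for this example, not ISAS internals.

```python
import random


def simulate_reads(reference, length, max_subs, count, rng):
    """Return (origin, read) pairs carrying 0..max_subs substitutions."""
    reads = []
    for _ in range(count):
        origin = rng.randrange(len(reference) - length + 1)
        read = list(reference[origin:origin + length])
        # Pick distinct positions and replace each with a different base.
        for pos in rng.sample(range(length), rng.randint(0, max_subs)):
            read[pos] = rng.choice([b for b in "ACGT" if b != read[pos]])
        reads.append((origin, "".join(read)))
    return reads


def hit_positions(read, reference, max_subs):
    """Stand-in aligner: every position within max_subs substitutions."""
    length = len(read)
    return {
        start
        for start in range(len(reference) - length + 1)
        if sum(a != b for a, b in zip(read, reference[start:start + length]))
        <= max_subs
    }


rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
reads = simulate_reads(reference, length=25, max_subs=2, count=200, rng=rng)

# Any origin missing from the hit set would indicate a bug in the aligner.
missing = [o for o, r in reads if o not in hit_positions(r, reference, 2)]
print(len(missing))
```

This check has no repeat cap, so every origin must be found. With a cap in place, a read from a highly repetitive locus could legitimately be reported without its origin.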

Anyway, I really understand your skepticism now. I was amazed at the "cheating" that I saw from "famous" shareware. Maybe you had a similar experience, and became so skeptical. We don't do any cheating... our only "crime" is that we cannot give it away for free.

**nilshomer** · 05-11-2009, 07:24 PM

Thanks for the reply. A few more things

1.

Originally posted by BioWizard

...if there are less than 10 repeats, then all of them will be reported.

What does "repeat" mean in this context? Does this mean if a 25bp read matches > 10 places with the same "best score", it is ignored? Or does this mean that if a 25bp read *could* match >10 places with up to M mismatches, it is ignored?

2.

Originally posted by BioWizard

There is no "cheating". I am getting more tolerant to your skepticism,
now that I've seen the outputs of several other (free) alignment systems.
These are the kinds of cheating I've seen: (none of which we do)
a. When they find one hit (which maybe they think is a "good one"),
they stop searching, and report a unique hit, even though we
find multiple hits for this sequence, sometimes even with the same
number of mismatches. Then, they say they have a high percentage of
sequences that "aligned"

I cannot speak of other aligners, but BFAST (my free aligner) doesn't have this property as an aligner, so your generalization isn't correct (didn't your mom tell you not to generalize?). Anyways, I commend you for following the same path, which reports all hits found.

3.

Originally posted by BioWizard

c. Mask out the difficult parts (repeat) of the reference. We think
the user should have the power to decide to ignore high repeats
by lowering the max. repeats allowed, and not be permanently
blinded to what the vendor considers "too many repeats" (which is
all subject to the length and number of mismatches specified anyway).

This is a dangerous path, since for example structural variation occurs more frequently in repeat regions. Giving the user the option to try to align in more and more repetitive regions is very useful, though I think if I were to review a paper, I would ask the authors to align to the full reference, since removing repetitiveness sacrifices completeness for speed.

Also, with such a low threshold, depending on the average # hits returned, you might be introducing false-negatives.

4.

Originally posted by BioWizard

ISAS has a function to generate testing data, by randomly selecting sequences from the reference and adding random "sequencing errors" (or "SNPs" if you're an optimist) up to the maximum specified. For testing, we run a billion sequences after any non-trivial source code change and compilation. Each sequence is marked with its original location. If any sequence were not found, it would have indicated a bug somewhere. I assume everyone does this.

Some sequences should not be found with ISAS since you have a hard limit of 10 for repetitive loci. Is this not true?

Originally posted by BioWizard

...our only "crime" is that we cannot give away for free

It's not a crime, we all have family.

5.
If I could ask you to answer one question, and one question only: do you align with indels (for both Illumina and ABI)? If not, then by your criteria you are "cheating" since the time complexity of ungapped local alignment is linear whereas with gaps is quadratic. If so, do you do this with color space too?
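To make the complexity point concrete, here is a small sketch comparing an ungapped comparison (one pass, linear in read length) with a gapped one (a dynamic-programming table, quadratic in read length). It uses plain Levenshtein edit distance as a simple stand-in for gapped alignment, not the local alignment any particular aligner implements, and the example sequences are made up.

```python
def hamming(a, b):
    """Ungapped comparison: one pass over the read, O(L)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Gapped comparison: fills a len(a) x len(b) table, O(L^2)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),  # match or substitution
            ))
        prev = cur
    return prev[-1]


ref = "ACGTACGTAC"
read = "ACTACGTACG"  # G at index 2 deleted, one base appended at the end

print(hamming(ref, read))
print(edit_distance(ref, read))
```

A single deletion shifts everything after it, so the ungapped comparison sees eight mismatches where the gapped one sees two edits. A substitution-only search with a small mismatch limit would not report this read.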

**thondeboer** · 06-24-2011, 03:48 PM

In the SAM output that we received from a customer (NIH) that was run using ISAS, we noticed that the names for the paired-end reads, which should be identical according to the SAM definition, contained a "/1" and a "/2" suffix to identify the reads...
This breaks the SAM definition, I would think, since for paired reads (if you use "=" as the RNEXT string) the paired read's name should be the same as the first read's...

Is there a way to turn off this suffix for paired reads?

Thanks,

Thon de Boer
Product manager for Strand, Makers of Avadis NGS
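Until there is an answer on an ISAS option, one possible workaround is to post-process the SAM file and strip the suffix from the QNAME column. The sketch below assumes the suffix is always a literal "/1" or "/2" at the end of the first tab-separated field, and that the mate information is already carried by the FLAG bits (0x40 and 0x80), which has not been verified for ISAS output. The function name is made up for this example.

```python
import re

# Trailing "/1" or "/2" on the read name.
_MATE_SUFFIX = re.compile(r"/[12]$")


def strip_mate_suffix(sam_line):
    """Remove a trailing /1 or /2 from QNAME; header lines pass through."""
    if sam_line.startswith("@"):
        return sam_line
    fields = sam_line.split("\t")
    fields[0] = _MATE_SUFFIX.sub("", fields[0])
    return "\t".join(fields)


header = "@SQ\tSN:chr1\tLN:1000"
first = "read7/1\t99\tchr1\t100\t60\t25M\t=\t300\t225\tACGT\tIIII"
second = "read7/2\t147\tchr1\t300\t60\t25M\t=\t100\t-225\tACGT\tIIII"

for line in (header, first, second):
    print(strip_mate_suffix(line))
```

After this step both mates share the QNAME "read7", and the rest of each record is left untouched.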
