SEQanswers

Old 03-05-2009, 12:46 PM   #1
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default ISAS Alignment Software

Hi,

We're from a company called Imagenix Technologies.

We just want to give everyone a "heads up". We'll be at the NextGen Sequencing conference in San Diego in 2 weeks, doing real-time live demos of the world's fastest alignment system: ISAS

It was not easy to state this confidently, because we found that the makers of most alignment software don't like to disclose how fast (or how slow) their alignment is. We are proud of our numbers and display them prominently:

"100 million 25mers in colorspace with 2 substitutions on full human genome (3GB) reference in 30 minutes on ONE computer".

To be more specific: when Applied Biosystems ran our software (which they licensed for their own use) on their Dell, it took 36 minutes to align 100 million 25mers straight from one of their SOLiD machines, with 2 substitutions, against the 3 GB human reference. When we run it on our computer (dual-socket quad-core Penryn), it takes 29 minutes. Next week we expect to build a new computer (the current one is over a year old, thus obsolete) and hope to get under 20 minutes for 100 million 25mers in colorspace. We cannot "brag" about 20 minutes yet... not until the machine is built and we can start running. Also, longer sequences take longer to align. For example, 88-base reads with 4 substitutions, Illumina (basespace) data from an Illumina machine: 56 million reads plus another 56 million "paired end" mates took 2 hours on our old computer.
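For a sense of scale, the run times quoted above can be turned into average reads per second. This is only arithmetic on the figures in the post; the function name and the labels are made up for illustration, and an average over a whole run says nothing about how the time splits between aligning and writing output.

```python
def reads_per_second(n_reads, minutes):
    """Average throughput over a whole run, in reads per second."""
    return n_reads / (minutes * 60.0)


# Figures quoted in the post (100 million 25mers, 2 substitutions, human reference).
runs = {
    "Applied Biosystems Dell, 36 min": (100_000_000, 36),
    "dual-socket quad-core Penryn, 29 min": (100_000_000, 29),
    # 56 million 88-base reads plus their 56 million mates, 4 substitutions.
    "Illumina paired-end, 2 hours": (112_000_000, 120),
}

for label, (n_reads, minutes) in runs.items():
    print(f"{label}: {reads_per_second(n_reads, minutes):,.0f} reads/s")
```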

So... if you're coming to San Diego, we'll be happy to see you at booth #40, next to the food stand

We encourage you to bring CDs or DVDs with cfasta or fastq files so we can show you on the spot what fast alignment looks like. Please bring human data, or if you want to do some other species, you'll have to bring the fa files so we can build a reference database for your species.

Meanwhile.... we encourage your feedback, for example telling everyone how slow your current alignment is

But seriously, folks... maybe we can have a productive discussion on this forum? For example, one university was so frustrated with how slow their alignment was that they wanted to spend lots of money to build some fancy custom hardware (FPGA, and all that painfully expensive and instantly obsolete stuff), because they had never heard of ISAS. So what some might think is selfish promotion on our behalf, others see as helpful info for the entire community.

Cheers !
Old 03-05-2009, 01:07 PM   #2
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Looks good, is this dataset (100M CS) available for download for comparison with other aligners?
Old 03-05-2009, 02:00 PM   #3
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Alternatively, you may align some publicly available data set and give your results, such as CPU time, memory, #aligned reads, #proper pairs and so on. I think the data here might be good (human male; 1000genomes data done by Illumina):

ftp://ftp.era.ebi.ac.uk/vol1/fastq/ERR000/ERR000589

Most aligners have to make a tradeoff between speed, memory and accuracy, especially for paired-end alignment. It would be good to show accuracy as well. This is particularly important for people who are interested in structural variations.
Old 03-05-2009, 02:02 PM   #4
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default How to get SOLiD data for alignment

As for the ABI data, it's all downloadable from their web site, although it takes forever, and it's a pain to find the link. If you can't find the link, I'll search for it. As for the finite bandwidth of their web site... nothing I can do about that, ours is probably worse.

As for the Illumina data, we got it from one of our customers, and although we didn't sign an NDA, I would consider it unethical to share this w/o their permission.
Old 03-06-2009, 04:55 AM   #5
kmay
Member
 
Location: Munich, Germany

Join Date: Aug 2008
Posts: 29
Thumbs up Some questions...

Hi BioWizard!

Impressive number, indeed!

To better understand your ISAS, I have some questions about the background of aligning.
1) 2 point mutations: do you search exhaustively for all combinations of pms in the 25mer? Are all found alignments true positives?

2) Do you mask the genome? How do you treat multiple matches? Do you keep them? What are you doing to repeats?

3) Any estimation of false negatives?

4) How do you treat InDels ? What effect has it on timings?

5) Any restrictions on read-length? If so, min/max?

6) How does it perform in sequence space? Do you consider quality files?
Cheers

Klaus

Last edited by kmay; 03-06-2009 at 05:26 AM.
Old 03-06-2009, 02:12 PM   #6
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

Thanks for the link, lh3,

I was worried that it would take forever to download, but actually those files are quite small, only 12M 50mers, and they downloaded rather quickly. I ran each file separately, as well as both together as pairs (which, indeed, they turn out to be). I used the settings: 2 substitutions, max. 10 repeats. After running, I can see that the data is of rather good quality, too. I will paste the "histograms" below. On the obsolete 2GHz server that R&D gets to use (while our customers get systems twice as fast as ours...) it took about 8 minutes for each single file, and about 13 minutes for both together as pairs. Because I didn't know the min. or max. distance between pairs, I used 1 base as the min. and a ridiculously large max. of 1 Mbase; I'll look at the output file to see the realistic distances.

file=ERR000589_1.fastq
Aligned 12139786 sequences (415.8 sec.)
Wrote 12139786 aligned sequences (82.0 sec.)

Total of 12139786 sequences done in a total of 8 minutes and 18 seconds.
*** NOTE: 19490 sequences were skipped (no. of matches set to 0) because they contained invalid characters.


Hits Histogram
==== =========
0 990159
1 9122035
2 346049
3 154639
4 106323
5 86753
6 75089
7 57069
8 42622
9 33468
10+ 1125580


file=ERR000589_2.fastq
Aligned 12139786 sequences (428.8 sec.)
Wrote 12139786 aligned sequences (83.1 sec.)

Total of 12139786 sequences done in a total of 8 minutes and 32 seconds.
*** NOTE: 15844 sequences were skipped (no. of matches set to 0) because they contained invalid characters.


Hits Histogram
==== =========
0 1296041
1 8883412
2 335689
3 150253
4 104689
5 84127
6 72865
7 55341
8 40498
9 32519
10+ 1084352


files=/home/Hadar/ISAS/IlluminaData/ERR000589_1.fastq,/home/Hadar/ISAS/IlluminaData/ERR000589_2.fastq,1,1000000

Aligned 12139786 sequence pairs (623.7 sec.)
Wrote 12139786 aligned sequence pairs (155.0 sec.)

Total of 12139786 sequence pairs done in a total of 12 minutes and 58 seconds.
*** NOTE: 35334 sequences were skipped (no. of matches set to 0) because they contained invalid characters.


Hits Histogram
==== =========
0 2043749
1 9350603
2 275721
3 131604
4 88321
5 65691
6 48955
7 28766
8 20653
9 16101
10+ 69622
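
The histograms above are easier to compare as percentages. A minimal Python sketch, assuming the "label count" layout shown in the output; the function names are invented for illustration and are not part of ISAS. Note that the "0" row also includes the sequences that were skipped for invalid characters, since the log says their number of matches was set to 0.

```python
# Hits histogram for the paired run, copied from the output above.
PAIRED_HISTOGRAM = """\
0 2043749
1 9350603
2 275721
3 131604
4 88321
5 65691
6 48955
7 28766
8 20653
9 16101
10+ 69622
"""


def parse_hits_histogram(text):
    """Map each hit-count label ("0", "1", ..., "10+") to its number of sequences."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, headers and rulers: keep only "<label> <count>" rows.
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def summarize(counts):
    """Group the histogram into unmapped, unique and repeated placements."""
    total = sum(counts.values())
    unmapped = counts.get("0", 0)
    unique = counts.get("1", 0)
    return {
        "total": total,
        "unmapped": unmapped / total,
        "unique": unique / total,
        "repeated": (total - unmapped - unique) / total,
    }


summary = summarize(parse_hits_histogram(PAIRED_HISTOGRAM))
print(f"total pairs: {summary['total']}")
for key in ("unmapped", "unique", "repeated"):
    print(f"{key}: {summary[key]:.1%}")
```

The rows sum to 12139786, the number of sequence pairs the log reports, which is a quick check that the histogram was copied completely.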



I will try to get the paired run result file posted at:

http://www.imagenix.com/publicdata

But I will have to remove it by Monday... so please, someone who has the bandwidth for this: copy it and post it somewhere for everyone. On Monday I will delete it before I get complaints.

Great weekend to all !
Old 03-06-2009, 02:41 PM   #7
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

Hi Klaus,

We search for any "mutations" which have up to the maximum specified number of mismatches. In the case of the public data which I just ran, the spec was "maximum 2 substitutions". It doesn't matter where in the read the mismatches fall,
so a max mismatch of 3 can be ....x....x....x... or ...xx....x... or ...xxx....
and, of course, all lesser mismatch counts: two (...x...x... or ....xx....), one (....x......) or zero (......), the last being when the sample was identical to the reference AND the sequencer did not make any errors.
The search is lossless, in the sense that there are no compromises or shortcuts: if anywhere in the reference there are N (in this example 50) bases with 0, 1 or 2 substitutions relative to the searched sequence, then that location will be found. The only exception: if too many hits have already been found, the search is abandoned. In this example, we set the limit to 10. So if a sequence is terribly repetitive, after 10 independent locations it will not be searched for anymore.
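
To make the search criterion concrete, here is a brute-force Python sketch of "report every location with at most m substitutions, and stop after a hit limit". It is not how ISAS works internally (the thread does not describe the algorithm, and a plain scan like this would be far too slow on a 3 GB reference); the names `find_hits`, `max_mismatches` and `max_hits` are made up, and reverse-strand matches are ignored.

```python
def mismatches_within(read, window, max_mismatches):
    """Return the substitution count, or None once it exceeds max_mismatches."""
    count = 0
    for a, b in zip(read, window):
        if a != b:
            count += 1
            if count > max_mismatches:
                return None
    return count


def find_hits(read, reference, max_mismatches=2, max_hits=10):
    """Scan every reference position; no seeds or shortcuts, so nothing is missed.

    Returns (position, mismatches) tuples. The scan is abandoned once
    max_hits locations have been found, as described for repetitive reads.
    """
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        count = mismatches_within(read, window, max_mismatches)
        if count is not None:
            hits.append((start, count))
            if len(hits) >= max_hits:
                break
    return hits


print(find_hits("ACGT", "AAAAACGTAAAA", max_mismatches=0))
print(find_hits("AA", "AAAAAAAA", max_mismatches=0, max_hits=3))
```

Because every window is compared, there are no false negatives under the "no more than m substitutions" assumption, apart from locations beyond the hit limit.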

We do NOT mask the reference, as we consider this a kind of "cheating". If the user WANTS to see 100 repeats, he has the ability to do so. We report all the repeats, up to the specified limit (this is why the output file is sooooo big). This brings an idea to my mind... if I find that I am unable to upload the results file, I'll re-run with a smaller limit (2 or 3?) and get a much smaller file and upload that one. So far, while I'm typing this... about 60MB (out of 1300MB) have been uploaded.

As for "false negatives": from the mathematical point of view, if you accept the assumption of "no more than m mismatches", then there are no false negatives. From the practical point of view (whatever nature can do to the sample's DNA, plus whatever disasters the sequencer can add due to its thermal/mechanical/electrical problems), no one can ever know the worst-case "false negatives". One can easily run simulations based on one's envelope of expectations. ABI has done such simulations (maybe they know the weaknesses of their machine better than others?) and were very happy, although in their case we added the VA (valid adjacent) function to keep the color code from missing real SNPs. If you're an Illumina customer, be happy that you don't have to worry about this problem. If you're a SOLiD customer, once you understand this problem you'll always run ISAS with VA mode turned on. There's a 5-page technical explanation of what I am talking about, so for Illumina customers: forget this.


Indel is currently not enabled. We had it enabled originally, but ABI wanted it off, which surprised me at the time, but since then we've seen really good results w/o indel, so we left it off. We can add it if customers demand it. I think it would slow alignment down by about two to three times.

Current version (3.2) read length range:

             min   max
colorspace   25    60
basespace    20    93   (we have one customer who is demanding 110, so this will go up in the next version)

We don't use the quality values provided by Illumina. This can be done in the future, but first we have to see concrete evidence that it REALLY helps. I've looked at a lot of claims of how great it is, but I didn't see that it really helped. We are relying on our partner for synthetic "gold standard" tests, as this is the only evidence I will trust. Some people do all kinds of "fancy" things and then say "I got more unique mapped" or "I got fewer repetitions", but in reality they incorrectly mapped a repeat as unique because they disqualified a match which was below their quality threshold. Arbitrarily deciding what the "magic" threshold for cutting off reads should be is a tricky business, and, I fear, not scientifically done.

Performance is faster (especially for longer reads) in basespace or "sequence space" (let's just call it "Illumina" !). In general, alignment is easier for Illumina data. ABI argues (I'm not taking sides here - I really don't know) that you save money by needing less consumables, and more computation when you do colorspace (less consumables - they say) and alignment with VA (more computation - I agree).

OK - I hope I've answered all your questions
I'm too exhausted to continue.... 179MBytes have been uploaded (out of 1300), I'll come back in an hour to check....


Old 03-06-2009, 09:14 PM   #8
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

OK, the file (1300MB) has been uploaded. Anyone with big space/bandwidth who can copy it from www.imagenix.com/publicdata and put it on your site, tell me so I can remove it.
Old 03-09-2009, 04:08 PM   #9
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

There have been many downloads of that 1.3GB file in the last 3 days, but so far as I know... no one has volunteered to host this file for the community - where's big government when you need them

I think by this time tomorrow I will have to delete the file.

Meanwhile I want to clarify something that several people have been asking recently:

The native color space version of ISAS also has a "Valid Adjacent" mode. Maybe it's the only alignment system that even implements Valid Adjacent rules, so you can catch 1 SNP PLUS 1 or 2 machine errors in the same SOLiD sequence. Does anyone know of any other alignment system that implements the VA rules and allows 4 substitutions instead of 2 for 25mers, so that VA can catch 1 SNP plus 2 machine errors? We'd like to know so we can acknowledge that there is another system. We allow 4 subs so you can even catch 2 SNPs in the same sequence (and color code VA rules make sure they really are SNPs).
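
For readers unfamiliar with why a SNP costs two color mismatches, here is a small Python sketch of SOLiD di-base encoding and of the adjacency check usually described for it. It assumes the standard encoding (A=0, C=1, G=2, T=3, color = XOR of neighbouring bases); whether ISAS's VA mode applies exactly this rule is not stated in the thread, and all function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(bases):
    """Encode a base sequence as colors: one color per pair of adjacent bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]


def color_mismatches(ref_colors, read_colors):
    """Positions where the read's colors differ from the reference's."""
    return [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]


def is_valid_adjacent(ref_colors, read_colors, i):
    """True if mismatches at i and i+1 are consistent with one base substitution.

    Changing a single base flips both colors that touch it, but leaves the
    XOR of those two colors unchanged, because the changed base cancels out.
    """
    return (
        ref_colors[i] != read_colors[i]
        and ref_colors[i + 1] != read_colors[i + 1]
        and (ref_colors[i] ^ ref_colors[i + 1]) == (read_colors[i] ^ read_colors[i + 1])
    )


ref = to_colors("ACGTACGT")
snp = to_colors("ACGGACGT")  # one base changed: T -> G at index 3
print(ref, snp, color_mismatches(ref, snp))
print(is_valid_adjacent(ref, snp, 2))
```

A single machine error changes one color only, and two unrelated adjacent errors will usually fail the XOR test, which is how adjacency rules separate likely SNPs from sequencing errors.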
Old 03-09-2009, 05:37 PM   #10
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358
Default

Subject edited for neutrality.
Old 03-12-2009, 11:57 AM   #11
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Thanks for posting the data, BioWizard. ISAS is really impressive, especially for its high error tolerance. Few algorithms remain fast while guaranteeing to find 3 or more mismatches.

Here are some stats I get from the file you uploaded:

# reads: 24279572
# mapped reads: 21947836
# reads mapped in proper pairs (external dist.<=300bp): 18995200
# unique mappings: 19326957
# unique mappings that exist in proper pairs: 18116368

BTW, is the time you were quoting the CPU time on a single core or across the 8 cores?
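
The counts above read more easily as fractions of all reads. A short Python sketch using only the numbers in this post; the dictionary keys are paraphrased labels, not output of any tool.

```python
# Counts copied from the stats listed above.
stats = {
    "reads": 24279572,
    "mapped reads": 21947836,
    "reads in proper pairs": 18995200,
    "unique mappings": 19326957,
    "unique mappings in proper pairs": 18116368,
}


def as_fractions(stats, total_key="reads"):
    """Express every count as a fraction of the total number of reads."""
    total = stats[total_key]
    return {name: count / total for name, count in stats.items() if name != total_key}


fractions = as_fractions(stats)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```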
Old 03-12-2009, 05:36 PM   #12
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

The time was "real time" (some people call it "wall clock time"), and it was on our old 2.0GHz dual socket quad core machine, in other words 8 cores.
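
The distinction matters when comparing aligners, so here is a minimal Python illustration of wall-clock time versus CPU time, using the standard library's `time.perf_counter` and `time.process_time`. The `measure` helper is made up for this example and has nothing to do with how ISAS reports its timings.

```python
import time


def measure(func, *args):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# Sleeping uses wall-clock time but almost no CPU time.
_, wall, cpu = measure(time.sleep, 0.05)
print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
```

For a program that keeps 8 cores busy the relation goes the other way: CPU time summed over all threads can approach 8 times the wall-clock time, which is why it helps to say which one a benchmark reports.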

It's about 80 to 85 percent of that time on a 2.8GHz dual quad Penryn, and it is MUCH faster on the new Imagenix Genome Cruncher machine (16 threads in one small box... I am drooling all over myself) that we're constructing right now for the NextGen Sequencing show.

It sounds like it is hard to believe for many people, so we encourage everyone to bring fastq or cfasta files to see for themselves. Please gzip before putting on a DVD or CD. The DVD/CD reader is so slow that it takes more time to copy the file to hard disk than to do alignment.

Anyway, it is I who thanks you, lh3: first you were kind enough to post a public data source for us all, then you analyzed the file, which I know is time consuming, and finally, your encouraging words.

If you have more data you would like us to run, as a courtesy, it would be my pleasure to run for you. Just in the next few days I am overloaded, so let's say after the S.D. show is over (end of next week). You can mail us CDs/DVDs and it would be my pleasure to run. Especially when they let me get my hands on the new machine.
Old 03-23-2009, 10:22 AM   #13
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

Thanks to all the people that visited our booth in the San Diego Next Gen Sequencing Conference.

I also want to thank Hadar and Ryan who performed alignments in real time for the customers, day after day, with little chance to rest.

We were able to get the new "Genome Cruncher" computer shipped to the Hilton in San Diego, and demonstrated 100 million 25mers with 2 substitutions on full human reference in 15 minutes. I wish I could have been there, but someone had to stay behind.

For all those who had to wait in line, or couldn't make it at all, we invite you to come in for personal demos. We will soon be opening a demo center that will be open to the public, kind of like a "perpetual show". We hope those of you who couldn't make it to San Diego can make it to the next show in San Francisco. We are approx. 40 minutes from S.F., about 15 minutes from Applied Biosystems (Foster City), and 30 minutes from Illumina (Hayward).
Old 03-25-2009, 02:05 AM   #14
And37
Junior Member
 
Location: Hungary

Join Date: Mar 2009
Posts: 2
Default 3 subs?

Hi BioWizard,

Your results are extreme, respect. For most programs handling more substitutions seems to be more problematic, even when the matching sequences are limited to 10.

Can you give an estimate for the ISAS running for the 100M ABI data against the 3G human genome, but enabling 3 substitutions?

Thanks,
Andris
Old 03-27-2009, 11:35 AM   #15
snetmcom
Senior Member
 
Location: USA

Join Date: Oct 2008
Posts: 158
Default

I'd be more interested if BioWizard wasn't so pretentious and condescending. People in this field work hard.
Old 03-28-2009, 08:01 AM   #16
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

Hi Andris,

Sorry for not logging in here for a while... we've been overloaded recently. Every customer tells their friends, and so forth... We're moving into larger offices, so I hope we'll be able to hire more people, and reduce the load on us (in the short term its just adding MORE work). Now, to answer your good question:

Is this still for SOLiD, or Illumina ?
I'll assume SOLiD for now:
With 25mers, if you want to detect more than 2 substitutions, you need to go to VA (Valid Adjacent) mode. This will detect up to 4 color changes, and then apply VA rules to allow up to 2 SNPs (4 color code substitutions). This takes almost twice as long as regular mode (2 color code mismatches). For 25mers, it doesn't make sense to do more than 2 mismatches w/o VA, because then you artificially create repeats which are not real repeats; in other words, you lose specificity. It does make sense to use 3 mismatches for longer read lengths.
For example "50,3" (shorthand for "readlength=50, MaxAllowedMismatches=3"). Here are some run times:

25,2 28 minutes
35,3 50 minutes
35,4 44 minutes
50,3 112 minutes
50,4

All the runs above are on a single old computer: 8 cores (dual-socket quad-core) 2.0GHz Xeon with 24GB of 667MHz RAM, but it does have a faster than normal hard disk (300MByte/sec). It is MUCH faster on the new Imagenix Genome Cruncher, which will be in production in about 2 weeks.

Also, why do I ask if SOLiD or Illumina ? Illumina has much lower substitution rates for 2 reasons:
1. A legitimate SNP only causes 1 base change (vs. two color code changes)
2. The raw machine error rate is lower, or maybe they are just clever enough to filter out lower quality calls, which it doesn't look like SOLiD is doing (yet).

so you run with a lower (MaxAllowedSubstitution) / (ReadLength) ratio on Illumina data. Three mismatches for Illumina is probably good for around 65mers or so. If you're interested, we'll run a test.
Old 04-02-2009, 08:55 AM   #17
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

Hi snetmcom,

Sorry if I appear "pretentious and condescending" to you. I'm just trying to state the facts (and stimulate everyone else to state the facts on their end). I know a lot of people are working hard. That's how mankind progresses (from caves to next gen sequencers - it took a lot of hard working people).
What organization are you with ? Are you already producing hundreds of millions (or billions) of sequences ? Do you currently need a cluster (and a whole day) to do alignment ? Would it be progress if you could do it on one computer in under one hour ?
Do you know how much pollution is caused generating electricity for all those clusters that are not needed anymore? If people just had better software, they wouldn't need "embarrassingly parallel" solutions. It also helps to have 1 good computer instead of 100 weak (performance wise, not electricity wise) ones.

Anyway, I think the entire community would benefit if you share with us your current situation. Thanks.
Old 04-06-2009, 03:36 AM   #18
And37
Junior Member
 
Location: Hungary

Join Date: Mar 2009
Posts: 2
Default

Quote:
Originally Posted by BioWizard View Post
Sorry for not logging in here for a while... we've been overloaded recently.
Hi,

It's not a big surprise for such a software. I was interested in SOLID data, and you answered my possible sub-questions too, thanks!
Old 05-11-2009, 09:51 AM   #19
BioWizard
Member
 
Location: Houston, TX

Join Date: Mar 2009
Posts: 27
Default

We recently posted some common benchmarks for ISAS with the new Imagenix Genome Cruncher computer, side by side with a Dell server. The Genome Cruncher runs ISAS between 2 and 3 times faster. You can see at:

http://www.imagenix.com/genomics/per...enchmarks.html

We are also organizing a 1 day workshop for ISAS users (or future ISAS users), where we will instruct in installing and running ISAS, for both Illumina and ABI. Participants are encouraged to bring their own data (on DVDs or external USB disks). If you're interested, email

workshop@imagenix.com
Old 05-11-2009, 11:10 AM   #20
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

And here we go again. What is your accuracy (% of reads aligned correctly divided by the % of reads aligned)? What is your sensitivity (expected vs. observed)? How do you

For color space, how do you do local alignment? Is it gapped? Otherwise, what heuristic do you use to find indels?

Looks great so far; I am not surprised by the speed but it needs just a little more context before I would switch. If you need help with simulated datasets, let me know!
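
With a simulated dataset, where the true origin of every read is known, the two measures asked about above could be computed along these lines. This is a sketch under stated assumptions: accuracy follows the definition given in this post (correctly aligned reads over aligned reads), sensitivity is taken as correctly aligned reads over all simulated reads, which is one common reading of "expected vs. observed", and the names `evaluate` and `tolerance` are invented for illustration.

```python
def evaluate(truth, calls, tolerance=0):
    """Compare reported positions against the simulated truth.

    truth: read id -> true position
    calls: read id -> reported position, or None if the read was not aligned
    Returns (accuracy, sensitivity).
    """
    aligned = {rid: pos for rid, pos in calls.items() if pos is not None}
    correct = sum(
        1
        for rid, pos in aligned.items()
        if rid in truth and abs(pos - truth[rid]) <= tolerance
    )
    accuracy = correct / len(aligned) if aligned else 0.0
    sensitivity = correct / len(truth) if truth else 0.0
    return accuracy, sensitivity


truth = {"r1": 100, "r2": 200, "r3": 300, "r4": 400}
calls = {"r1": 100, "r2": 250, "r3": 300, "r4": None}
accuracy, sensitivity = evaluate(truth, calls)
print(f"accuracy: {accuracy:.2f}, sensitivity: {sensitivity:.2f}")
```

Reporting both numbers guards against an aligner that looks accurate only because it declines to place the difficult reads.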
Tags
alignment, imagenix, isas, mapping, solid solexa
