SEQanswers

Old 08-08-2008, 03:39 PM   #61
spirit
Member
 
Location: Canada

Join Date: Feb 2008
Posts: 11

Thank you for your interest. I will answer these questions as best I can.

1. What are the longest and shortest reads it can handle effectively?

Currently, ZOOM can handle reads from 15bp to 64bp. The core idea of ZOOM extends easily to longer reads; it is only the implementation that limits the length to 64bp. We will turn to 454 data once the version for Illumina/Solexa and ABI SOLiD is stable.

2. How does it compare to Eland or MAQ in reads aligned per minute?

As far as we know, ELAND is the fastest software for Illumina/Solexa data, so we benchmarked against ELAND. Mapping reads of 15bp to 32bp at the same sensitivity, ZOOM took about half the time of ELAND, and as little as one third for the shortest reads. Furthermore, ELAND can handle no more than about 16 million reads at a time, whereas ZOOM has no limit on the number of reads beyond what fits in your RAM. Both ELAND and ZOOM hash the reads and scan the reference sequence, so the more reads you process in one scan pass, the more time you save. Because ZOOM's speed depends closely on the length of the reference sequence and on the read length, it's hard to give a single figure for reads aligned per minute. To give you an impression, here are some numbers from our benchmark, all at full sensitivity for two mismatches:
It aligns 3.4 million 36bp BAC reads to the 162k region (where the BAC comes from) in 37 seconds with 1.1G RAM.
It aligns 24 million 36bp reads (5X coverage of human chromosome 6) to chromosome 6 in 17 minutes 17 seconds with 6.5G RAM.
It aligns 22 million 17bp ChIP-Seq reads to the whole human genome in 4 hours 22 minutes with 4.2G RAM.
For ABI/SOLiD data, ZOOM is slower than for Illumina/Solexa data. It aligns 28 million 25bp reads to the E. coli genome (4M) with automatic sequencing error correction in 5 minutes.
We tried to compare speed and sensitivity with MAQ since it's well known. However, I was totally puzzled by its input and output formats, so lazy me gave up, since its website declares it's slower than ELAND.
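
The "hash the reads, scan the reference" idea mentioned above can be sketched roughly as follows. This is only an illustration of exact matching on the forward strand; the function name `map_exact` and the toy data are made up for this example, and it is not ZOOM's or ELAND's actual code (both also handle mismatches and the reverse strand).

```python
from collections import defaultdict

def map_exact(reads, reference):
    """Hash the reads, then scan the reference once.

    reads: dict of read_name -> sequence (all the same length).
    Returns a dict of read_name -> list of 0-based match positions.
    """
    lengths = {len(seq) for seq in reads.values()}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes equal-length reads")
    k = lengths.pop()

    # Build the hash table from the reads, not from the reference.
    table = defaultdict(list)
    for name, seq in reads.items():
        table[seq].append(name)

    hits = {name: [] for name in reads}
    # One pass over the reference serves every read in the table,
    # which is why larger batches amortise the cost of the scan.
    for pos in range(len(reference) - k + 1):
        window = reference[pos:pos + k]
        for name in table.get(window, ()):
            hits[name].append(pos)
    return hits

# Toy data, invented for illustration.
reference = "ACGTACGTTAGC"
reads = {"r1": "ACGT", "r2": "TAGC", "r3": "GGGG"}
print(map_exact(reads, reference))
```

Because the table is built from the reads, memory grows with the number of reads while the scan time is dominated by the reference length, which fits the RAM and timing pattern in the figures above.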

3. How many mismatches does it handle?

In principle, you can choose any number of mismatches as long as it is less than the read length. ZOOM guarantees 100% sensitivity for a large range of <read length, mismatch number> cases.
When the number of mismatches requested is larger than the number covered by the <read length, mismatch number> case ZOOM uses, sensitivity decreases slightly. For example, mapping 50bp reads achieves 100% sensitivity with 4 mismatches; if you require 5 mismatches, sensitivity decreases slightly. However, if you do need 100% sensitivity in these cases, feel free to contact us and we will accommodate you.
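
One standard way to see how a filter can guarantee 100% sensitivity up to a given mismatch count is the pigeonhole argument: if a read is cut into k+1 pieces and has at most k mismatches, at least one piece must match exactly. The sketch below uses that argument. It is not ZOOM's seed design, which this thread does not describe, and `find_with_mismatches` is a hypothetical name.

```python
def split_read(read, pieces):
    """Cut read into `pieces` contiguous chunks; return (offset, chunk) pairs."""
    base, extra = divmod(len(read), pieces)
    out, start = [], 0
    for i in range(pieces):
        size = base + (1 if i < extra else 0)
        out.append((start, read[start:start + size]))
        start += size
    return out

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_with_mismatches(read, reference, k):
    """Return sorted (position, mismatches) for all hits with <= k mismatches."""
    if k >= len(read):
        raise ValueError("k must be smaller than the read length")
    candidates = set()
    for offset, chunk in split_read(read, k + 1):
        # Filter step: exact occurrences of one chunk nominate candidates.
        start = reference.find(chunk)
        while start != -1:
            pos = start - offset
            if 0 <= pos <= len(reference) - len(read):
                candidates.add(pos)
            start = reference.find(chunk, start + 1)
    hits = []
    for pos in sorted(candidates):
        # Verification step: count mismatches over the full read.
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= k:
            hits.append((pos, d))
    return hits

# Toy data: the read differs from reference[2:10] at one position.
print(find_with_mismatches("ACGTTCGA", "TTACGTACGATT", 1))
```

Asking for more mismatches than the filter was built for is where a guarantee like this breaks down: with k+2 mismatches, every one of the k+1 pieces can contain an error, so some true hits are never nominated.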

4. Does it have a gapped mode?
Yes, ZOOM can handle insertions/deletions between the reads and the reference sequence. For Illumina/Solexa data, one gap of any length you wish is allowed in addition to the requested mismatches. However, ZOOM can't guarantee 100% sensitivity when finding gapped alignments. I don't think anybody using a filtering strategy could.

5. What format is required for the reference genome?

The reference genome should be a FASTA file or multiple FASTA files.
Illumina/Solexa read files can be in FASTA, *_seq.txt or *_prb.txt format. ABI SOLiD *.csfasta is supported too.

6. What format are the alignments reported in?

For Illumina/Solexa data, this release of ZOOM reports output in the format "read_name reference_seq_name: position_of_mapped +/- mismatch_number". If assembly is requested, ZOOM also outputs the assembly consensus, the coverage and the frequency of {A,C,T,G} at each consensus position.
For ABI/SOLiD data, besides the alignment information, ZOOM can output the reads decoded into base space, with polymorphisms in base space and sequencing errors in color space highlighted.
In our next release, a GUI view will show the multiple alignment of mapped reads against the reference sequence, along with heterozygous sites.
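
For readers unfamiliar with color space, here is a minimal sketch of naive decoding of a SOLiD read into base space. It assumes the usual csfasta layout (one known primer base followed by color calls 0 to 3) and the standard di-base code, in which each color is the XOR of the 2-bit codes of adjacent bases. The function name is hypothetical, and this is not how ZOOM decodes reads.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3

def decode_colorspace(cs_read):
    """Naively decode a SOLiD color-space read such as 'T0123'.

    The leading primer base is used only to start the chain and is
    not part of the returned sequence.
    """
    primer, colors = cs_read[0], cs_read[1:]
    if primer not in BASES:
        raise ValueError("read must start with a primer base")
    prev = BASES.index(primer)
    decoded = []
    for c in colors:
        if c not in "0123":
            raise ValueError(f"unexpected color call: {c!r}")
        # Each color encodes the transition from the previous base.
        prev ^= int(c)
        decoded.append(BASES[prev])
    return "".join(decoded)

print(decode_colorspace("T0123"))
```

With this naive scheme a single wrong color call corrupts every base after it, which is why aligners typically map in color space and decode afterwards: a real SNP changes two adjacent colors, whereas an isolated color change is more likely a sequencing error.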

7. Can you comment on the cost/licenses it will be provided under?

As for the cost of the full version of ZOOM, it's probably best to ask the sales person once the website is ready next week. I think a free academic version for Illumina/Solexa data with limited functionality will be provided too.

8. Can you give us the link to the download when it's ready?

Sure. I will post the latest news when it's ready.


Quote:
Originally Posted by apfejes View Post
Thanks for the update, spirit.

Maybe you could give us a little bit of information on Zoom, as well, since things may have changed since last time I heard anything about it.

What are the longest and shortest reads it can handle effectively?
How does it compare to Eland or MAQ in reads aligned per minute?
How many mismatches does it handle?
Does it have a gapped mode?
What format is required for the reference genome?
What format are the alignments reported in?
Can you comment on the cost/licenses it will be provided under?
Can you give us the link to the download when it's ready?

I'm sure I'm missing other important information, but those are the first questions that occur to me.

Thanks!
Old 08-08-2008, 04:08 PM   #62
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

(ECO, if you're reading this, spirit's reply might be deserving of its own thread.)

Hi Spirit,

First of all, thanks for the long and complete answer. It sounds like a fantastic program - I'll definitely give it a try when it's available. In the meantime, I hope I'm not bothering you with too many questions. I'm very interested in giving Zoom a try, but don't want to evaluate software that won't meet my requirements.

Just to touch on a few points (comments and questions):

We're exclusively an Illumina/Solexa shop, at the moment, and we're starting to produce reads longer than 64bp. If it goes well, I don't think we'll be doing short read runs on the Illumina machines for much longer (Maybe just for chip-seq?) Anyhow, this will be important to Illumina users VERY soon.

I've never come across Eland having a 16M sequence limit - and I'm not sure why it's important, since it's trivial to run it once for each lane, anyhow. I've heard this claim from someone else when discussing zoom, so I thought I'd mention it.

As for the benchmarks, they sound pretty good, but aren't quite describing the "normal" situation we seem to come across most of the time. Would you be able to give the time for, say, 6 or 8 million 36bp or 42bp reads aligned to the complete human genome? What hardware was that benchmark run on? Is the code threaded for multi-CPU computers and/or does it use MPI?

Does Zoom take advantage of Illumina probability/quality scores when doing the alignments?

Can Zoom handle multiple alignments?

and

Also related to benchmarks, how much slower is the application in gapped mode vs. un-gapped mode?

Finally, just out of curiosity, why are you putting a GUI on an aligner?

Thanks!
Anthony
__________________
The more you know, the more you know you don't know. - Aristotle
Old 08-09-2008, 05:27 AM   #63
kmay
Member
 
Location: Munich, Germany

Join Date: Aug 2008
Posts: 29

Anthony,

some Illumina real life benchmarks are posted here
Old 08-09-2008, 04:08 PM   #64
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

Hi Kmay,

I don't see anything at that link about the Zoom aligner, which is what I was asking for benchmarks against.

Did I miss something?

Anthony
Old 08-10-2008, 08:15 PM   #65
spirit
Member
 
Location: Canada

Join Date: Feb 2008
Posts: 11

Hi, Anthony,

Sorry for the late reply. I was out for weekend. You are THE Anthony Fejes! Your blog enlightened us a lot!!! Thank you! We like it.

The benchmark results above were obtained on a single core of an AMD Opteron 275 CPU (2.2GHz) with 8G memory. The code is not multi-threaded yet. To parallelize, you currently need to divide the data set and run ZOOM multiple times.

I came across the 16M sequence limit using ELAND 0.2.2.5. For ZOOM, speed depends heavily on the number of times the reference genome is scanned. For example, mapping 20 million reads in one run takes much less time than two runs of 10 million reads each.
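
Dividing a read set for separate runs, as suggested above for parallelizing, could look roughly like the sketch below. The helper names are hypothetical and nothing here is part of ZOOM; it only shows FASTA records being dealt round-robin into chunks that could then be written out and mapped independently.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:], []
        elif name is not None:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)

def split_round_robin(records, n_chunks):
    """Deal records into n_chunks lists of near-equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks

# Toy input, invented for illustration.
lines = [">r1", "ACGT", ">r2", "TTTT", ">r3", "GG", "CC"]
for chunk in split_round_robin(parse_fasta(lines), 2):
    print(chunk)
```

Given the point about scan passes, splitting this way trades total CPU time for wall-clock time: each chunk pays for its own scan of the reference, so it only helps when the chunks really do run on separate cores or machines.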

We have only one data set of 36bp reads, which contains 3.4 million reads. So I simulated data sets by randomly picking 6 million and 8 million segments of the human genome with two mismatches. Mapping them back to the human genome on a single core (2.8GHz) of an AMD Opteron(tm) Processor 2220 at 100% sensitivity gave the following results (times are in hours:minutes:seconds):

Reads        36bp                     42bp
6 million    01:34:18 (2.08G RAM)     01:04:11 (1.90G RAM)
8 million    01:53:21 (2.42G RAM)     01:18:11 (2.19G RAM)
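
Since the original question was about reads aligned per minute, the timings above can be converted with a little arithmetic. The snippet below is just that conversion applied to the posted numbers; it works out to roughly 64,000 to 102,000 reads per minute on a single core.

```python
def to_seconds(hms):
    """Convert 'HH:MM:SS' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def reads_per_minute(n_reads, hms):
    """Average throughput for n_reads mapped in the given wall time."""
    return n_reads * 60 / to_seconds(hms)

# Timings as posted above (single core, 100% sensitivity, two mismatches).
runs = [
    ("6M x 36bp", 6_000_000, "01:34:18"),
    ("6M x 42bp", 6_000_000, "01:04:11"),
    ("8M x 36bp", 8_000_000, "01:53:21"),
    ("8M x 42bp", 8_000_000, "01:18:11"),
]
for label, n_reads, wall_time in runs:
    print(f"{label}: {reads_per_minute(n_reads, wall_time):,.0f} reads/min")
```

Note that 8 million reads take far less than 8/6 of the time of 6 million, in line with the scan of the reference being a large fixed cost per run.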

Would you like to see the timings when 3 or 4 mismatches are allowed? If so, I will post them later.

In ZOOM, the user can choose to take advantage of Illumina quality scores. Currently, ZOOM uses a user-specified threshold to separate high-quality bases from low-quality ones. ZOOM ignores mismatches at low-quality bases (without sacrificing much efficiency), since those mismatches are likely due to sequencing errors. However, unlike MAQ, ZOOM does not consider quality scores when doing assembly. Maybe MAQ's way is better.
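
The thresholding idea described above can be sketched as a mismatch count that skips low-quality positions. The function name, the quality scale and the threshold value are assumptions for illustration only; ZOOM's actual implementation is not shown in this thread.

```python
def count_mismatches(read, quals, ref_segment, min_quality):
    """Count mismatches, ignoring positions below the quality threshold.

    read, ref_segment: equal-length sequences.
    quals: one numeric quality score per base of the read.
    """
    if not (len(read) == len(quals) == len(ref_segment)):
        raise ValueError("read, quals and ref_segment must have equal length")
    return sum(
        1
        for base, qual, ref_base in zip(read, quals, ref_segment)
        if base != ref_base and qual >= min_quality
    )

# Toy example: two raw mismatches, one of them at a low-quality base.
read, ref = "ACGT", "AGGA"
quals = [30, 5, 30, 30]
print(count_mismatches(read, quals, ref, min_quality=20))
print(count_mismatches(read, quals, ref, min_quality=0))
```

A hard threshold like this is simple and cheap, at the price of treating a score just under the cutoff very differently from one just over it.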

Yes, ZOOM can report multiple alignments for each read. It can report either unique hits or the top-N best mapping results for each read.
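
As a rough illustration of the two reporting modes, the sketch below ranks a read's hits by mismatch count. The function name is hypothetical, and the definition of "unique" used here (exactly one hit at the lowest mismatch count) is an assumption; the thread does not say how ZOOM defines it.

```python
def report_hits(hits, top_n=None, unique_only=False):
    """Select which alignments to report for one read.

    hits: list of (position, mismatches) tuples.
    unique_only: report the best hit only if no other hit ties with it
                 (an assumed definition of "unique").
    top_n: otherwise, report at most this many hits, best first.
    """
    if not hits:
        return []
    if unique_only:
        best = min(mismatches for _, mismatches in hits)
        best_hits = [hit for hit in hits if hit[1] == best]
        return best_hits if len(best_hits) == 1 else []
    # Fewest mismatches first; position breaks ties deterministically.
    ranked = sorted(hits, key=lambda hit: (hit[1], hit[0]))
    return ranked if top_n is None else ranked[:top_n]

hits = [(100, 2), (40, 0), (7, 1), (90, 0)]
print(report_hits(hits, top_n=2))
print(report_hits(hits, unique_only=True))
```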

In general, gapped mode is about five times slower than un-gapped mode. We'll speed up the DP later, since we expect more gaps as reads get much longer.

Well, as for the reasons for adding a GUI, hmm, the first that comes to mind is that all BioinformaticsSolutions software has a GUI: PatternHunter, PEAKS... Not a good reason, right? Here are three more:
1. It will make things easier for those who are not familiar with Linux or the command line.
2. With a GUI, you can run ZOOM from your desktop computer to monitor progress or automatically submit/control multiple jobs on your multi-server cluster.
3. Later, ZOOM will go beyond being an aligner. Post-processing of mapping results will be integrated, such as SNP finding, small RNA finding or ChIP-Seq analysis. A GUI may be more intuitive for that.

I have discussed this with Zefeng Zhang, the main developer of ZOOM. It's not difficult to extend the read length to 128bp~256bp in ZOOM. Since you will need support for longer reads very soon, we'll make it the first priority after this version is released next week.

Hao Lin




Last edited by spirit; 08-11-2008 at 09:29 AM.
Old 08-11-2008, 09:28 AM   #66
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

Hi spirit.

I didn't know that my blog was THAT famous.... certainly I didn't think I had that many people reading it. I'm glad to know it's been useful to you, however. (=

As for the benchmarks, thank you VERY much for posting them. I'm very excited to try your program when I return from my vacation at the end of August. In the meantime, I'll let other people at the GSC know to check for the demo. Unfortunately, not everyone here is Academic, so I'm not sure how that will work for your license.

Just to try to keep this thread short:

* 3-4 mismatch benchmarks: I don't think I'll need them yet, but I will give it a try on my own data when I have the application.

* Probabilities: Sounds good. I'll think about this for a while. Your approach sounds relatively simple, but may be "good enough" for now. The only way to know is to try it out.

* Multiple Alignments: EXCELLENT!

* Gapped mode: I think 5x slower is not a bad price to pay, particularly if the unique matches are filtered out in a first pass. That's very encouraging!

* GUI: Thanks for answering my question. The first couple of reasons seem pretty weak, but I would certainly believe the last one has some merit. I don't really see anyone at the GSC using a GUI for any of those reasons, because of the high throughput volume we use, but I'm sure there are plenty of other people who would appreciate it.

Thanks again for your answers and for responding to my comments and suggestions so quickly! I'm looking forward to giving ZOOM a try.

Anthony
Old 08-20-2008, 11:45 AM   #67
xxqtony
Junior Member
 
Location: CA

Join Date: Jun 2008
Posts: 9

Any package available for (NGS) SAGE tag mapping to RefSeq/genome etc? Thanks.
Old 08-21-2008, 07:45 AM   #68
Janine Voyer
Junior Member
 
Location: Waterloo, Ontario, Canada

Join Date: Aug 2008
Posts: 2

There has been some discussion in this thread about ZOOM. I wanted to let everyone know that the demo is now available. Please send me an email at [email protected] to request a 30 day free demo.
Old 08-21-2008, 07:49 AM   #69
kmay
Member
 
Location: Munich, Germany

Join Date: Aug 2008
Posts: 29

-> xxqtony:

The Genomatix Mapping Station could do it for you. If you can get us your data, we'd be happy to map them for you. You can see this as a test case and share your experiences here in the forum...

Cheers

Klaus
Old 08-21-2008, 08:39 AM   #70
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,352

Attention those with commercial interests posting in this thread.

Please check out this thread: Towards Forming a Policy on Commerical Posts (OPEN FOR DISCUSSION)

Also, I welcome comments from anyone else on that topic!
Old 08-27-2008, 05:28 PM   #71
hongxu
Junior Member
 
Location: Hangzhou,China

Join Date: Mar 2008
Posts: 1

Hi all,

another good tool for ChIPSeq analysis is:
http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/
Old 08-28-2008, 02:12 AM   #72
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324

I have now tried the SISSRS method, but as far as I can tell it does nothing more than produce a list of peak maxima from aligned positions.
Old 09-21-2008, 05:22 AM   #73
motan
Junior Member
 
Location: UK

Join Date: Sep 2008
Posts: 3

Something to add to your list.

A few months ago my lab purchased a site license for <edited by ECO>.

It is a very innovative assembler (automatic batch assembly, automatic mismatch correction, automatic low quality ends trimming and other stuff).

I think the web address was <edited by ECO>.

Last edited by ECO; 09-21-2008 at 11:07 PM. Reason: Clearly shill posting off-topic commercial content.
Old 09-21-2008, 11:01 PM   #74
myrna
Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 44

Quote:
Originally Posted by motan View Post
Something to add to your list.

Few months ago my lab purchased a site license of <edited by ECO>.
It is a very innovative assembler (automatic batch assembly, automatic mismatch correction, automatic low quality ends trimming and other stuff).
I think the web address was <edited by ECO>
This seems pretty off-topic. From the information on the DNAbaser website, it only handles capillary sequence data. Not at all what this forum is about.
Old 09-21-2008, 11:04 PM   #75
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,352

Quote:
Originally Posted by myrna View Post
This seems pretty off-topic. From the information on the DNAbaser website, it only handles capillary sequence data. Not at all what this forum is about.
Good catch myrna.

Looks like he's acting (poorly) as a shill for software he developed:

http://www.vadino.com/education/misc/dna-baser.html

...the email on the right of that page corresponds to the one he used to register on this site.
Old 09-22-2008, 02:55 AM   #76
TheLight
Junior Member
 
Location: USA

Join Date: Sep 2008
Posts: 5

Quote:
Originally Posted by spirit View Post
Thank you for your interest. I will answer these questions as I could.

Wow. You are well documented.
Old 10-16-2008, 02:04 PM   #77
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,352

Added two more ChIP-SEQ tools.
Old 10-20-2008, 03:42 PM   #78
jerryliu
Junior Member
 
Location: Maryland

Join Date: Aug 2008
Posts: 4

Hi everyone,

I'm new to this forum. Thanks to Sci_guy for posting this one-stop-shop article of NextGen sw packages. It is very useful.

We have just started using the Illumina GAII for our genome project (mostly microbial genome and transcriptome sequencing) and would like to hear people's experiences with assembly tools, especially with hybrid approaches such as 454/Illumina, Sanger/Illumina, etc. With so many tools out there, can someone suggest their favorite de novo assembler / aligner and the reasons for their choice?

Recently I tried a new short read aligner called Bowtie (http://bowtie-bio.sourceforge.net/) designed for fast mapping of Illumina reads. It was developed by Steven Salzberg's group at University of Maryland and is claimed to be 10 times faster than MAQ. Bowtie is open source and comes with a script to convert its output to use MAQ's downstream SNP tools.

I tried it and it was pretty easy to use and it WAS fast. I am wondering if anybody else has tried this tool and can share its pros and cons compared with other existing tools.

Thanks!
Jerry Liu
Old 10-20-2008, 03:48 PM   #79
myrna
Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 44
Bowtie

Hi Jerry.
How well (if at all) are indels handled by Bowtie? I have found Novoalign to handle gaps better than Maq and SOAP. By the way, I am looking forward to seeing the release of TopHat.
Regards,

Ryan
Old 10-20-2008, 04:25 PM   #80
jerryliu
Junior Member
 
Location: Maryland

Join Date: Aug 2008
Posts: 4

Hi Ryan,

According to its manual (http://bowtie-bio.sourceforge.net/manual.html), it currently does not support indels, PE reads, or ABI color space.

Jerry
All times are GMT -8.