SEQanswers

03-09-2010, 06:13 PM   #1
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Ray: a NEW MPI-based 100% parallel genome assembler

Dear SeqAnswers:

The Ray Project Team gives you a 100% parallel MPI-based assembler called Ray. Ray is NOW available at http://sourceforge.net/projects/denovoassembler/files/. It supports Illumina paired-end reads. It is 100% parallel, and it is a single executable (no pesky perl scripts!). The source code is licensed with the GPL-v3.

Try it, and give us your comments, bugs, suggestions, and concerns on our mailing list.
http://lists.sourceforge.net/lists/l...ssembler-users

Ray-0.0.3: a NEW MPI-based parallel genome assembler
http://sourceforge.net/mailarchive/f...ssembler-users

***
The Ray Project Team
http://denovoassembler.SourceForge.net/

03-09-2010, 09:44 PM   #2
Torst
Senior Member

Location: The University of Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 275

Quote:
Originally Posted by seb567
The Ray Project Team gives you a 100% parallel MPI-based assembler called Ray. Ray is NOW available at http://sourceforge.net/projects/denovoassembler/files/. It supports Illumina paired-end reads. It is 100% parallel, and it is a single executable (no pesky perl scripts!). The source code is licensed with the GPL-v3.
The README.txt is confusing in some of the sections. I hope you can help me clarify them.

[Q] It says "if your sff file contains paired-end reads, you must first extract the information, and tell Ray to use them with LoadPairedEndReads". Do you mean we should extract as FASTA with sffinfo, remove the linker, and create a .fasta/.fastq file?

[Q] Is "OpenAssembler" the same software as "Ray" ?

[Q] "OpenAssembler assembles Illumina reads or 454 + Illumina reads, or any combination without non-random error incorporation.". Can you explain what you mean by "random error incorporation" ?

[Q] How critical are the values of "<fragmentLength>" and "<fragmentLengthStandardDeviation>" to the assembly? Are they just starting points for estimating the true value?

[Q] Does "Ray" use the quality values in the FASTQ file for anything?

[Q] What does "Ray" do if I provide it with really long sequences, such as contigs from another assembly?

Thank you for your time,

Torsten

Last edited by Torst; 03-09-2010 at 09:46 PM. Reason: Added two more questions.

03-09-2010, 11:38 PM   #3
KevinLam
Senior Member

Location: SEA

Join Date: Nov 2009
Posts: 197

Does it support colorspace?

03-10-2010, 05:42 AM   #4
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Ray -- questions & answers!

[Q] It says "if your sff file contains paired-end reads, you must first extract the information, and tell Ray to use them with LoadPairedEndReads". Do you mean we should extract as FASTA with sffinfo, remove the linker, and create a .fasta/.fastq file?

That is right. Ray does not create paired-end reads from an SFF file.

[Q] Is "OpenAssembler" the same software as "Ray" ?

No, but Ray is a parallel implementation of the OpenAssembler algorithm. The paper describing OpenAssembler is still under review (submitted on 15 October 2009...), and one of its weaknesses is that it is not parallel, thus not scalable. So, I started coding Ray (started on 2010-01-21), and I decided to put it on the web to get feedback.

[Q] "OpenAssembler assembles Illumina reads or 454 + Illumina reads, or any combination without non-random error incorporation.". Can you explain what you mean by "random error incorporation" ?

When an error occurs, it should occur randomly. The 454 homopolymer errors are not randomly observed; they occur in homopolymer stretches more often. In the OpenAssembler paper (under review since 15 October 2009) we show however that Illumina's error incorporation is random, and that 454+Illumina also has random error incorporation. The take-home message is that randomly incorporated errors are easy to detect and fix, whereas reproducible errors are defective-by-design.

Illumina errors are distributed across the whole read, with more observed errors at the end. 454 errors are mostly related to homopolymers; for instance, you will observe both ATCTAGCAAAAATACGCAT and ATCTAGCAAAAAATACGCAT with the same abundance (notice the length of the A run: 5 versus 6).
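
To make the homopolymer point concrete, here is a minimal Python sketch (not part of Ray; the helper name run_lengths is made up for illustration) that run-length encodes the two example reads. It shows that they contain the same bases in the same order and differ only in the length of one A run.

```python
from itertools import groupby

def run_lengths(seq):
    """Return the read as a list of (base, run_length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]

read_a = "ATCTAGCAAAAATACGCAT"
read_b = "ATCTAGCAAAAAATACGCAT"

runs_a = run_lengths(read_a)
runs_b = run_lengths(read_b)

# Same bases in the same order: only the run lengths can differ.
same_bases = [base for base, _ in runs_a] == [base for base, _ in runs_b]

# Keep the runs whose length differs between the two reads.
differences = [(x, y) for x, y in zip(runs_a, runs_b) if x != y]

print(same_bases)
print(differences)
```

Because both variants are seen with similar abundance, a filter based purely on abundance cannot tell which one is the error, which is what makes this kind of error non-random.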

[Q] How critical are the values of "<fragmentLength>" and "<fragmentLengthStandardDeviation>" to the assembly? Are they just starting points for estimating the true value?

You should know the true values before running Ray. For instance, the SRA001125 dataset contains paired-end Illumina reads for E. coli K12 MG1655. Usually, if you have paired-end data, you should know the geometry (fragmentLength+deviation) of your reads.

An example of that:

[boiseb01@ls30 SRA001125]$ echo "LoadPairedEndReads 200xSRR001665_1.fastq 200xSRR001665_2.fastq 215 20
LoadPairedEndReads 200xSRR001666_1.fastq 200xSRR001666_2.fastq 215 20" > input
[boiseb01@ls30 SRA001125]$ /home/boiseb01/software/ompi-1.4.1-gcc/bin/mpirun -np 31 /home/boiseb01/Ray/trunk/Ray ./input |tee Log
[boiseb01@ls30 SRA001125]$ ls -l Contigs.fasta
-rw-rw-r--. 1 boiseb01 boiseb01 4710363 2010-03-09 17:01 Contigs.fasta
[boiseb01@ls30 SRA001125]$ grep '>' Contigs.fasta |wc -l
224

As such, we get 224 contigs of at least 100 nt for this small bug.

If you provide paired-end reads, you need to provide accurate values for <fragmentLength> and <fragmentLengthStandardDeviation>.
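
For several libraries, the commands file can be generated rather than typed. The sketch below is only an illustration in Python: the line format (LoadPairedEndReads, two files, fragment length, standard deviation) is copied from the session above, while the helper names paired_end_command and write_commands_file are hypothetical and not part of Ray.

```python
def paired_end_command(left, right, fragment_length, std_dev):
    """Format one LoadPairedEndReads line as used in the session above."""
    return f"LoadPairedEndReads {left} {right} {fragment_length} {std_dev}"

def write_commands_file(path, libraries):
    """Write one command per paired-end library (hypothetical helper)."""
    with open(path, "w") as handle:
        for library in libraries:
            handle.write(paired_end_command(*library) + "\n")

# (left reads, right reads, fragment length, standard deviation)
libraries = [
    ("200xSRR001665_1.fastq", "200xSRR001665_2.fastq", 215, 20),
    ("200xSRR001666_1.fastq", "200xSRR001666_2.fastq", 215, 20),
]

commands = [paired_end_command(*library) for library in libraries]
print("\n".join(commands))
```

Calling write_commands_file("input", libraries) would produce the same two-line file that the echo command creates.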

[Q] Does "Ray" use the quality values in the FASTQ file for anything?

No, Ray auto-calibrates itself using abundance of k-mers.
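
As a rough illustration of why k-mer abundance can stand in for quality values, here is a small Python sketch. It is not Ray's code: the toy reads, the value of k, and the min_abundance threshold are all assumptions chosen for the example. A k-mer that contains a random error shows up far less often than a correct one.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Three correct copies of a read and one copy with a single mismatch.
reads = ["ATCTAGCA", "ATCTAGCA", "ATCTAGCA", "ATCTTGCA"]
counts = count_kmers(reads, k=5)

min_abundance = 2  # hypothetical threshold, not a Ray parameter
solid = {kmer for kmer, n in counts.items() if n >= min_abundance}
weak = set(counts) - solid

print(sorted(solid))
print(sorted(weak))
```

With real data the threshold would have to come from the observed abundance distribution rather than from a fixed constant.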

[Q] What does "Ray" do if I provide it with really long sequences, such as contigs from another assembly?

Try it! I don't know.

My benchmarks so far include:

* SRA001125 paired (E. coli K12 MG1655, Illumina data)
* S. pneumoniae R6 50-nt reads, 50 X
* S. pneumoniae R6 50-nt reads, 50 X, 1% random mismatches
* E. coli K12 MG1655, 400-nt reads, 50 X
* Human chromosome 1, 50-nt reads
* Pseudomonas aeruginosa, 50-nt reads, 50 X

[Q] Does it support colorspace?

Currently, only fasta, fastq, and SFF.

As I understand, there is a bijection between strings from {A,T,C,G} and reads from {0,1,2,3}, with each color corresponding to a nucleotide given the previous one. I have not looked into that yet, but I don't think the algorithm is going to change a lot with that taken into consideration.

My guess is that one could do the assembly in color-space (the alphabet size is 4 too), and then convert the color-space contigs to nucleotide-space.
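
The bijection can be sketched in a few lines of Python. This is an illustration, not Ray's code: it uses the usual SOLiD two-base encoding, where each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3), and the function names encode and decode are made up. Decoding needs one known base to start from, here the primer base.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}

def encode(primer, seq):
    """Nucleotides -> colors; the first color pairs the primer base with seq[0]."""
    colors = []
    previous = primer
    for base in seq:
        colors.append(str(CODE[previous] ^ CODE[base]))
        previous = base
    return "".join(colors)

def decode(primer, colors):
    """Colors -> nucleotides, walking forward from the known primer base."""
    bases = []
    previous = CODE[primer]
    for color in colors:
        previous ^= int(color)
        bases.append(BASES[previous])
    return "".join(bases)

seq = "ATCTAGCA"
colors = encode("T", seq)
round_trip = decode("T", colors)

print(colors)
print(round_trip)
```

The round trip returns the original sequence, which is the sense in which the two representations are equivalent once a starting base is fixed.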



I hope it helps!


***
The Ray Project Team
http://denovoassembler.sf.net/

03-10-2010, 09:55 AM   #5
bioinfosm
Senior Member

Location: USA

Join Date: Jan 2008
Posts: 482

Have you had a chance to compare it with the existing assemblers for accuracy, time, and memory?
__________________
--
bioinfosm

03-10-2010, 10:24 AM   #6
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

Ray is based on the OpenAssembler algorithm, and we did compare OpenAssembler with ABySS, EULER, Newbler, and Velvet for our submitted paper on OpenAssembler (submitted on 15 October 2009).

We compared on SRA001125, an Illumina paired-end reads dataset, and SRA003611, a mix of 454 and Illumina reads, as well as many simulated datasets (with and without randomly incorporated errors). The conclusion was that only OpenAssembler can mix technologies, and that OpenAssembler is the best on Illumina paired and non-paired data.
EULER was the worst (the very worst!) in my benchmarks. Velvet was very good, and Newbler was the best on 454 (only Newbler worked with 454 in my benchmarks).
Because OpenAssembler auto-learns from the data instead of trying to figure out the statistics like in Velvet (they created VelvetOptimizer to alleviate that shortcoming!), we think the usability is better.

In principle, if errors are incorporated randomly (Illumina and SOLiD), and if the coverage is rather uniform (any technology, I think), then we have strong theoretical support to say that the conservative approach of OpenAssembler disallows any misassemblies (chimeric contigs), although we observed some mismatches. This theoretical support is provided by a set of rules, heuristics, and some invariants.

Tired of waiting for the reviewing process, I decided to start Ray and release its source code as soon as possible!

Accuracy:

OpenAssembler does not produce chimeric contigs, but produces some mismatches when errors are present in reads (28 mismatches for SRA001125!). On SRA001125, Ray produces no chimeric contigs, and a few mismatches.

Time:

For SRA001125, with all the reads from NCBI SRA, it takes about 30 minutes on 31 MPI processes. Human chromosome 1, with 50-nt reads at 50 X takes about 2 hours on 400 MPI processes (Itanium, Infiniband).

Memory:

Ray is gentle with memory usage. It uses SplayTree (who uses splay trees anyway??). In a splay tree, keys that are accessed often stay near the root, whereas keys accessed only a few times end up near the leaves.
Ray distributes everything across MPI processes: reads, paired-end linkages, vertices, arcs, seeds, extensions, fusions, finished fusions. To communicate, Ray uses about 90 message types, so Ray instances like to communicate!

If you want to know about memory usage, check Vertex.h. The coverage is stored in a uint8_t, edges are stored in a uint8_t, and there are some linked lists too.
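
One way a single byte can be enough for the edges is sketched below in Python. The layout is an assumption made for illustration (Vertex.h is the authority on what Ray really does): a k-mer vertex has at most 4 outgoing and 4 incoming arcs, one per nucleotide, so 8 bits suffice, and an 8-bit coverage counter is assumed here to saturate at 255.

```python
BASES = "ACGT"

def add_outgoing(edges, base):
    """Set one of the low 4 bits: an arc to the k-mer ending with base."""
    return edges | (1 << BASES.index(base))

def add_ingoing(edges, base):
    """Set one of the high 4 bits: an arc from the k-mer starting with base."""
    return edges | (1 << (4 + BASES.index(base)))

def outgoing(edges):
    return [base for i, base in enumerate(BASES) if edges & (1 << i)]

def ingoing(edges):
    return [base for i, base in enumerate(BASES) if edges & (1 << (4 + i))]

def bump_coverage(coverage):
    """Increment an 8-bit counter without wrapping (assumed behaviour)."""
    return min(coverage + 1, 255)

edges = 0
edges = add_outgoing(edges, "A")
edges = add_outgoing(edges, "T")
edges = add_ingoing(edges, "G")

print(edges, outgoing(edges), ingoing(edges))
```

Under these assumptions each vertex costs two bytes for coverage and edges, plus whatever the tree node and linked lists add.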

In Ray, there is no tip cutting, and no bubble popping, which makes it a very different approach in comparison with Velvet/SOAPdenovo/ABySS/EULER.

But remember, a genome assembler is like an interpreter (python/perl/ruby), and its execution depends on the program (the reads) you give it, so you can't really summarise things that much.

Enjoy Ray!

***
The Ray Project Team
http://denovoassembler.sf.net/

03-10-2010, 10:51 AM   #7
nilshomer
Nils Homer

Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by seb567
[Q] Does it support colorspace?

Currently, only fasta, fastq, and SFF.

As I understand, there is a bijection between strings from {A,T,C,G} and reads from {0,1,2,3}, with each color corresponding to a nucleotide given the previous one. I have not looked into that yet, but I don't think the algorithm is going to change a lot with that taken into consideration.

My guess is that one could do the assembly in color-space (the alphabet size is 4 too), and then convert the color-space contigs to nucleotide-space.
SOLiD will require some thought as errors, variants and combinations thereof manifest differently with respect to each other. Hopefully you will embrace SOLiD data as many groups/labs are clamoring for an easy and powerful assembler for SOLiD data.
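
A small Python sketch of how errors and variants manifest differently, using the usual SOLiD two-base encoding (each color is the XOR of the 2-bit codes of adjacent bases, A=0, C=1, G=2, T=3). The sequences and helper names are made up for illustration. A true SNP changes two adjacent colors, whereas a single miscalled color shifts every base decoded after it.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}

def encode(primer, seq):
    colors, previous = [], primer
    for base in seq:
        colors.append(str(CODE[previous] ^ CODE[base]))
        previous = base
    return "".join(colors)

def decode(primer, colors):
    bases, previous = [], CODE[primer]
    for color in colors:
        previous ^= int(color)
        bases.append(BASES[previous])
    return "".join(bases)

def mismatches(a, b):
    """Positions at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

reference = "ATCTAGCA"
ref_colors = encode("T", reference)

# A real SNP (A -> C at position 4) changes two adjacent colors.
snp_colors = encode("T", "ATCTCGCA")
snp_color_changes = mismatches(ref_colors, snp_colors)

# A single color miscall at position 4 corrupts every base decoded after it.
bad_colors = ref_colors[:4] + "0" + ref_colors[5:]
bad_decoded = decode("T", bad_colors)
error_base_changes = mismatches(reference, bad_decoded)

print(snp_color_changes)
print(bad_decoded, error_base_changes)
```

This is why naively translating color reads to nucleotides before assembly is fragile, and why an assembler has to treat isolated color changes and paired color changes differently.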

03-10-2010, 06:18 PM   #8
KevinLam
Senior Member

Location: SEA

Join Date: Nov 2009
Posts: 197

Quote:
Originally Posted by nilshomer
SOLiD will require some thought as errors, variants and combinations thereof manifest differently with respect to each other. Hopefully you will embrace SOLiD data as many groups/labs are clamoring for an easy and powerful assembler for SOLiD data.
AGREED
AFAIK, they do not have their own assembler but rely on conversion scripts to feed into Velvet, which we all know is very memory hungry.

03-11-2010, 10:40 AM   #9
bioinfosm
Senior Member

Location: USA

Join Date: Jan 2008
Posts: 482

Thanks seb567 for the detailed response.

One point I missed earlier is the approximate coverage recommended for Ray with Solexa data. I know Velvet recommends ~40x, and is not efficient with more or less coverage.

Also, I heard MIRA is an assembler that combines data from various read technologies.
__________________
--
bioinfosm

03-11-2010, 05:37 PM   #10
Mizzou55
Junior Member

Location: st. louis

Join Date: Mar 2010
Posts: 7

Is there any limit to the number of Illumina reads Ray can handle? We were going to try it on a 200 Mb worm that's repetitive and has high heterozygosity. What do you think, too big?

03-11-2010, 06:02 PM   #11
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

@bioinfosm The SRA001125 dataset has about 109 X coverage. I think something between 30 and 100 is adequate for Illumina data.
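
For planning a run, coverage relates to read count through coverage = (number of reads x read length) / genome size. The Python sketch below applies that formula; the genome size (about 4.6 Mb for E. coli K12) and the 50-nt read length are assumptions picked for the example, not figures for SRA001125.

```python
import math

def coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Smallest read count that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

genome_size = 4_600_000  # assumed approximate genome size, in nt
read_length = 50         # assumed read length, in nt

low = reads_needed(30, read_length, genome_size)
high = reads_needed(100, read_length, genome_size)

print(low, high)
print(coverage(low, read_length, genome_size))
```

Under these assumptions, the suggested 30 X to 100 X range corresponds to roughly 2.8 to 9.2 million 50-nt reads.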

@Mizzou55 You will need paired-end reads. What is your read length? Fragment length? Ray can handle as many reads as fit in the available distributed memory. Please note that you need Open-MPI, not MPICH2 or MVAPICH, because runs with those libraries crash whereas runs with Open-MPI do not. Ray MPI processes always send small messages, and Open-MPI always sends small messages eagerly, but MPICH2-based MPI implementations apparently lack that behavior. As for the high heterozygosity, Ray does not support that right now, because Ray currently sees it as non-random error incorporation. I am currently working on color-space for the upcoming release, version 0.0.4, but heterozygosity is the next feature I will add.


Thanks!

**
The Ray Project Team
http://denovoassembler.sf.net/

03-12-2010, 08:13 AM   #12
Mizzou55
Junior Member

Location: st. louis

Join Date: Mar 2010
Posts: 7

For the worm genome we have 100bp reads and two insert sizes: 300 and 400 (PE). We will have 30-40X. Assuming the heterozygosity issue is resolved, would you anticipate better results than SOAP or ABySS with this data input?

03-22-2010, 06:29 PM   #13
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

@Mizzou55: I don't know, honestly, if your data are better assembled with a specific tool.

Last edited by seb567; 03-22-2010 at 06:33 PM. Reason: spelling

03-22-2010, 06:31 PM   #14
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

Dear SeqAnswers.com community:

Ray 0.0.4 is now available for download.

Changes:

https://sourceforge.net/mailarchive/...ssembler-users

Download Ray 0.0.4:

https://sourceforge.net/projects/den...r.bz2/download

Thank you.

***
The Ray Project Team
http://denovoassembler.sf.net/

03-22-2010, 07:59 PM   #15
KevinLam
Senior Member

Location: SEA

Join Date: Nov 2009
Posts: 197

Nice, SOLiD support is in already.
But darn, on CentOS 5.4 Open MPI is version 1.3.2, so I had compile errors. Still messing around with it.

03-23-2010, 03:38 AM   #16
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

@KevinLam

Indeed, I started the development for color space using these datasets:

http://solidsoftwaretools.com/gf/project/dh10bfrag/
http://solidsoftwaretools.com/gf/project/ecoli2x50/

However, these data contain too many errors (in color space) to be assembled de novo (in color space), in my opinion. My estimation is that the error rate in color space ranges from 8% to 12% for these two datasets. That would explain the total lack of de novo assemblies performed so far with SOLiD technology.

So, you are free to try Ray with csfasta files, but it is not 100% tested yet.

Perhaps the latest version of the SOLiD sequencer produces more reliable readouts, but that I don't know. I am sure someone else on SeqAnswers.com is more aware of that than me.

Thank you, happy assembly!

***
The Ray Project Team
http://denovoassembler.sf.net/

03-29-2010, 06:41 AM   #17
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

Dear Ray enthusiasts:


Ray 0.0.5 is now available with these new features:

* Ray now outputs assemblies in AMOS format (with -a),
* Ray commands can be provided with a commands file (like in 0.0.3 and 0.0.4) as well as with command-line arguments, and
* Ray removes non-A-T-C-G letters at both ends of reads.
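
The end-trimming feature can be pictured with a short Python sketch. It is an illustration only, not Ray's implementation: the function name trim_ends is made up, and treating the alphabet as uppercase A, C, G, T is an assumption. Note that letters inside the read are left untouched, since the feature as described only concerns both ends.

```python
def trim_ends(read, alphabet="ACGT"):
    """Remove letters outside the alphabet from both ends of a read."""
    start, end = 0, len(read)
    while start < end and read[start] not in alphabet:
        start += 1
    while end > start and read[end - 1] not in alphabet:
        end -= 1
    return read[start:end]

print(trim_ends("NNACGTNACGN"))
print(repr(trim_ends("NNNN")))
```

The first call keeps the internal N, and the second returns an empty string because nothing usable remains.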

About Ray:

Ray is computer-controlled software that performs parallel de novo genome assemblies of next-gen sequencing data using the Message Passing Interface (MPI). It uses an assembly engine called Parallel_Ray_Engine.

Download Ray 0.0.5: https://sourceforge.net/projects/den...r.bz2/download

Mailing list: https://lists.sourceforge.net/lists/...ssembler-users

Statistics:

Ray 0.0.3 downloads since 2010-03-09: 63
Ray 0.0.4 downloads since 2010-03-22: 23
SeqAnswers Thread Views since 2010-03-09: 767

Test results (2010-03-28-3159-1): https://sourceforge.net/mailarchive/...ssembler-users

03-31-2010, 11:19 PM   #18
sparks
Senior Member

Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Colour space Alignment

Hi Kevin,
I had a quick look at your code for colour space, and I think you need to skip the first colour as well as the leading primer base on each read, because the first colour is made by the primer base plus the first base of the fragment. If you leave the first colour on, it will add an extra error into 3/4 of reads.

ColourSpaceLoader.cpp:63 t->copy(NULL,bufferForLine+2,readMyAllocator);// remove the leading T & first colour
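
The same idea in a small Python sketch, for readers who do not have the source open. It is only an illustration: the csfasta read is a made-up example, the helper names are hypothetical, and the colour code is the usual SOLiD XOR encoding (A=0, C=1, G=2, T=3). Skipping two characters leaves only the colours between fragment bases, while the primer base and first colour together still identify the first base of the fragment.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}

def split_csfasta_read(read):
    """Split a csfasta read into primer base, first colour, and the rest."""
    return read[0], read[1], read[2:]

def first_fragment_base(primer, first_colour):
    """The first colour encodes the step from the primer base to base 1."""
    return BASES[CODE[primer] ^ int(first_colour)]

read = "T33223231"  # made-up read: primer T, then colours for ATCTAGCA
primer, first_colour, fragment_colours = split_csfasta_read(read)

print(fragment_colours)
print(first_fragment_base(primer, first_colour))
```

Keeping primer and first_colour alongside the trimmed read would give the anchor needed to translate colour-space contigs back to nucleotides later.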

Colin

04-01-2010, 04:51 AM   #19
seb567
Senior Member

Location: Québec, Canada

Join Date: Jul 2008
Posts: 260

Dear sparks,

You are right. I changed +1 to +2 to skip the first color too.

p.s.: I (Sébastien Boisvert) developed Ray.

04-01-2010, 06:44 AM   #20
sparks
Senior Member

Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Colour Space

Hi Sebastien,
My apologies re the name mix-up. We have two lanes of 50bp PE from a bacterium to assemble in the next few weeks, so we'll give Ray a try. I'm thinking assembly in colour space isn't much different from that in nucleotide space, but after CS assembly we need to convert back to nucleotides. This could mean remembering the first colour of all the reads and their positions in the contigs, as the first colour and primer base give a reference for conversion. Are you doing this?
Thanks for giving us Ray. We'll let you know how it goes.
Colin

Quote:
Originally Posted by seb567
Dear sparks,

You are right. I changed +1 to +2 to skip the first color too.

p.s.: I (Sébastien Boisvert) developed Ray.

Tags
assembler, genome, illumina, mix

All times are GMT -8.