03-10-2010, 10:24 AM   #6

Ray is based on the OpenAssembler algorithm. For our paper on OpenAssembler (submitted on 15 October 2009), we compared OpenAssembler with ABySS, EULER, Newbler, and Velvet.

We compared them on SRA001125, an Illumina paired-end dataset, and on SRA003611, a mix of 454 and Illumina reads, as well as on many simulated datasets (with and without randomly incorporated errors). The conclusion was that only OpenAssembler can mix technologies, and that OpenAssembler is the best on Illumina paired and unpaired data.
EULER was the worst (the very worst!) in my benchmarks. Velvet was very good, and Newbler was the best on 454 (only Newbler worked with 454 in my benchmarks).
Because OpenAssembler learns automatically from the data instead of leaving you to figure out the statistics as in Velvet (they created VelvetOptimiser to alleviate that shortcoming!), we think its usability is better.

In principle, if errors are incorporated randomly (Illumina and SOLiD) and if the coverage is rather uniform (any technology, I think), then we have strong theoretical support to say that the conservative approach of OpenAssembler does not allow any misassemblies (chimeric contigs), although we did observe some mismatches. This theoretical support is provided by a set of rules, heuristics, and some invariants.

Tired of waiting for the reviewing process, I decided to start Ray and release its source code as soon as possible!


OpenAssembler does not produce chimeric contigs, but produces some mismatches when errors are present in reads (28 mismatches for SRA001125!). On SRA001125, Ray produces no chimeric contigs, and a few mismatches.


For SRA001125, with all the reads from NCBI SRA, it takes about 30 minutes on 31 MPI processes. Human chromosome 1, with 50-nt reads at 50X coverage, takes about 2 hours on 400 MPI processes (Itanium, InfiniBand).


Ray is gentle on memory. It uses a SplayTree (who uses those anyway??). In a splay tree, keys that are accessed often stay near the root, whereas keys that are accessed only a few times end up near the leaves.
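For readers who have never met a splay tree, here is a minimal sketch of the idea in C++. It is not Ray's SplayTree class; the names Node, splay, insert, and destroy are made up for illustration, and keys are assumed to be plain 64-bit integers. The point it shows is that every access rotates the accessed key up to the root, so hot keys stay cheap to reach.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative node type (hypothetical, not taken from Ray).
struct Node {
    uint64_t key;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(uint64_t k) : key(k) {}
};

// Rotate the left child of x above x.
Node* rotateRight(Node* x) {
    assert(x != nullptr && x->left != nullptr);
    Node* y = x->left;
    x->left = y->right;
    y->right = x;
    return y;
}

// Rotate the right child of x above x.
Node* rotateLeft(Node* x) {
    assert(x != nullptr && x->right != nullptr);
    Node* y = x->right;
    x->right = y->left;
    y->left = x;
    return y;
}

// Move `key` to the root if it is present; otherwise a node from the
// search path for `key` ends up at the root. Returns the new root.
Node* splay(Node* root, uint64_t key) {
    if (root == nullptr || root->key == key) return root;
    if (key < root->key) {
        if (root->left == nullptr) return root;
        if (key < root->left->key) {            // zig-zig (left-left)
            root->left->left = splay(root->left->left, key);
            root = rotateRight(root);
        } else if (key > root->left->key) {     // zig-zag (left-right)
            root->left->right = splay(root->left->right, key);
            if (root->left->right != nullptr) root->left = rotateLeft(root->left);
        }
        return (root->left == nullptr) ? root : rotateRight(root);
    }
    if (root->right == nullptr) return root;
    if (key > root->right->key) {               // zig-zig (right-right)
        root->right->right = splay(root->right->right, key);
        root = rotateLeft(root);
    } else if (key < root->right->key) {        // zig-zag (right-left)
        root->right->left = splay(root->right->left, key);
        if (root->right->left != nullptr) root->right = rotateRight(root->right);
    }
    return (root->right == nullptr) ? root : rotateLeft(root);
}

// Insert `key` (no duplicates); the inserted or found key becomes the root.
Node* insert(Node* root, uint64_t key) {
    if (root == nullptr) return new Node(key);
    root = splay(root, key);
    if (root->key == key) return root;
    Node* n = new Node(key);
    if (key < root->key) {
        n->right = root;
        n->left = root->left;
        root->left = nullptr;
    } else {
        n->left = root;
        n->right = root->right;
        root->right = nullptr;
    }
    return n;
}

// Free every node of the tree.
void destroy(Node* n) {
    if (n == nullptr) return;
    destroy(n->left);
    destroy(n->right);
    delete n;
}
```

Calling splay(root, k) on every lookup is what keeps frequently requested keys within a few pointer hops of the root, at the price of rewriting the tree on reads.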
Ray distributes everything across MPI processes: reads, paired-end linkages, vertices, arcs, seeds, extensions, fusions, finished fusions. To communicate, Ray uses about 90 message types, so Ray instances like to talk to each other!
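One common way to spread vertices over processes is to hash each k-mer and take the result modulo the number of ranks. The sketch below shows that idea only; it is an assumption for illustration, not a description of how Ray assigns owners. The names encodeKmer, mix64, and ownerRank are hypothetical, k is assumed to be at most 32, and in a real MPI program nRanks would come from MPI_Comm_size.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Pack a k-mer (k <= 32, alphabet ACGT) into 64 bits, 2 bits per nucleotide.
uint64_t encodeKmer(const std::string& kmer) {
    assert(kmer.size() <= 32);
    uint64_t code = 0;
    for (char c : kmer) {
        uint64_t bits = 0;
        switch (c) {
            case 'A': bits = 0; break;
            case 'C': bits = 1; break;
            case 'G': bits = 2; break;
            case 'T': bits = 3; break;
            default: assert(false && "unexpected nucleotide");
        }
        code = (code << 2) | bits;
    }
    return code;
}

// 64-bit mixing step (the finalizer used by MurmurHash3), so that
// similar k-mers do not pile up on the same rank.
uint64_t mix64(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// Rank that would own this k-mer under the assumed hash-modulo scheme.
int ownerRank(uint64_t kmerCode, int nRanks) {
    assert(nRanks > 0);
    return static_cast<int>(mix64(kmerCode) % static_cast<uint64_t>(nRanks));
}
```

With such a scheme every process can compute the owner of any vertex locally, so a message about a vertex can be sent straight to the right rank without a lookup table.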

If you want to know about memory usage, check Vertex.h. The coverage is stored in a uint8_t, the edges are stored in a uint8_t, and there are some linked lists too.
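To give a feel for how little that is per vertex, here is a hedged sketch of a two-byte vertex record. It is not the content of Vertex.h: the struct name CompactVertex, the saturating counter, and the bit layout of the edge byte (low 4 bits for outgoing A/C/G/T, high 4 bits for ingoing) are assumptions for illustration, and the linked lists are left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical compact vertex record (not Ray's actual Vertex class).
struct CompactVertex {
    uint8_t coverage = 0;  // number of observations, capped at 255
    uint8_t edges = 0;     // assumed layout: bits 0-3 outgoing, bits 4-7 ingoing

    // Count one more observation without wrapping around past 255.
    void addCoverage() {
        if (coverage < UINT8_MAX) ++coverage;
    }

    // base: 0 = A, 1 = C, 2 = G, 3 = T
    void addOutgoing(int base) {
        assert(base >= 0 && base < 4);
        edges = static_cast<uint8_t>(edges | (1u << base));
    }

    void addIngoing(int base) {
        assert(base >= 0 && base < 4);
        edges = static_cast<uint8_t>(edges | (1u << (4 + base)));
    }

    bool hasOutgoing(int base) const { return ((edges >> base) & 1u) != 0; }
    bool hasIngoing(int base) const { return ((edges >> (4 + base)) & 1u) != 0; }
};

// Two one-byte fields and no virtual functions: the record is 2 bytes.
static_assert(sizeof(CompactVertex) == 2, "expected a 2-byte record");
```

A de Bruijn vertex has at most four possible successors and four possible predecessors, which is why a single byte is enough to record which arcs exist.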

In Ray, there is no tip cutting and no bubble popping, which makes it a very different approach from Velvet/SOAPdenovo/ABySS/EULER.

But remember, a genome assembler is like an interpreter (Python/Perl/Ruby): its execution depends on the program (the reads) you give it, so you can't really summarise things that much.

Enjoy Ray!

The Ray Project Team
seb567