03-10-2010, 05:42 AM   #4
seb567
Ray -- questions & answers!

[Q] It says "if your sff file contains paired-end reads, you must first extract the information, and tell Ray to use them with LoadPairedEndReads". Do you mean we should extract as FASTA with sffinfo, remove the linker, and create a .fasta/.fastq file?

That is right. Ray does not extract paired-end reads from SFF files.

[Q] Is "OpenAssembler" the same software as "Ray" ?

No, but Ray is a parallel implementation of the OpenAssembler algorithm. The paper describing OpenAssembler is still under review (submitted on 15 October 2009...), and one of OpenAssembler's weaknesses is that it is not parallel and thus not scalable. So I started coding Ray on 2010-01-21 and decided to put it on the web to get feedback.

[Q] "OpenAssembler assembles Illumina reads or 454 + Illumina reads, or any combination without non-random error incorporation.". Can you explain what you mean by "random error incorporation" ?

When an error occurs, it should occur at a random position. 454 homopolymer errors are not random: they occur more often in homopolymer stretches. In the OpenAssembler paper (under review since 15 October 2009), however, we show that Illumina's error incorporation is random, and that a 454+Illumina mix also has random error incorporation. The take-home message is that randomly incorporated errors are easy to detect and fix, whereas reproducible errors are defective by design.

Illumina errors are distributed across the whole read, with more errors observed toward the end. 454 errors are mostly related to homopolymers: for instance, you will observe both ATCTAGCAAAAATACGCAT and ATCTAGCAAAAAATACGCAT with the same abundance (note the length of the A run, 5 versus 6).
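To make the homopolymer point concrete, here is a minimal Python sketch (not part of Ray or OpenAssembler; the function names are made up for illustration). It run-length encodes the two example reads and shows that they differ only in the length of one A run, so they become identical once homopolymer runs are collapsed.

```python
from itertools import groupby

def run_length_encode(seq):
    """Return a list of (base, run_length) pairs for a DNA string."""
    return [(base, len(list(run))) for base, run in groupby(seq)]

def collapse_homopolymers(seq):
    """Keep a single base for every homopolymer run."""
    return "".join(base for base, _ in groupby(seq))

# The two reads quoted in the post above.
read_a = "ATCTAGCAAAAATACGCAT"
read_b = "ATCTAGCAAAAAATACGCAT"

runs_a = run_length_encode(read_a)
runs_b = run_length_encode(read_b)

# Both reads have the same number of runs, so zip() pairs them up one-to-one.
differing = [(a, b) for a, b in zip(runs_a, runs_b) if a != b]
same_after_collapse = collapse_homopolymers(read_a) == collapse_homopolymers(read_b)

print(differing)            # the only run that differs between the reads
print(same_after_collapse)  # whether the reads agree once runs are collapsed
```

Because both versions show up with similar abundance, counting alone cannot tell which run length is the true one. That is the sense in which such errors are not random.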

[Q] How critical are the values of "<fragmentLength>" and "<fragmentLengthStandardDeviation>" to the assembly? Are they just starting points for estimating the true value?

You should know the true values before running Ray. For instance, the SRA001125 dataset contains paired-end Illumina reads for E. coli K12 MG1655. Usually, if you have paired-end data, you should know the geometry (fragmentLength+deviation) of your reads.

For example:

[boiseb01@ls30 SRA001125]$ echo "LoadPairedEndReads 200xSRR001665_1.fastq 200xSRR001665_2.fastq 215 20
LoadPairedEndReads 200xSRR001666_1.fastq 200xSRR001666_2.fastq 215 20" > input
[boiseb01@ls30 SRA001125]$ /home/boiseb01/software/ompi-1.4.1-gcc/bin/mpirun -np 31 /home/boiseb01/Ray/trunk/Ray ./input |tee Log
[boiseb01@ls30 SRA001125]$ ls -l Contigs.fasta
-rw-rw-r--. 1 boiseb01 boiseb01 4710363 2010-03-09 17:01 Contigs.fasta
[boiseb01@ls30 SRA001125]$ grep '>' Contigs.fasta |wc -l
224

This gives 224 contigs of at least 100 nt for this small bug (E. coli).

If you provide paired-end reads, you need to provide accurate values for <fragmentLength> and <fragmentLengthStandardDeviation>.

[Q] Does "Ray" use the quality values in the FASTQ file for anything?

No, Ray calibrates itself automatically using the abundance of k-mers.
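As a rough illustration of what "abundance of k-mers" means, here is a small Python sketch. It is not Ray's actual calibration code: the function names, the toy reads, and the cutoff of 2 are assumptions for the example, and reverse complements are ignored. The idea is that k-mers containing a random error are seen rarely, while correct k-mers are seen about as often as the coverage depth.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())

# Toy data: five identical reads plus one read with a single mismatch (C -> G).
reads = ["ATCTAGCAT"] * 5 + ["ATCTAGGAT"]

counts = count_kmers(reads, k=5)
histogram = abundance_histogram(counts)

# Hypothetical cutoff: k-mers seen fewer than 2 times are treated as likely errors.
min_abundance = 2
rare = sorted(kmer for kmer, n in counts.items() if n < min_abundance)

print(histogram)
print(rare)
```

On real data the cutoff would be read off the histogram (the valley between the error peak and the coverage peak) rather than fixed by hand.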

[Q] What does "Ray" do if I provide it with really long sequences, such as contigs from another assembly?

I don't know. Try it!

My benchmarks so far include:

* SRA001125 paired (E. coli K12 MG1655, Illumina data)
* S. pneumoniae R6 50-nt reads, 50 X
* S. pneumoniae R6 50-nt reads, 50 X, 1% random mismatches
* E. coli K12 MG1655, 400-nt reads, 50 X
* Human chromosome 1, 50-nt reads
* Pseudomonas aeruginosa, 50-nt reads, 50 X

[Q] Does it support colorspace?

Not currently. Only FASTA, FASTQ, and SFF are supported.

As I understand, there is a bijection between strings from {A,T,C,G} and reads from {0,1,2,3}, with each color corresponding to a nucleotide given the previous one. I have not looked into that yet, but I don't think the algorithm is going to change a lot with that taken into consideration.

My guess is that one could do the assembly in color-space (the alphabet size is 4 too), and then convert the color-space contigs to nucleotide-space.
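The conversion between the two spaces can be sketched in a few lines of Python. This is only an illustration and nothing like it exists in Ray; the function names are made up. It uses the usual two-base color code, where each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). Note that the mapping is one-to-one only once the first nucleotide is known, so the sketch keeps it alongside the colors.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3

def encode_colors(seq):
    """Convert a non-empty nucleotide string to (first_base, color_string)."""
    colors = "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:]))
    return seq[0], colors

def decode_colors(first_base, colors):
    """Convert a color string back to nucleotides, given the first base."""
    bases = [first_base]
    for color in colors:
        # Each color, combined with the previous base, determines the next base.
        bases.append(BASES[CODE[bases[-1]] ^ int(color)])
    return "".join(bases)

first, colors = encode_colors("ATCTAGC")
print(first, colors)
print(decode_colors(first, colors))
```

One consequence of decoding base by base is that a single wrong color changes every nucleotide after it, which is one reason to assemble in color-space first and convert the contigs only at the end.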



I hope it helps!


***
The Ray Project Team
http://denovoassembler.sf.net/