SEQanswers

Old 04-01-2010, 07:19 AM   #21
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Hi sparks,

I am glad that Ray sparks interest.

Ray is not yet ready for color space. Ray loads color-space reads, builds a distributed de Bruijn graph in color space, and computes paths in that graph. The algorithm is pretty much the same, except that in color space the reverse-complement is simply the reverse (AA and TT have the same color). However, I have not implemented the conversion back to nucleotides yet because I have not figured out which starting base to use for decoding color-encoded paths.
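
To illustrate the reverse-complement remark, here is a minimal Python sketch. It assumes the usual SOLiD two-base encoding, in which a color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names are made up for this example and are not taken from Ray.

```python
BASES = "ACGT"  # index of each base is its assumed 2-bit code: A=0, C=1, G=2, T=3
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_colors(seq):
    """Encode a nucleotide sequence as the colors of its adjacent base pairs."""
    codes = [BASES.index(base) for base in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def reverse_complement(seq):
    """Return the reverse-complement of a nucleotide sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


seq = "TTGGTTT"
forward = to_colors(seq)                       # "010100"
backward = to_colors(reverse_complement(seq))  # "001010"

# In color space, the reverse-complement strand is just the reversed colors.
assert backward == forward[::-1]
# AA and TT have the same color (0).
assert to_colors("AA") == to_colors("TT") == "0"
```

Complementing both bases of a pair leaves their XOR unchanged, which is why only the order of the colors changes.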

In particular, these questions remain unanswered regarding color space:

Q1) If all color-space reads start with T, does that mean the decoding is done with T?

Q2) If some color-space reads start with T while others start with A, how do I sort things out?

Q3) What is the mismatch error rate of the various versions of the SOLiD instrument?

Thanks!

***
Sébastien Boisvert
The Ray Project Team
http://denovoassembler.sf.net/
Old 04-01-2010, 08:02 AM   #22
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Hi Sébastien,
For Q1 and Q2, the primer base and the first colour define the first base of the read, so you need to keep this for every read, along with where the read started in the contig. With some luck the contigs will be consistent with the first bases, so if you start the conversion at one read start, all the rest will match. I expect this might not work in practice (an error in the first colour), so maybe try a sliding window that selects the conversion matching the most first bases.
I haven't any experience of error rate yet.

Colin
Old 04-01-2010, 09:35 AM   #23
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by sparks View Post
Hi Sébastien,
For Q1 and Q2, the primer base and the first colour define the first base of the read, so you need to keep this for every read, along with where the read started in the contig. With some luck the contigs will be consistent with the first bases, so if you start the conversion at one read start, all the rest will match. I expect this might not work in practice (an error in the first colour), so maybe try a sliding window that selects the conversion matching the most first bases.
I haven't any experience of error rate yet.

Colin
You can also normalize the color reads to have the same starting adapter (say A). You convert the adapter and first color appropriately. You will then only need to store the first color.

Code:
original: T0010100
base: TTGGTTT
normalized: A3010100
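
The normalization in the example above can be sketched in a few lines of Python. This is only an illustration, assuming the usual SOLiD two-base encoding where a color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3); `decode` and `normalize` are hypothetical names, not functions from any existing tool.

```python
BASES = "ACGT"  # index of each base is its assumed 2-bit code: A=0, C=1, G=2, T=3


def decode(read):
    """Decode a color-space read (primer base followed by colors) into bases.

    The primer base itself is not part of the returned sequence.
    """
    code = BASES.index(read[0])
    bases = []
    for color in read[1:]:
        code ^= int(color)  # each color gives the transition to the next base
        bases.append(BASES[code])
    return "".join(bases)


def normalize(read, adapter="A"):
    """Rewrite a color-space read so that it starts with the given adapter base."""
    first_base = BASES.index(read[0]) ^ int(read[1])
    first_color = BASES.index(adapter) ^ first_base
    return adapter + str(first_color) + read[2:]


original = "T0010100"
normalized = normalize(original)  # "A3010100"

# Both encodings decode to the same base sequence, "TTGGTTT".
assert decode(original) == decode(normalized) == "TTGGTTT"
```

Only the adapter and the first color change; the remaining colors are copied as-is.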
Old 04-01-2010, 12:46 PM   #24
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Nils Homer: Correct me if I am wrong, but decoding a color-space read into a nucleotide representation will corrupt the decoded bases if at least one color is erroneous.

Edit: as sparks suggested, one can simply discard the starting base and the first color (in your example, T0010100 becomes 010100). But then, which base (A, C, G, or T) should be used for decoding the paths produced by Ray's algorithm? Thanks a lot for your expertise with the SOLiD sequencing technology!

Last edited by seb567; 04-01-2010 at 12:51 PM. Reason: added a point (indicated by 'Edit:')
Old 04-01-2010, 01:16 PM   #25
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by seb567 View Post
Nils Homer: Correct me if I am wrong, but decoding a color-space read into a nucleotide representation will corrupt the decoded bases if at least one color is erroneous.
The normalization procedure above produces a read that is still in color space, so proper base-space decoding can happen later. Its real benefit is that all the reads then share the same starting adapter. If you are worried about storing the first base and the first color for each color read, you can normalize the read and then only have to store the first color. The original and the normalized color-space reads produce the same base sequence, and are therefore equivalent encodings.

You are right that in the final alignment or assembly, naively decoding a color-space read without identifying the sequencing errors will produce incorrect bases after the sequencing error. However, most color-space aligners do identify the sequencing errors as part of the alignment and in the final result, and in this case your assembler should too.
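
A small Python sketch shows why naive decoding is fragile. It assumes the usual SOLiD two-base encoding (a color is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3); the `decode` helper and the corrupted read are invented for illustration.

```python
BASES = "ACGT"  # index of each base is its assumed 2-bit code: A=0, C=1, G=2, T=3


def decode(read):
    """Naively decode a color-space read (primer base + colors) into bases."""
    code = BASES.index(read[0])
    bases = []
    for color in read[1:]:
        code ^= int(color)
        bases.append(BASES[code])
    return "".join(bases)


clean = decode("T0010100")  # "TTGGTTT"
noisy = decode("T0011100")  # same read with the 4th color changed from 0 to 1

# Positions (0-based) where the two decoded sequences disagree.
mismatches = [i for i, (a, b) in enumerate(zip(clean, noisy)) if a != b]

# A single color error corrupts every base from the error onwards.
assert mismatches == [3, 4, 5, 6]
```

This is why the errors have to be identified in color space, before the conversion to bases.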
Old 04-01-2010, 04:41 PM   #26
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default It's equivalent

primer base + colour = 1st base = "A" + normalised colour --- requires 2 bits storage per read

Quote:
Originally Posted by nilshomer View Post
The normalization procedure above produces a read that is still in color space, so proper base-space decoding can happen later. Its real benefit is that all the reads then share the same starting adapter. If you are worried about storing the first base and the first color for each color read, you can normalize the read and then only have to store the first color. The original and the normalized color-space reads produce the same base sequence, and are therefore equivalent encodings.

You are right that in the final alignment or assembly, naively decoding a color-space read without identifying the sequencing errors will produce incorrect bases after the sequencing error. However, most color-space aligners do identify the sequencing errors as part of the alignment and in the final result, and in this case your assembler should too.
Old 04-26-2010, 06:33 AM   #27
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Cool Ray 0.0.7 compares very favorably with available short-read paired assemblers

Dear appreciated SEQanswers community:

Parallel software for parallel sequencing technologies

Ray 0.0.7, software that performs parallel de novo genome assemblies of next-generation sequencing data using the Message Passing Interface (MPI), is now available for download.

Download Ray 0.0.7: http://sourceforge.net/projects/deno...r.bz2/download
Wiki page: http://sourceforge.net/apps/mediawik...itle=Main_Page
Do-it-yourself examples: http://sourceforge.net/apps/mediawik...rself_examples
Review changes: http://sourceforge.net/apps/mediawik...?title=Changes
Mailing list: http://lists.sourceforge.net/lists/l...ssembler-users

Fewer contigs with Roche/454 and Illumina reads

We are delighted to report to SEQanswers that Ray 0.0.7 with Roche/454 and Illumina reads systematically outperforms Newbler on Roche/454 reads on three public datasets. Specifically, Ray computes fewer contigs with fewer errors while covering most of the coverable genome.

Review numbers: http://sourceforge.net/apps/mediawik..._for_Ray_0.0.7

de novo assembly with Illumina -- because outstanding quality and practical cost matter

Ray 0.0.7 also outperforms the other assemblers on public Illumina unpaired and paired datasets. Ray does better on simulated data as well, although simulated data are of limited use outside assembler development.

Review comparisons: http://sourceforge.net/apps/mediawik..._for_Ray_0.0.7

Scientific paper on its way

For those (numerous?) people looking for a Ray paper: I am working on my revised manuscript.

Conflicts of interest

None

Acknowledgments

This project is funded by the Canadian Institutes of Health Research (Institute of Genetics).

More information: http://sourceforge.net/apps/mediawik...cknowledgments



Thank you,

make this day an open assembly day!

-seb

---
Mr. Sébastien Boisvert
on behalf of the Ray Project Team
http://denovoassembler.sf.net/
Old 05-16-2010, 07:07 AM   #28
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50
Default Ray and genome size

Hi Seb,
Your assembler seems really promising. I was wondering whether it can also handle plant and animal genomes, which are very large (gigabases) and contain very long repeats.

One of the strengths of SOAPdenovo and ABySS is their ability to assemble really complex genomes like the human one. If I'm not mistaken, your benchmarks were run "only" on small genomes.

Thanks
Francesco
Old 05-17-2010, 07:09 AM   #29
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default Larger genomes -- not yet but coming soon!



Dear Mr. Francesco Vezzi, and SEQanswers great community,

First, you are right to say that Ray is currently benchmarked openly and only on small genomes.

On my roadmap, I am waiting for a paper to get published before continuing my work on larger genomes (the publish-or-perish thing). I will hopefully send my revised manuscript in the next few days, once I get OKs from my co-authors.

The next step (after the paper) is to help decode larger genomes, but it is hard to find the reads that go with a larger genome (and its reference).

You can't do much with just raw reads from an otherwise unsequenced and unassembled entity. N50 is cool, but it is not a critical assessment metric; it is just a number everyone blindly maximises.

The community reported that our benchmarks are only on small genomes ( http://seqanswers.com/forums/showthr...8643#post18643 ). We are currently working on the matter (larger genomes). Ray can handle them if the hardware requirements are met (InfiniBand, memory, and processors), but this is not extensively tested and larger genomes will probably need some accommodation.

Most assemblers (Velvet and EULER-SR, amongst others) sacrifice sequence quality for N50; at least that is what I understand from my open benchmarks.

In the early ages and stages of short-read assemblers, greedy approaches were at the crux of their behaviour (SSAKE, VCAKE, and SHARCGS): greed is locally good, but can be globally bad. They were evaluated with mostly nothing but the N50 measurement.




If you ask "What's N50 anyway?":

"The N50 size is computed by sorting all contigs from largest to smallest and by
determining the minimum set of contigs whose sizes total 50% of the entire genome.
The N50 size is the [one of the] smallest contig in that set."

Source: Bioinformatics 2005 http://dx.doi.org/doi:10.1093/bioinformatics/bti769
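
The definition quoted above can be written as a short Python sketch. This is an illustration only, not code from Ray; the function name `n50` and the optional `genome_size` argument are assumptions made for this example. When no genome size is given, the sum of the contig lengths is used instead, which is a common variant.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 of a set of contigs.

    Contigs are sorted from largest to smallest and accumulated until they
    total at least 50% of genome_size; the last contig added is the N50.
    If genome_size is None, the total assembly length is used instead.
    """
    if genome_size is None:
        genome_size = sum(contig_lengths)
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if 2 * running_total >= genome_size:
            return length
    raise ValueError("contigs cover less than 50% of the genome size")


lengths = [80, 70, 50, 40, 30, 20]

# Total is 290, so half is 145: 80 + 70 = 150 reaches it, and the N50 is 70.
assert n50(lengths) == 70
# Against a genome of 400, half is 200: 80 + 70 + 50 = 200, and the N50 is 50.
assert n50(lengths, genome_size=400) == 50
```

Note that nothing in this computation looks at whether the contigs are correct, which is the point made above about N50 being blindly maximised.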

You might want to read the (very short) paper above to get acquainted with misassemblies.







Not to get off-topic, but the greed thing is general.

Greed is locally good but globally [VERY] bad -- here are three examples with references:


(1)

Research funding is good for academic careers, think-tanks, (locally good) but apparently not good enough for healthcare patients (globally bad).

Too fundamental, not enough translational, they say.

==> http://www.nature.com/news/2010/1005....2010.243.html
==> http://www.newsweek.com/id/238078


(2)

Finance powerhouse makes money (greed is locally good for them, they can buy food, cars, houses, and lobbies), but wrecks the world economy (globally VERY bad).

==> http://news.bbc.co.uk/2/hi/business/8625931.stm
==> http://money.cnn.com/2010/04/16/news...ldman.fortune/


(3)

Drilling for oil is financially sustainable (locally good for energy and economy), but [VERY] bad for almost everything else when disasters show up.

==> http://www.cbc.ca/world/story/2010/0...oil-spill.html
==> http://www.reuters.com/article/idUSTRE64D69K20100514





So as the title goes by: larger genomes -- not yet but coming [VERY] soon!


Thanks and cheers!

************
Mr. Sébastien M. Boisvert, first-year PhD student, http://boisvert.info/
The Ray Project Team, http://denovoassembler.sf.net/

Old 05-18-2010, 01:42 PM   #30
DeNovoG
Junior Member
 
Location: South America

Join Date: May 2010
Posts: 7
Default

Quick questions:

1) Does Ray support Illumina 1.6+ fastq sequences (the ones with trailing B's: http://seqanswers.com/forums/showthr...ght=fastq+wiki)?

2) Does Ray have the capability to trim low-quality bases, or should I pre-process my reads beforehand?

3) Should I convert my libraries to Phred/Sanger scores?

4) Last but not least, can I run Bambus with Ray's output?

Sorry for so many questions, and thank you for any information. BRGDS
Old 05-25-2010, 05:22 AM   #31
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Hi,

Question 1

Does Ray support Illumina 1.6+ fastq format (with trailing B's)?

Answer

Not explicitly, but it should work as long as the B's are at the end of the reads and not at the beginning.



Question 2

Does Ray have the capability for trimming low-quality bases or should I pre-process my reads beforehand?

Answer

Ray does not trim sequences, but random errors are not a problem.
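
For those who prefer to pre-process anyway, here is a minimal Python sketch of one possible trimming step. It assumes Illumina 1.5+ style quality strings, in which a trailing run of 'B' characters flags the low-quality end of a read. The function name is hypothetical and this is not something Ray provides.

```python
def trim_trailing_b(sequence, quality):
    """Remove the bases whose quality belongs to the trailing run of 'B's.

    Returns the trimmed (sequence, quality) pair. A read whose quality string
    is entirely 'B' comes back empty and should be discarded by the caller.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    kept = len(quality.rstrip("B"))  # number of positions before the B tail
    return sequence[:kept], quality[:kept]


seq, qual = trim_trailing_b("ACGTACGT", "hhhhhBBB")
assert (seq, qual) == ("ACGTA", "hhhhh")

# B's that are not at the end of the read are left untouched.
assert trim_trailing_b("ACGT", "BhhB") == ("ACG", "Bhh")
```

The function works on one record at a time, so it can be dropped into whatever FASTQ reader is already in use.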



Question 3

Should I convert my libraries to Phred/Sanger scores?

Answer

No: Ray does not use quality scores.



Question 4

Can I run Bambus with Ray's output?

Answer

Ray outputs a fasta file. I have never used Bambus, so the question is what input Bambus needs.


-Sebhtml
Old 06-04-2010, 03:14 PM   #32
ldong
Member
 
Location: USA

Join Date: May 2010
Posts: 15
Default OpenMPI 1.2.8?

Hi Sebastien,
I followed the instructions on the wiki page and launched Ray about 20 hours ago. The processes are still running. Is this normal? When should it finish? I believe we have enough computing power. My only concern is that we have Open MPI 1.2.8 instead of 1.3.4 or 1.4.1. Does Ray work with 1.2.8? Any suggestions would be appreciated. Thank you very much for your great work. Best, ldong

== Do-it-yourself examples ==

=== E. coli K-12 MG1655 with Illumina paired-end reads & amos output ===

* Reads
ftp://ftp.ncbi.nlm.nih.gov/sra/stati...665_1.fastq.gz
ftp://ftp.ncbi.nlm.nih.gov/sra/stati...665_2.fastq.gz
ftp://ftp.ncbi.nlm.nih.gov/sra/stati...666_1.fastq.gz
ftp://ftp.ncbi.nlm.nih.gov/sra/stati...666_2.fastq.gz

* Create a file named Commands.Ray (for example with vi) with the following content
LoadPairedEndReads SRR001665_1.fastq SRR001665_2.fastq 215 20
LoadPairedEndReads SRR001666_1.fastq SRR001666_2.fastq 215 20
OutputAmosFile

* Command line
mpirun -np 32 Ray Commands.Ray
Old 06-07-2010, 07:28 AM   #33
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default Re: :D

Hi,

@ldong

What is the interconnect? (InfiniBand is OK.)

Which step does Ray reach before stalling?

Do dots keep being printed?

If the answer is no, then it might be the spin-lock bug in Open MPI:

https://svn.open-mpi.org/trac/ompi/ticket/2043 (should be fixed in milestone 1.4.3)


Anyway, 1.2.8's very old. Current release's 1.4.2!

I also need to optimize communication for "Computing seeds" and later steps in Ray.


Recommendation: upgrade to 1.4.1 or 1.4.2


Cordially,

Sébastien
Old 06-07-2010, 07:33 AM   #34
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

@ldong What is your 'computing power'?
Old 06-08-2010, 08:36 AM   #35
ldong
Member
 
Location: USA

Join Date: May 2010
Posts: 15
Default

Hi, Seb,

Thank you very much for your suggestions. I am not sure whether we can upgrade Open MPI; I will check with our system administrator.

We have a few nodes with 16 CPUs and 64 GB of memory each that I can use to test Ray. Here is what I found:

If I run Ray on two nodes with 32 processes, there is always one process on the first node of the host list that slowly reaches 25 GB of memory and then gets killed by the system. The other processes never reach 1 GB.

It seems like 25 GB is a system limit. I will ask our administrator. What do you think? Best, ldong
Old 09-16-2010, 01:56 AM   #36
talioto
Junior Member
 
Location: Barcelona, Spain

Join Date: Feb 2010
Posts: 4
Default hang? during "Extending seeds"

I compiled Ray with Open MPI 1.4.2 and gcc version 4.1.2 20080704 (Red Hat 4.1.2-44) on the x86_64 architecture, and ran it with "mpirun -mca btl ^sm". The data are 3 simulated Illumina libraries comprising 52x coverage of a 225MB chromosome: 40x of 500bp paired-end 95nt reads (inward facing), 8x of 5kb mate-paired 36nt reads (outward facing), and 4x of 10kb mate-paired 36nt reads (outward facing).

Using 128 cores (16 8-core nodes), it runs fine up until the "Extending seeds" step. After a while the printing of the dots seems to slow down to glacial speeds. I've let it sit for several days with no progress. Do you think this is an Open MPI problem? Any ideas on getting around it?
Old 10-11-2010, 07:28 AM   #37
baihezimu
Junior Member
 
Location: us

Join Date: Sep 2010
Posts: 2
Default

Can Ray handle paired-end reads that are stored in a single file?
Old 10-20-2010, 07:55 AM   #38
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default Replies

@talioto What is the interconnect?

@baihezimu No.
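
A possible workaround is to split the single file into two files before running Ray. The Python sketch below assumes an interleaved FASTQ file in which every record is exactly 4 lines long and the two mates of a pair follow each other. The function name and the file names in the usage note are hypothetical.

```python
from itertools import islice


def split_interleaved(lines):
    """Split interleaved 4-line FASTQ records into two lists of lines.

    Records 1, 3, 5, ... go to the first mate list and records 2, 4, 6, ...
    go to the second, so the two outputs stay in the same pair order.
    """
    mate1, mate2 = [], []
    iterator = iter(lines)
    record_count = 0
    while True:
        record = list(islice(iterator, 4))  # next 4-line FASTQ record
        if not record:
            break
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        target = mate1 if record_count % 2 == 0 else mate2
        target.extend(record)
        record_count += 1
    if record_count % 2 != 0:
        raise ValueError("odd number of records; mates are not paired")
    return mate1, mate2


interleaved = [
    "@pair1/1\n", "ACGT\n", "+\n", "hhhh\n",
    "@pair1/2\n", "TTGG\n", "+\n", "hhhh\n",
]
first, second = split_interleaved(interleaved)
assert first[0] == "@pair1/1\n" and second[0] == "@pair1/2\n"
```

To use it on real data, pass an open file (for example `open("interleaved.fastq")`) as `lines` and write each returned list to its own file with `writelines`. The two resulting files can then be given to LoadPairedEndReads as in the do-it-yourself example earlier in the thread.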
Old 10-20-2010, 07:55 AM   #39
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default Ray paper is finally available

Sébastien Boisvert, François Laviolette, Jacques Corbeil.
Ray: Simultaneous Assembly of Reads from a Mix of High-Throughput Sequencing Technologies
Journal of Computational Biology
Not yet available in print (ahead of print).
doi:10.1089/cmb.2009.0238

Last edited by seb567; 10-20-2010 at 07:56 AM. Reason: removed linebreak.
Old 10-27-2010, 08:49 PM   #40
omaha420
Junior Member
 
Location: North America

Join Date: Oct 2010
Posts: 1
Default Congrats on e-publication

Congratulations on Ray's epub!

I learned of your work about 2 weeks ago, familiarized myself with the documentation you've provided, and successfully completed the E. coli assembly with sample data.

Now that the publication is out, could you please provide some more details about your processing of human chromosome 1 (under the limitations section of the project website)?

Specifically, could you provide the output text for that run so that I could better ascertain:
a. the run time for that assembly on your hardware
b. the version of the Open-MPI Library used in that assembly

I'd like to use Ray for assembly of sequencing reads for eukaryotes... and would like to know what, if any, potential problems to anticipate.

As Open-MPI 1.5 has now been released, I'd like to know if the shared memory problem is still a concern when performing analyses of larger datasets. I believe that this has been fixed in versions > 1.4.1, but would like to know for certain if it is a problem with Ray before spending hours of analysis time on shared hardware.

Thanks for both your work and time in addressing these questions-- your efforts are very much appreciated.
Tags
assembler, genome, illumina, mix
