SEQanswers

Old 11-03-2010, 07:11 PM   #41
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default Response to 'Congrats on e-publication'

Quote:
Congratulations on Ray's epub!
Thanks !

Quote:
I learned of your work about 2 weeks ago, familiarized myself with the documentation you've provided, and successfully completed the E. coli assembly with sample data.
Now that is reproducible research !

Quote:
As the publication is now complete, could you please provide some more details with respect to your processing of Human Chromosome 1 (under the limitations section of the project website).
Well, I used Open-MPI 1.3.4 with shared memory disabled, and Ray 0.0.7.

I simulated reads of length 50 at a depth of 50 for human chromosome 1
(the largest). To do so, I used the simtools provided with Ray; to get
them, type 'make simtools'.

The wiki is misleading on this, because I actually used an MPI-enabled,
Infiniband-connected computer. I'll correct that shortly. To be precise, 384
cores were used.

The Sun Grid Engine script follows.

PHP Code:
[12@colosse2 0.0.7-run]$ cat Human-chr1-ompi-1.3.4-gcc.sh
#!/bin/bash
#$ -N Ray
#$ -P nne-790-aa
#$ -l h_rt=24:00:00
#$ -pe mpi 384
module load compilers/gcc/4.4.2 mpi/openmpi/1.3.4_gcc
/software/MPI/openmpi-1.3.4_gcc/bin/mpirun /home/12/Ray/tags/0.0.7/Ray /home/12/nne-790-aa/colosse.clumeq.ca/qsub/Ray-input.txt 
If you are wondering why Open-MPI 1.3.4: all the other versions installed on
that computer have shared memory enabled, and Open-MPI 1.4.3 is not yet
available to its users.

The content of the command file:

PHP Code:
[12@colosse2 0.0.7-run]$ cat Ray-input.txt 
LoadSingleEndReads 
/home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta 
Quote:
Specifically, could you provide the output text for that run so that I could better ascertain:
a. the run time for that assembly on your hardware
b. the version of the Open-MPI Library used in that assembly

PHP Code:
[12@colosse2 0.0.7-run]$ cat Ray.o876984
**************************************************
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions
see "gpl-3.0.txt" for details.
**************************************************

Ray Copyright (C) 2010 Sébastien Boisvert, Jacques Corbeil, François Laviolette
http://denovoassembler.sf.net/

AssemblyEngine: Ray 0.0.7
NumberOfRanks: 384
MPILibrary: Open-MPI 1.3.4
OperatingSystem: Linux

LoadSingleEndReads
Sequences: /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta

Loading /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta
Distributing sequences
Counting vertices
Loading /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta
Indexing sequences
Connecting vertices
MinimumCoverage: 5
PeakCoverage: 30
Computing seeds
Extending seeds
Computing fusions
Finishing fusions
Collecting fusions

Writing Ray-Contigs.fasta
140101 contigs / 175230944 nucleotides
Elapsed time: 0 d 13 h 48 min 28 s 

Quote:
I'd like to use Ray for assembly of sequencing reads for eukaryotes... and would like to know what, if any, potential problems to anticipate.
I have not used Ray extensively on eukaryotic sequence reads myself, so I am not aware of specific pitfalls.

Quote:
As Open-MPI 1.5 has now been released, I'd like to know if the shared memory problem is still a concern when performing analyses of larger datasets. I believe that this has been fixed in versions > 1.4.1, but would like to know for certain if it is a problem with Ray before spending hours of analysis time on shared hardware.
You are better off using Open-MPI 1.4.3, as it is a super-stable release, whereas Open-MPI 1.5 is a feature release. I only have access to Open-MPI 1.3.4 with shared memory disabled and Open-MPI 1.4.1 with defaults.

I should gain access to Open-MPI 1.4.3 with defaults in the next days/weeks.

Quote:
Thanks for both your work and time in addressing these questions-- your efforts are very much appreciated.
Thank you as well for raising these questions.

Sébastien
Old 11-03-2010, 07:58 PM   #42
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default Ray 0.1.0 is out

Dear de novo assembly enthusiasts:

Following the publication and some work over the last few months, Ray 0.1.0
is now available, incorporating (some of) the requested features as well as
speed improvements (in the extension of seeds).

Here is the full list of changes, based on the NEWS file.

v. 0.1.0
2010-11-03

* Moved some code from Machine.cpp to new files. (Ticket #116)
* Improved the speed of the extension of seeds by reducing the number of messages sent. (Tickets #164 & #490)
Thanks to all the people who reported this on the list !
* Ray is now verbose ! (Ticket #167)
Feature requested by Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
University, AUSTRALIA)
* The k-mer size can now be changed. Minimum value is 15 & maximum value is 32. (Tickets #169 & #483)
Feature requested by Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
University, AUSTRALIA)
* Ray should now work on architectures requiring 8-byte address alignment, such as Itanium. (Ticket #446)
Bug reported by Jordi Camps Puchades (Centre Nacional d'Anàlisi Genòmica/CNAG)
* Added reference to the paper in stdout. (Ticket #479)
* The coverage distribution is now always written. (Ticket #480)
* The code for extracting edges is now in a separate file (Ticket #486)
* Messages for paired reads are now grouped with messages for querying sequences in the extension of seeds. (Tickets #487 & #495)
* Messages for sequence reads are now done only once, when the read is initially discovered. (Ticket #488)
* Messages with tag TAG_HAS_PAIRED_READ are grouped with messages to get sequence reads. (Ticket #491)
* Added TimePrinter to print the elapsed time at each step. (Ticket #494)
* All generated files (AMOS, Contigs, and coverage distribution) are named following the -o parameter. (Ticket #426)
Feature requested by Jordi Camps Puchades (Centre Nacional d'Anàlisi Genòmica/CNAG)
* Print an exception if requested memory exceeds CHUNK_SIZE. That should never happen. (r3690)
* Print an exception if the system runs out of memory.
* Ray informs you on the number of k-mers for a k-mer size. (r3691)
* Unique IDs of sequence reads are now unsigned 64-bit integers. (r3710)
* The code is now in code/, scripts are now in scripts/. Examples are in scripts/examples/. (r3712)
* The compilation is more verbose. (r3714)


Download it:
http://sourceforge.net/projects/deno...r.bz2/download

I will update the wiki shortly with improved running times for the E.
coli dataset as well as an in-depth simulation of paired reads on
chromosome 1 (with errors).

Thank you !
Old 11-07-2010, 11:44 PM   #43
pallo
Member
 
Location: Land of ice and snow

Join Date: Oct 2009
Posts: 10
Default

First of all: thanks for providing Ray. I am reading the paper and it sounds very promising.

I am testing version 0.1.0 (Open-MPI 1.4.2, compiled with Intel 11.1) on 8 x 8-core/48 GB nodes. The data are 12 lanes of Illumina PE reads and two 454 runs from a bird species we are sequencing. For the first 14 hours Ray output tons of messages in ray.out, but for the past 36 hours it has been quiet while still keeping a 100% load on the nodes and using about 5 GB of memory per job.

Is this silence to be expected, or is it a manifestation of the "spin-lock" bug mentioned above? Is there any way of checking that Ray is still running OK?
Cheers
Pallo

EDIT: OK, looking closer at the spin-lock bug reports, it only seems to affect GCC, so I'll try to be patient.

Last edited by pallo; 11-08-2010 at 01:54 AM.
Old 11-08-2010, 05:56 AM   #44
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Where does it hang?

I think bug 2043 was addressed in Open-MPI 1.4.3.

I don't know if ICC can produce the same problem though.




Yes, there is a way if you can log on to the worker nodes.


First, get the PID of the processes associated with Ray:

ps aux|grep Ray

Then, attach a gdb instance to one of them:

gdb attach <pid of a Ray instance>

Finally, get a backtrace in gdb:

bt

You will see which code is currently executed.



What is your interconnect?

Infiniband or Gigabit Ethernet?
Old 11-08-2010, 10:09 PM   #45
pallo
Member
 
Location: Land of ice and snow

Join Date: Oct 2009
Posts: 10
Default

Hi,

The job had to be killed for other reasons, but here are the last lines of ray.out:

Quote:
$ tail testrun/ray.out
Rank 51 stores an extension, 1354 vertices.
Rank 51 starts on a seed, length=106
Rank 43 starts on a seed, length=444
Rank 4 stores an extension, 1166 vertices.
Rank 4 starts on a seed, length=142
Rank 59 stores an extension, 152 vertices.
Rank 59 starts on a seed, length=211
Rank 6 stores an extension, 1095 vertices.
Rank 6 starts on a seed, length=89
Rank 0 starts on a seed, length=1175
The interconnect is Infiniband.

I'm rerunning the job on a bigger set of nodes; I'll post the progress.

cheers
Pallo
Old 11-09-2010, 05:01 AM   #46
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

I think I found the reason behind all the hanging.

I myself experienced the hanging with shared memory disabled, using 384 cores (Xeon).

I believe it is most likely an MPI rank being flooded with messages and unable to respond.

I am currently testing regularization of message sending in the extension of seeds, that is, enforcing a minimum number of microseconds between consecutive messages.
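
In rough terms, the regularization being tested looks like this (an illustrative sketch only, not the actual Ray code; the 50-microsecond gap is an arbitrary value chosen for the example):

Code:
#include <mpi.h>
#include <sys/time.h>

// Illustrative sketch of regularized message sending (not Ray's actual code):
// enforce a minimum gap, in microseconds, between consecutive sends from this rank.
static long getMicroseconds() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000000L + tv.tv_usec;
}

void throttledSend(void* buffer, int count, int destination, int tag) {
    static long lastSendTime = 0;
    const long minimumGap = 50; // microseconds between messages (example value)
    while (getMicroseconds() - lastSendTime < minimumGap) {
        // busy-wait until the minimum gap has elapsed
    }
    MPI_Send(buffer, count, MPI_BYTE, destination, tag, MPI_COMM_WORLD);
    lastSendTime = getMicroseconds();
}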

If that fails, Ray will simply do the extension of seeds on one MPI rank after another.

In the other steps of the algorithm (the distribution of vertices, for example), the messages sent are observed to be uniform.

With the detailed information you provided, I can safely say that running on more cores won't change anything with Ray 0.1.0 and below.

Thank you.
Old 11-19-2010, 03:26 AM   #47
PHSchi
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 12
Default memory consumption and other issues

Hi everyone!

We are trying to run Ray on an IntelMPI-based Linux (RH) cluster. Recently one of my test jobs crashed, i.e. it killed the node it was running on. As we can't access its run stats, I have some questions:
Given a Ray run started with ~ 202,136,864 Illumina PE reads (100 bp on average), what would be the expected peak memory requirement? Does anybody have estimates from their own experience?
And what does everybody see in terms of runtime for their Ray assemblies? We are testing on one node with 8 cores at the moment, as earlier tests with multiple nodes crashed and took other running procs to hell with them.
Any help with this would be highly appreciated!
Btw, anybody out there actually running Ray with IntelMPI?

Regards,

Philipp
Old 11-23-2010, 08:31 AM   #48
mrawlins
Member
 
Location: Retirement - Not working with bioinformatics anymore.

Join Date: Apr 2010
Posts: 63
Default

Ray has worked great for our work with Illumina and 454 reads, but is giving us some trouble in our tests on SOLiD data. Our test set is from the NCBI SRA (run SRR035444, submission SRA009283), which can be downloaded either from NCBI or from the SRA at EBI.
Ray 0.1.0 gets just past the coverage distribution and hangs. The -TheCoverageDistribution.tab file reads:

#Coverage NumberOfVertices
255 2
1431655765 16859178292550

which seems wrong to me.
Ray was compiled using Open-MPI 1.5 and gcc 4.5.1. I get this same error using anywhere between 7 and 48 nodes, and it doesn't seem to be a memory issue. If anybody has experienced this sort of thing and/or has a recommendation on how to fix it, that would be great.
Old 11-23-2010, 11:29 PM   #49
pallo
Member
 
Location: Land of ice and snow

Join Date: Oct 2009
Posts: 10
Default

@seb567: You are right; trying to run on a bigger set of nodes (20 x 8 cores) as well as a smaller set of larger-memory cores made no difference: the jobs hang within 24 hours of startup and then sit at 100% CPU load until killed. If I can provide any further debug info, let me know...

@PHSchi: I'm using Open-MPI 1.4.2, but for approximately 500M 100 bp paired Illumina reads, random checking on the nodes suggests that the total memory usage was 400-600 GB. That's just a finger-in-the-wind estimate, and since the jobs got stuck I can't say for sure. The target genome is estimated at around 1.3 Gbases. So a very linear guess is that you need at least half of this, i.e. 200-300 GB.

cheers
pallo
Old 11-25-2010, 06:48 AM   #50
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default Ray 1.0.0 is compliant with the MPI 2.2 standard!

Warning: long post ahead.

Statements:

1. Ray 0.1.0 and earlier were not 100% compliant with the MPI 2.2 standard. Thus, Ray sometimes hung.

2. Ray 1.0.0 is compliant with the MPI 2.2 standard.

3. Ray 1.0.0 __SHOULD__ not hang.

4. Ray 1.0.0 is released.


Now, let me answer your questions.


@seb567 (self) 11-09-2010, 06:01 AM

Quote:
I think I found the reason behind all the hanging.

I myself experienced the hanging with shared memory disabled, using 384 cores (Xeon).



Quote:
I believe it is most likely an MPI rank being flooded with messages and unable to respond.
As George Bosilca puts it:

Quote:
No message is eager if there is congestion. 64K is eager for TCP only if the kernel buffer has enough room to hold the 64k. For SM it only works if there are ready buffers. In fact, eager is an optimization of the MPI library, not something the users should be aware of, or base their application on this particular behavior.

In the MPI 2.2 standard there is a specific paragraph that advises users not to do it.

http://www.open-mpi.org/community/li...10/11/8702.php
Quote:
I am currently testing regularization of message sending in the extension of seeds, that is, enforcing a minimum number of microseconds between consecutive messages.
That failed.

Quote:
If that fails, Ray will simply do the extension of seeds on one MPI rank after another.
That was not fast, and failed with MPICH2.

The ultimate solution was to read the MPI 2.2 standard.

http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
Warning: 647 pages, very technical.

Quote:
In the other steps of the algorithm (the distribution of vertices, for example), the messages sent are observed to be uniform.
But still, MPI_Send can block !

Note that MPI_Send was replaced with MPI_Isend in Ray 1.0.0.
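
Conceptually, the change is of the following form (a minimal sketch, not the actual Ray code): the message is posted with MPI_Isend, and the request is completed later, so the rank can keep handling incoming messages in the meantime.

Code:
#include <mpi.h>

// Minimal sketch of the blocking-to-non-blocking change (not Ray's actual code).
// A blocking call such as
//     MPI_Send(buffer, count, MPI_BYTE, destination, tag, MPI_COMM_WORLD);
// may stall until the receiver makes progress. With MPI_Isend the send is
// only posted, and the rank keeps working until the request completes.

void postMessage(void* buffer, int count, int destination, int tag, MPI_Request* request) {
    MPI_Isend(buffer, count, MPI_BYTE, destination, tag, MPI_COMM_WORLD, request);
}

bool messageDelivered(MPI_Request* request) {
    int flag = 0;
    MPI_Test(request, &flag, MPI_STATUS_IGNORE);
    return flag == 1; // once true, the send buffer can be reused
}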

Quote:
With the detailed information you provided, I can safely say that running on more cores won't change anything with Ray 0.1.0 and below.

Thank you.
Ray 1.0.0 is compliant with MPI 2.2 and should not hang.



@PHSchi 11-19-2010, 04:26 AM

Quote:
Hi everyone!

We are trying to run Ray on an IntelMPI based Linux (RH) cluster. Recently one of my test jobs crashed, i.e. killed the node it was running on.
IntelMPI is based, I believe, on MPICH2. Thus, Ray 1.0.0 will work fine, but previous versions will not.

Quote:
As we can't access it's run stats I have some questions:
If you start your jobs with qsub (Oracle/Sun Grid Engine), try modifying and running qhost.py, which is readily available in scripts/ in the Ray 1.0.0 distribution. The script runs 'qhost -j -xml>dump.xml' and then parses the XML file.

Quote:
Given a Ray run is started with ~ 202,136,864 Illumina pe reads (100bp on average) what would be the expected peak memory requirement? Anybody any estimates from their own experience?
Memory usage depends mainly on the genome size and error rates.

Quote:
And what does everybody see in terms of runtime for their Ray assemblies?
End-users working with bacterial data are satisfied.

I don't know for others.

Quote:
We are testing on one node with 8 cores in the moment, as earlier tests with multiple nodes crashed and took other running procs to hell with them.
8 cores sounds low for ~202,136,864 Illumina PE reads.

Quote:
Any help with this would be highly appreciated!
I hope Ray 1.0.0 works for you!


Quote:
Btw. anybody out there actually running Ray with IntelMPI?

Regards,

Philipp
As I wrote, IntelMPI is based on MPICH2.
see http://www.mcs.anl.gov/research/proj...x.php?s=collab

With Ray 1.0.0, IntelMPI should work fine. And that should be true with g++ and icc.


@mrawlins 11-23-2010, 09:31 AM


Quote:
Ray has worked great for our work with Illumina and 454 reads,
Yes, mixing technologies compensates for 454 homopolymer errors and for the shorter Illumina read length.

Quote:
but is giving us some trouble in our tests on SOLiD data. Our test set is from the NCBI SRA (run SRR035444, submission SRA009283), which can be downloaded either from NCBI or from the SRA at EBI.
I added a ticket, but my tests with public datasets from solidsoftwaretools indicated that the error rate of this technology does not allow de novo assembly with Ray.

For instance, with k=21 you probably want the error (substitution) rate to be below 1/21; otherwise nearly every k-mer will contain an error and will thus be unique!

1 / 21 = 0.0476190476 = 4.76 %

(As a rough check, with independent substitution errors at a rate of 12 %, only about (1 - 0.12)^21 ≈ 7 % of 21-mers would be error-free.)

If I remember correctly, error rates for these datasets were above that threshold (~12 % or so, I think).

http://solidsoftwaretools.com/

Datasets are:

SOLiD™4 System E.Coli DH10B Fragment Data Set
SOLiD™ System E.Coli DH10B 50X50 Mate-Pair Data Set


Quote:
Ray 0.1.0 gets just past the coverage distribution and hangs. The -TheCoverageDistribution.tab file reads:
#Coverage NumberOfVertices
255 2
1431655765 16859178292550
I think that does not mean anything. 1431655765 is just not possible because the maximum value is 255.

Can you try again with Ray 1.0.0 and post/send me the results ?

Quote:
which seems wrong to me.
You are not alone.

Quote:
Ray was compiled using OpenMPI 1.5 and gcc 4.5.1.
You are better off with Open-MPI 1.4.3, MPICH2 1.3.1, or any other super-stable release. Open-MPI 1.5 is a beta 'feature release'.

Quote:
I get this same error using anywhere between 7 and 48 nodes, and it doesn't seem to be a memory issue.
I would bet on an error rate above 1/k. Try

mpirun -np 40 Ray -k 15 -p dataLEFT.fastq.bz2 dataRIGHT.fastq.gz

Supposing that your genome/transcriptome size is far below 1 073 741 824.

Code:
4^15 =                    1 073 741 824
4^21 =              4 398 046 511 104
4^32 = 18 446 744 073 709 551 616
Quote:
If anybody has experienced this sort of thing and/or has a recommendation on how to fix it that would be great.
Well, again, my tests on the datasets from http://solidsoftwaretools.com/ indicated that the error rate of the SOLiD technology is not friendly to de novo assembly with Ray.

Let us hope that 'Exact Call Chemistry' will fix that.

http://www3.appliedbiosystems.com/cm...cms_088755.pdf

http://www.news-medical.net/news/201...ieves-greater-


@pallo Yesterday, 12:29 AM


Quote:
@seb567: You are right; trying to run on a bigger set of nodes (20 x 8 cores) as well as a smaller set of larger-memory cores made no difference: the jobs hang within 24 hours of startup and then sit at 100% CPU load until killed. If I can provide any further debug info, let me know...
Can you try with Ray 1.0.0 as it is compliant with the standard MPI 2.2 ?

I replaced MPI_Send with MPI_Isend, and I carefully added some sort of busy-waiting before sending additional messages. Note that I say 'some sort' because an MPI rank can still receive MPI messages while waiting.

Also, I removed calls to MPI_Iprobe, and I replaced them with a ring of 128 bins of MPI requests that are MPI_Recv_init'ed & MPI_Start'ed at the start of computation.

Credit for this idea goes to George Bosilca (University of Tennessee & MPI/Open-MPI researcher/scientist).

http://www.open-mpi.org/community/li...10/11/8710.php
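
A stripped-down illustration of that idea follows (a sketch, not the actual Ray code; the 4000-byte slot size simply mirrors the eager threshold mentioned below):

Code:
#include <mpi.h>

// Illustration of persistent receives replacing MPI_Iprobe (not Ray's actual code).
// A ring of 128 receive requests is initialized once, and each request is
// re-armed with MPI_Start after its message has been consumed.
const int RING_SIZE = 128;
const int SLOT_SIZE = 4000; // bytes per slot (assumed eager threshold)

MPI_Request ringRequests[RING_SIZE];
char ringBuffers[RING_SIZE][SLOT_SIZE];
int ringHead = 0;

void initializeRing() {
    for (int i = 0; i < RING_SIZE; i++) {
        MPI_Recv_init(ringBuffers[i], SLOT_SIZE, MPI_BYTE, MPI_ANY_SOURCE,
                      MPI_ANY_TAG, MPI_COMM_WORLD, &ringRequests[i]);
        MPI_Start(&ringRequests[i]);
    }
}

// Poll the slot at the head of the ring; if a message arrived, process it
// and immediately restart the same persistent request.
bool receiveOneMessage() {
    int flag = 0;
    MPI_Status status;
    MPI_Test(&ringRequests[ringHead], &flag, &status);
    if (!flag)
        return false;
    // ... dispatch ringBuffers[ringHead] according to status.MPI_TAG ...
    MPI_Start(&ringRequests[ringHead]);
    ringHead = (ringHead + 1) % RING_SIZE;
    return true;
}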


Quote:
@PHSchi: I'm using Open-MPI 1.4.2, but for approximately 500M 100 bp paired Illumina reads, random checking on the nodes suggests that the total memory usage was 400-600 GB. That's just a finger-in-the-wind estimate, and since the jobs got stuck I can't say for sure. The target genome is estimated at around 1.3 Gbases.
Given the genome size and the presence of errors, I must agree with your estimate.

If an MPI rank provides you with 3 gigabytes of memory, then you need around 200 MPI ranks.

Code:
600 / 3 = 200
Contrary to ABySS, which uses google-sparsehash to store data on disk (at least that was true the last time I checked), Ray stores everything in memory.

http://code.google.com/p/google-sparsehash/

Quote:
So a very linear guess is that you need at least half of this, i.e. 200-300 GB.

cheers
pallo

For sure, you can't buy that kind of hardware if you work in a laboratory.

However, in the United States of America, the National Center for Computational Sciences provides resources to scientists.

http://www.nccs.gov/


In Canada, Compute Canada/Calcul Canada (we speak French and English!) provides compute resources to scientists.

https://computecanada.org/

Acknowledgment for Ray 1.0.0

Élénie Godzaridis (Institut de biologie intégrative et des systèmes de l'Université Laval) for suggesting using End of transmission to pack sequences & suggesting using enum for constants.

George Bosilca (University of Tennessee) for MPI_Recv_init/MPI_Start and for pointing out that MPI_Send can block even below the eager threshold.

Jeff Squyres (Cisco) for pointing out that MPI_Send to self is not safe and that MPI_Request_free on an active request is evil.

Eugene Loh (Oracle) for the correct eager threshold (4000 bytes, not 4096 bytes).

René Paradis (Centre de recherche du CHUL) for giving me a good old Sun
Blade 100 (SPARC V9, TI UltraSparc IIe "Hummingbird") & for maintaining my testing boxes.

Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting that Ray should load interleaved files and GZIP-compressed files.

Frédéric Lefebvre (CLUMEQ - Université Laval) for installing software on the mighty colosse. http://www.top500.org/system/10195

The Canadian Institutes of Health Research for my scholarship.


ChangeLog for Ray 1.0.0

v. 1.0.0

r4038 | 2010-11-25

* Made a lot of changes to make Ray compliant with the MPI 2.2 standard.
* Added master and slave modes.
* Added an array of master methods (pointers): selecting the master method
with the master mode is done in O(1).
* Added an array of slave methods (pointers): selecting the slave method
with the slave mode is done in O(1).
* Added an array of message handlers (pointers): selecting the message handler method
with the message tag is done in O(1).
* Replaced MPI_Send by MPI_Isend. Thanks to Open-MPI developers for their
support and explanations on the eagerness of Open-MPI: George Bosilca (University of Tennessee), Jeff Squyres (Cisco), Eugene Loh (Oracle)
* Moved some code for the extension of seeds.
* Grouped messages for library updates.
* Added support for paired-end interleaved sequencing reads (-i option)
Thanks to Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting the feature !
* Moved detectDistances & updateDistances in their own C++ file.
* Updated the Wiki.
* Decided that the next release was 1.0.0.
* Added support for .fasta.gz and .fastq.gz files, using libz (GZIP).
Thanks to Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting the feature !
* Tested with k=17: it uses less memory, but is less precise.
* Fixed a memory allocation bug when the code runs on 512 cores and more.
* Added configure script using automake & autoconf.
Note that if that fails, read the INSTALL file !
* Moved the code that loads fasta files in FastaLoader.
* Moved the code that loads fastq files in FastqLoader.
* Regulated the communication in the MPI 'tribe'.
* Added an assertion to verify the message buffer length before sending it.
* Modified bits so that if a message is more than 4096 bytes, split it in
chunks.
* Used a sentinel to remove two messages, coupled with TAG_REQUEST_READS.
* Stress-tested with MPICH2.
* Implemented a ring allocator for inboxes and outboxes.
* Changed flushing so that all use <flush> & <flushAll> in BufferedData.
* Changed the maximum message size from 4096 to 4000 to send messages eagerly
more often (if it happens). Thanks to Open-MPI developers for their support and explanations on the eagerness of Open-MPI: Eugene Loh (Oracle), George Bosilca (University of Tennessee), Jeff Squyres (Cisco).
* Changed the way sequencing reads are indexed: before, the master was
reloading (again!) the files to do so; now no files are loaded and every MPI rank participates in the task.
* Modified the way sequences are distributed. These are now appended to fill the buffer, and
the sentinel called 'End of transmission' is used. Thanks to Élénie Godzaridis for pointing out that '\0' is not a valid sentinel for strings !
* Optimized the flushing in BufferedData: flush is now destination-specific.
O(1) instead of O(n) where n is the number of MPI ranks.
* Optimized the extension: paired information is appended in the buffer in
which the sequence itself is.
* Added support for .fasta.bz2 & .fastq.bz2. This needs LIBBZ2 (-lbz2)
* Added instructions in the INSTALL file for manually compiling the source in
case the configure script gets tricky (cat INSTALL).
* Added a received-messages file. This is pretty useless unless you want to
see whether the received messages are uniform!
* Added bits to write the fragment length distribution of each library.
* Changed the definition of MPI tags: they are now defined with a enum.
Thanks to Élénie Godzaridis for the suggestion.
* Changed the definition of slave modes: they are now defined with a enum.
Thanks to Élénie Godzaridis for the suggestion.
* Changed the definition of master modes: they are now defined with a enum.
Thanks to Élénie Godzaridis for the suggestion.
* Optimized finishFusions: complexity changed from O(N*M) to O(N log M).
* Designed a beautiful logo with Inkscape.
* Added a script for regression tests.
* Changed bits so that a paired read is not updated if it does not need it
* Changed the meaning of the -o parameter: it is now a prefix.
* Added examples with MPICH2, Open-MPI, and Open-MPI/SunGridEngine.
* Changed DEBUG for ASSERT as it activates assertions.
* Updated the citation in the standard output.
* Corrected the interleave-fastq python script.
* Changed the license file from LICENSE to COPYING.
* Removed the trimming of reads if they are not read from a file.
* Increased the verbosity of the extension step.
* Added gnuplot scripts.
* Changed the file name for changes: from NEWS to ChangeLog.
* Optimized the MPI layer: replaced MPI_Iprobe by MPI_Recv_init+MPI_Start.
see MessagesHandler.cpp! (Thanks to George Bosilca (University of Tennessee) for the suggestion!)
* Compiled and tested on architecture SPARC V9 (sparc64).
* Compiled and tested on architecture Intel Itanium (ia64).
* Compiled and tested on architecture Intel64 (x86_64).
* Compiled and tested on architecture AMD64 (x86_64).
* Compiled and tested on Intel architecture (x86/ia32).
* Evaluated regression tests.
Old 11-25-2010, 03:58 PM   #51
jfpombert
Junior Member
 
Location: Vancouver

Join Date: Nov 2010
Posts: 5
Question Ray 1.0.0 doesn't load reads

Hi Seb,

I compiled Ray version 1.0.0 today (Open-MPI 1.4.3, gcc 4.5.1, Fedora 14), and when I run the new executable it stops at loading the first of the paired-end Solexa reads (Rank 0 loads nameoffile) and exits. When I use the previous Ray version (0.1.0) with the same command line on the same dataset, it runs fine. I tried compiling it twice but got the same result.

Jean-Francois Pombert

2x Xeon E5506
96G RAM
Intel Server Board S5520HC
Linux kernel 2.6.35.6-48
Old 11-25-2010, 04:39 PM   #52
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Ray 1.0.0 doesn't load reads
Hi Seb,

I compiled Ray version 1.0.0 today (Open-MPI 1.4.3, gcc 4.5.1, Fedora 14), and when I run the new executable it stops at loading the first of the paired-end Solexa reads (Rank 0 loads nameoffile) and exits. When I use the previous Ray version (0.1.0) with the same command line on the same dataset, it runs fine. I tried compiling it twice but got the same result.

Jean-Francois Pombert

2x Xeon E5506
96G RAM
Intel Server Board S5520HC
Linux kernel 2.6.35.6-48
Can you provide more details (by email if you wish) ?

The module for loading sequences from files has not changed much, but the distribution of sequences has.

However, I have not seen that glitch.
Old 11-25-2010, 05:04 PM   #53
jfpombert
Junior Member
 
Location: Vancouver

Join Date: Nov 2010
Posts: 5
Default

Here is the console log. I'll look again at the compilation; I might have goofed somehow.

Thx

JF

****************************************
[David@bigdaddy Ray]$ mpirun -np 8 Ray -p 100420_s_7_1_seq_GKD-1.txt 100420_s_7_2_seq_GKD-1.txt -s FQH37LX05.sff -s FQH37LX06.sff -s FTX7HMM01.sff -s FU6LJ3H01.sff -s FWZEL0L06.sff -o test.txt
Bienvenue !

Rank 0: Ray 1.0.0
Rank 0: compiled with Open-MPI 1.4.3
Rank 0 reports the elapsed time, Thu Nov 25 17:48:37 2010 ---> Step: Beginning of computation
Elapsed time: 1 seconds
Since beginning: 1 seconds


**************************************************
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see "COPYING" for details.
**************************************************

Ray Copyright (C) 2010 Sébastien Boisvert, Jacques Corbeil, François Laviolette
Centre de recherche en infectiologie de l'Université Laval
Project funded by the Canadian Institutes of Health Research (Doctoral award 200902CGM-204212-172830 to S.B.)
http://denovoassembler.sf.net/

Reference to cite:

Sébastien Boisvert, François Laviolette & Jacques Corbeil.
Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies.
Journal of Computational Biology (Mary Ann Liebert, Inc. publishers, New York, U.S.A.).
November 2010, Volume 17, Issue 11, Pages 1519-1533.
doi:10.1089/cmb.2009.0238
http://dx.doi.org/doi:10.1089/cmb.2009.0238

Rank 0 welcomes you to the MPI_COMM_WORLD
Rank 0 is running as UNIX process 18016 on bigdaddy
Rank 2 is running as UNIX process 18018 on bigdaddy
Rank 3 is running as UNIX process 18019 on bigdaddy
Rank 5 is running as UNIX process 18021 on bigdaddy
Rank 1 is running as UNIX process 18017 on bigdaddy
Rank 4 is running as UNIX process 18020 on bigdaddy
Rank 7 is running as UNIX process 18023 on bigdaddy
Rank 0: I am the master among 8 ranks in the MPI_COMM_WORLD.

Ray command:

Ray \
-p \
100420_s_7_1_seq_GKD-1.txt \
100420_s_7_2_seq_GKD-1.txt \
-s \
FQH37LX05.sff \
-s \
FQH37LX06.sff \
-s \
FTX7HMM01.sff \
-s \
FU6LJ3H01.sff \
-s \
FWZEL0L06.sff \
-o \
test.txt

-p (paired-end sequences)
Left sequences: 100420_s_7_1_seq_GKD-1.txt
Right sequences: 100420_s_7_2_seq_GKD-1.txt
Average length: auto
Standard deviation: auto

-s (single sequences)
Sequences: FQH37LX05.sff

-s (single sequences)
Sequences: FQH37LX06.sff

-s (single sequences)
Sequences: FTX7HMM01.sff

-s (single sequences)
Sequences: FU6LJ3H01.sff

-s (single sequences)
Sequences: FWZEL0L06.sff

k-mer size: 21
--> Number of k-mers of size 21: 4398046511104
*** Note: A lower k-mer size bounds the memory usage. ***

Rank 0 is loading 100420_s_7_1_seq_GKD-1.txt
Rank 6 is running as UNIX process 18022 on bigdaddy
[David@bigdaddy Ray]$
********************************************************
Old 11-25-2010, 05:34 PM   #54
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Dear Jean-Francois Pombert,


Thank you for your timely answer.

In Ray 0.1.0 and earlier, FASTA and FASTQ files were detected using the first line of the file.

In Ray 1.0.0, I use only the file extension to select the appropriate loader.

Quote:
Ray \
-p \
100420_s_7_1_seq_GKD-1.txt \
100420_s_7_2_seq_GKD-1.txt \
-s \
FQH37LX05.sff \
-s \
FQH37LX06.sff \
-s \
FTX7HMM01.sff \
-s \
FU6LJ3H01.sff \
-s \
FWZEL0L06.sff \
-o \
test.txt
So, Ray does not know what to do with .txt files and just stops.

Quote:
Usage:

Supported sequences file format:
.fasta
.fasta.gz
.fasta.bz2
.fastq
.fastq.gz
.fastq.bz2
.sff (paired reads must be extracted manually)


Parameters:

Single-end reads
-s <sequencesFile>

Paired-end reads:
-p <leftSequencesFile> <rightSequencesFile> [ <fragmentLength> <standardDeviation> ]

Paired-end reads:
-i <interleavedFile> [ <fragmentLength> <standardDeviation> ]

Output (default: Ray-Contigs.fasta)
-o <outputFile>

AMOS output
-a

k-mer size (default: 21)
-k <kmerSize>


I will add a specific message to warn the user about unsupported extensions.
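
For illustration, the selection amounts to something like this (a sketch, not the actual Ray source; FastaLoader and FastqLoader are the loader names from the Ray 1.0.0 ChangeLog, and the error message shown here is hypothetical):

Code:
#include <string>
#include <iostream>

// Illustrative sketch of extension-based loader selection (not the actual Ray source).
bool hasSuffix(const std::string& fileName, const std::string& suffix) {
    return fileName.size() >= suffix.size() &&
           fileName.compare(fileName.size() - suffix.size(), suffix.size(), suffix) == 0;
}

void loadSequences(const std::string& fileName) {
    if (hasSuffix(fileName, ".fasta") || hasSuffix(fileName, ".fasta.gz") || hasSuffix(fileName, ".fasta.bz2")) {
        // a FastaLoader would handle the file here
    } else if (hasSuffix(fileName, ".fastq") || hasSuffix(fileName, ".fastq.gz") || hasSuffix(fileName, ".fastq.bz2")) {
        // a FastqLoader would handle the file here
    } else if (hasSuffix(fileName, ".sff")) {
        // an SFF loader would handle the file here
    } else {
        // hypothetical warning for unsupported extensions such as .txt
        std::cout << "Error: unsupported file extension for " << fileName
                  << " (supported: .fasta[.gz|.bz2], .fastq[.gz|.bz2], .sff)" << std::endl;
    }
}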

Thank you for your interest in Ray !
Old 11-25-2010, 05:44 PM   #55
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

@jfpombert

I forgot to provide a fix.

quick fix:

Quote:
ln -s 100420_s_7_1_seq_GKD-1.txt 100420_s_7_1_seq_GKD-1.txt.fastq

ln -s 100420_s_7_2_seq_GKD-1.txt 100420_s_7_2_seq_GKD-1.txt.fastq

mpirun -np 8 \
Ray \
-p \
100420_s_7_1_seq_GKD-1.txt.fastq \
100420_s_7_2_seq_GKD-1.txt.fastq \
-s \
FQH37LX05.sff \
-s \
FQH37LX06.sff \
-s \
FTX7HMM01.sff \
-s \
FU6LJ3H01.sff \
-s \
FWZEL0L06.sff \
-o \
test.txt
or using bzip2, you will save precious space:

Quote:
bzip2 < 100420_s_7_1_seq_GKD-1.txt > 100420_s_7_1_seq_GKD-1.txt.fastq.bz2

bzip2 < 100420_s_7_2_seq_GKD-1.txt > 100420_s_7_2_seq_GKD-1.txt.fastq.bz2

mpirun -np 8 \
Ray \
-p \
100420_s_7_1_seq_GKD-1.txt.fastq.bz2 \
100420_s_7_2_seq_GKD-1.txt.fastq.bz2 \
-s \
FQH37LX05.sff \
-s \
FQH37LX06.sff \
-s \
FTX7HMM01.sff \
-s \
FU6LJ3H01.sff \
-s \
FWZEL0L06.sff \
-o \
test.txt

Thank you for providing a detailed report of what you did.
Old 11-25-2010, 05:48 PM   #56
jfpombert
Junior Member
 
Location: Vancouver

Join Date: Nov 2010
Posts: 5
Default

OK, great, I'll just change the extensions.

A big thank you!

JF
Old 11-29-2010, 09:30 AM   #57
caddymob
Member
 
Location: USA

Join Date: Apr 2009
Posts: 36
Question processes aborted

Really excited to try out Ray! I first tried to grab the example datasets, but the links are dead; I am getting a 550 error, no such file.

So then I went ahead and tried to assemble a very homozygous (>96%) mammalian genome sequenced on Illumina with paired 105 bp reads. Ray is failing and I do not understand why.

Ray ran for just over 2 hours on 256 cores before dying. Here are my commands:

Code:
use intel-openmpi-1.4.2
use Ray-0.1.0

mpirun -np 256 Ray -p $wd\Lunde_1.fq $wd\Lunde_2.fq -o Lunde-contigs
And the output I get:

Code:
Rank 0 welcomes you to the MPI_COMM_WORLD.
Rank 0: website -> http://denovoassembler.sf.net/
Rank 0: using Open-MPI 1.2.7
Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)
.
.
.
Rank 243 is running as UNIX process 4231 on s26-4.local (MPI version 2.0)
Rank 0: I am the master among 256 ranks in the MPI_COMM_WORLD.

Rank 0: Ray 0.1.0 is running
Rank 0: operating system is Linux (during compilation)

LoadPairedEndReads
 Left sequences: /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_1.fq
 Right sequences: /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_2.fq
 Average length: auto
 Standard deviation: auto

k-mer size: 21
 --> Number of k-mers of size 21: 4398046511104
  *** Note: A lower k-mer size bounds the memory usage. ***


Rank 0 loads /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_1.fq.
Rank 0 has 140174250 sequences to distribute.
Rank 0 distributes sequences, 1/140174250
mpirun noticed that job rank 1 with PID 4194 on node s28-2 exited on signal 15 (Terminated). 
254 additional processes aborted (not shown)
1 process killed (possibly by Open MPI)Rank 0 welcomes you to the MPI_COMM_WORLD.
Rank 0: website -> http://denovoassembler.sf.net/
Rank 0: using Open-MPI 1.2.7
Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)
Is there something wrong with my configuration?
Old 11-29-2010, 09:49 AM   #58
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Really excited to try out Ray! I first tried to grab the example datasets, but the links are dead.. getting a 550 error, no such file..
NCBI moved their infrastructure from .fastq to .sra files.

My favorite toy dataset is SRA001125, Illumina data of E. coli K-12 MG1655.

Search SRA001125 and you'll find it.

Quote:
So then I went ahead and tried to assemble a very homozygous (>96%) mammalian genome sequenced on Illumina with paired 105 bp reads. Ray is failing and I do not understand why.
Many reasons can explain that.



Ray ran for just over 2 hours on 256 cores before dying. Here are my commands:


Quote:

use intel-openmpi-1.4.2
use Ray-0.1.0
You are using Ray 0.1.0! Try Ray 1.0.0; I assure you it includes many fixes.

v. 1.0.0 is the release with the most changes to date.

http://sourceforge.net/apps/mediawik...geLog#v._1.0.0

Quote:
mpirun -np 256 Ray -p $wd\Lunde_1.fq $wd\Lunde_2.fq -o Lunde-contigs
Do you have access to an SMP machine with 256 processor cores?!

If so, I envy you.

Quote:
And the output I get:

Code:

Rank 0 welcomes you to the MPI_COMM_WORLD.
Rank 0: website -> http://denovoassembler.sf.net/
Rank 0: using Open-MPI 1.2.7
So basically, you are using a bad mix of software: intel-openmpi-1.4.2 at run time with a Ray executable compiled against Open-MPI 1.2.7.

This will surely fail !


Quote:
Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)
The latest standard is MPI 2.2, from 2009. MPICH2 and Open-MPI 1.4.3 comply with MPI 2.2.

Ray works with MPI 2.0 too, I guess.
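
If you want to check which MPI standard your library reports at run time, MPI_Get_version is a quick way to do it (a small standalone program, not part of Ray):

Code:
#include <mpi.h>
#include <cstdio>

// Standalone check of the MPI standard version reported by the library.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int version = 0;
    int subversion = 0;
    MPI_Get_version(&version, &subversion);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("This MPI library implements MPI %d.%d\n", version, subversion);
    MPI_Finalize();
    return 0;
}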

Quote:
Rank 0: Ray 0.1.0 is running
As I said, 0.1.0 is defunct. Embrace the new 1.0.0.

The next release is coming soon.

Ray for large genomes is on its way !

My last test on human chromosome 1 (the largest) with one library of
length 200 and another of length 400 shows great success:


Rank 0: 69173 contigs/205904915 nucleotides

Rank 0 reports the elapsed time, Sun Nov 28 20:38:22 2010
---> Step: Collection of fusions
Elapsed time: 1 minutes, 16 seconds
Since beginning: 8 hours, 22 minutes, 4 seconds

Elapsed time for each step, Sun Nov 28 20:38:22 2010

Beginning of computation: 3 seconds
Distribution of sequence reads: 25 minutes, 3 seconds
Distribution of vertices: 1 minutes, 16 seconds
Calculation of coverage distribution: 1 seconds
Distribution of edges: 1 minutes, 30 seconds
Indexing of sequence reads: 2 seconds
Computation of seeds: 10 minutes, 39 seconds
Computation of library sizes: 4 minutes, 51 seconds
Extension of seeds: 7 hours, 33 minutes, 36 seconds
Computation of fusions: 3 minutes, 47 seconds
Collection of fusions: 1 minutes, 16 seconds
Completion of the assembly: 8 hours, 22 minutes, 4 seconds

Rank 0 wrote r4068-human.CoverageDistribution.txt
Rank 0 wrote r4068-human.Library0.txt
Rank 0 wrote r4068-human.Library1.txt
Rank 0 wrote r4068-human.fasta
Rank 0 wrote r4068-human.ReceivedMessages.txt


Quote:
Is there something wrong with my configuration?
Your configuration is erroneous in two independent ways.

1. You are using Ray 0.1.0, not Ray 1.0.0.

2. You are running an executable compiled against Open-MPI 1.2.7 under, I believe, Open-MPI 1.4.2.


Thank you for your interest in Ray !

"The Ray of light is coming to life, and the Ray of darkness is fading away."

-Seb
Old 11-29-2010, 10:31 AM   #59
caddymob
Member
 
Location: USA

Join Date: Apr 2009
Posts: 36
Default

Many thanks seb567!

I had to have my IT department compile and install Ray on our cluster and did not notice that they had used the old version of Ray. Thanks for pointing this out. I have requested that version 1.0.0 be installed, and asked about the MPI version available.

Any idea when the new version will be available for mammalian genomes? Looking forward to it!

I will keep you posted on my progress once I get the new version up and running. Thanks again!
Old 11-29-2010, 10:42 AM   #60
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
I had to have my IT department compile and install Ray on our cluster and did not notice that they had used the old version of Ray. Thanks for pointing this out. I have requested that version 1.0.0 be installed, and asked about the MPI version available.
I think you are fine with Open-MPI 1.4.2 compiled with Intel Compiler (use intel-openmpi-1.4.2).


Quote:
Any idea when the new version will be available for mammalian genomes? Looking forward to it!
Before Friday, for sure.

Quote:
I will keep you posted on my progress once I get the new version up and running.
Thank you for your updates !