
  • #46
    I think I found the reason behind all the hanging.

    I myself experienced the hanging with shared memory disabled, using 384 cores (Xeon).

    It is more likely to be an MPI rank being flooded by messages and being unable to respond than anything else, I believe.

    I am currently testing regularization of message sending in the extension of seeds, that is, enforcing a minimum number of microseconds between messages.

    If that fails, Ray will simply do the extension of seeds on one MPI rank after another.

    In the other steps of the algorithm (distribution of vertices, for example), the messages sent are uniformly observed.

    With the detailed information you provided, I can safely say that running on more cores won't change anything with Ray 0.1.0 and below.

    Thank you.

    Comment


    • #47
      memory consumption and other issues

      Hi everyone!

      We are trying to run Ray on an IntelMPI-based Linux (RH) cluster. Recently one of my test jobs crashed, i.e. killed the node it was running on. As we can't access its run stats, I have some questions:
      Given a Ray run is started with ~ 202,136,864 Illumina pe reads (100bp on average) what would be the expected peak memory requirement? Anybody any estimates from their own experience?
      And what does everybody see in terms of runtime for their Ray assemblies? We are testing on one node with 8 cores at the moment, as earlier tests with multiple nodes crashed and took other running procs to hell with them.
      Any help with this would be highly appreciated!
      Btw. anybody out there actually running Ray with IntelMPI?

      Regards,

      Philipp

      Comment


      • #48
        Ray has worked great for our work with Illumina and 454 reads, but is giving us some trouble in our tests on SOLiD data. Our test set is from the NCBI SRA (run SRR035444, submission SRA009283), which can be downloaded either from NCBI or from the SRA at EBI.
        Ray 0.1.0 gets just past the coverage distribution and hangs. The -TheCoverageDistribution.tab file reads:

        #Coverage NumberOfVertices
        255 2
        1431655765 16859178292550

        which seems wrong to me.
        Ray was compiled using OpenMPI 1.5 and gcc 4.5.1. I get this same error using anywhere between 7 and 48 nodes, and it doesn't seem to be a memory issue. If anybody has experienced this sort of thing and/or has a recommendation on how to fix it that would be great.

        Comment


        • #49
          @seb567: You are right, trying to run on a bigger set of nodes (20x8cores) as well as a smaller set of larger memory cores made no difference - the jobs hang within 24hrs of startup and do so using 100% cpu load until killed. If I can provide any further debug info, let me know...

          @PHSchi: I'm using Open-MPI 1.4.2. For approx. 500M 100bp paired Illumina reads, random checking on nodes suggests that the total memory usage was 400-600GB, but that's just a finger-in-the-wind estimate, and the jobs got stuck so I can't say for sure. The target genome is estimated at around 1.3Gbases. So a very linear guess is that you need at least half of this, that is 200-300GB.

          cheers
          pallo

          Comment


          • #50
            Ray 1.0.0 is compliant with the standard MPI 2.2 !

            Warning: long post ahead.

            Statements:

            1. Ray 0.1.0 and before were not 100 % compatible with the standard MPI 2.2. Thus, Ray sometimes hung.

            2. Ray 1.0.0 is compliant with the standard MPI 2.2.

            3. Ray 1.0.0 __SHOULD__ not hang.

            4. Ray 1.0.0 is released.


            Now, let me answer your questions.


            @seb567 (self) 11-09-2010, 06:01 AM

            I think I found the reason behind all the hanging.

            I myself experienced the hanging with shared memory disabled, using 384 cores (Xeon).



            It is more likely to be an MPI rank being flooded by messages and being unable to respond than anything else, I believe.
            As George Bosilca puts it:

            No message is eager if there is congestion. 64K is eager for TCP only if the kernel buffer has enough room to hold the 64k. For SM it only works if there are ready buffers. In fact, eager is an optimization of the MPI library, not something the users should be aware of, or base their application on this particular behavior.

            The MPI 2.2 standard has a specific paragraph that advises users not to rely on this behavior.

            I am currently testing regularization of message sending in the extension of seeds, that is, enforcing a minimum number of microseconds between messages.
            That failed.

            If that fails, Ray will simply do the extension of seeds on one MPI rank after another.
            That was not fast, and failed with MPICH2.

            The ultimate solution was to read the standard MPI 2.2.


            Warning: 647 pages, very technical.

            In the other steps of the algorithm (distribution of vertices, for example), the messages sent are uniformly observed.
            But still, MPI_Send can block !

            Note that MPI_Send was replaced with MPI_Isend in Ray 1.0.0.

            With the detailed information you provided, I can safely say that running on more cores won't change anything with Ray 0.1.0 and below.

            Thank you.
            Ray 1.0.0 is compliant with MPI 2.2 and should not hang.



            @PHSchi 11-19-2010, 04:26 AM

            Hi everyone!

            We are trying to run Ray on an IntelMPI based Linux (RH) cluster. Recently one of my test jobs crashed, i.e. killed the node it was running on.
            IntelMPI is based, I believe, on MPICH2. Thus, Ray 1.0.0 will work fine, but not previous versions.

            As we can't access its run stats, I have some questions:
            If you start your jobs with qsub (Oracle/Sun Grid Engine), try to modify and run qhost.py, which is readily available in scripts/ from the Ray 1.0.0 distribution. The script runs 'qhost -j -xml>dump.xml' and then parses the XML file.

            Given a Ray run is started with ~ 202,136,864 Illumina pe reads (100bp on average) what would be the expected peak memory requirement? Anybody any estimates from their own experience?
            Memory usage depends mainly on the genome size and error rates.

            And what does everybody see in terms of runtime for their Ray assemblies?
            End-users working with bacterial data are satisfied.

            I don't know for others.

            We are testing on one node with 8 cores at the moment, as earlier tests with multiple nodes crashed and took other running procs to hell with them.
            8 cores sounds low for ~ 202,136,864 Illumina pe reads.

            Any help with this would be highly appreciated!
            I hope Ray 1.0.0 works for you!


            Btw. anybody out there actually running Ray with IntelMPI?

            Regards,

            Philipp
            As I wrote, IntelMPI is based on MPICH2.
            see http://www.mcs.anl.gov/research/proj...x.php?s=collab

            With Ray 1.0.0, IntelMPI should work fine. And that should be true with g++ and icc.


            @mrawlins 11-23-2010, 09:31 AM


            Ray has worked great for our work with Illumina and 454 reads,
            Yes, mixing technologies compensates for 454 homopolymer errors and for the shorter read length of Illumina.

            but is giving us some trouble in our tests on SOLiD data. Our test set is from the NCBI SRA (run SRR035444, submission SRA009283), which can be downloaded either from NCBI or from the SRA at EBI.
            I added a ticket, but my tests with public datasets from solidsoftwaretools indicated that the error rate of this technology does not allow a de novo assembly with Ray.

            For instance, with k=21, you probably want the error (substitution) rate to be below 1/21. Otherwise a k-mer carries at least one error on average, so most k-mers will be erroneous, and thus unique !

            1 / 21 = 0.0476190476 = 4.76 %

            If I remember well, error rates for these datasets were above that (~12 % or so, I think).
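
To put rough numbers on the 1/k rule of thumb, here is a minimal Python sketch. It assumes independent substitution errors at a uniform per-base rate, which is a simplification and not necessarily how Ray or the sequencers behave; the function name is made up for illustration.

```python
def error_free_kmer_fraction(error_rate: float, k: int) -> float:
    """Probability that a k-mer has no substitution error.

    Assumes independent errors at a uniform per-base rate (a simplification).
    """
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    k = 21
    # 1 %, the 1/k threshold, and the ~12 % remembered for the SOLiD datasets.
    for rate in (0.01, 1.0 / k, 0.12):
        fraction = error_free_kmer_fraction(rate, k)
        print(f"error rate {rate:.4f}, k={k}: {fraction:.1%} of k-mers error-free")
```

Under this simplified model, fewer than one in ten 21-mers is error-free at a 12 % substitution rate, which is consistent with an assembly graph dominated by unique, erroneous k-mers.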



            Datasets are:

            SOLiD™4 System E.Coli DH10B Fragment Data Set
            SOLiD™ System E.Coli DH10B 50X50 Mate-Pair Data Set


            Ray 0.1.0 gets just past the coverage distribution and hangs. The -TheCoverageDistribution.tab file reads:
            #Coverage NumberOfVertices
            255 2
            1431655765 16859178292550
            I think that does not mean anything. 1431655765 is just not possible because the maximum value is 255.

            Can you try again with Ray 1.0.0 and post/send me the results ?
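
A quick way to spot this kind of corruption is to scan the coverage file for values above the maximum. The sketch below is only an illustration: it assumes the two-column, whitespace-separated layout shown in the quoted file and the maximum coverage of 255 stated above, and the function name is hypothetical.

```python
def suspicious_rows(text: str, max_coverage: int = 255):
    """Return (coverage, vertices) pairs whose coverage exceeds max_coverage.

    Assumes '#'-prefixed header lines and two integer columns per data line.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        coverage, vertices = (int(field) for field in line.split())
        if coverage > max_coverage:
            rows.append((coverage, vertices))
    return rows


if __name__ == "__main__":
    sample = "#Coverage NumberOfVertices\n255 2\n1431655765 16859178292550\n"
    for coverage, vertices in suspicious_rows(sample):
        print(f"coverage {coverage} (hex {coverage:#x}) exceeds 255")
```

Incidentally, 1431655765 is 0x55555555 in hexadecimal, a repeating bit pattern, which fits the reading that the value is garbage rather than a real coverage.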

            which seems wrong to me.
            You are not alone.

            Ray was compiled using OpenMPI 1.5 and gcc 4.5.1.
            You are better off with Open-MPI 1.4.3 or MPICH2 1.3.1 or any other super-stable releases. Open-MPI 1.5 is a beta 'feature release'.

            I get this same error using anywhere between 7 and 48 nodes, and it doesn't seem to be a memory issue.
            I would bet on an error rate above 1/k. Try

            mpirun -np 40 Ray -k 15 -p dataLEFT.fastq.bz2 dataRIGHT.fastq.gz

            Supposing that your genome/transcriptome size is far below 1 073 741 824.

            Code:
            4^15 =                    1 073 741 824
            4^21 =              4 398 046 511 104
            4^32 = 18 446 744 073 709 551 616
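
The figures above can be reproduced with a few lines of Python, which has arbitrary-precision integers, so 4^32 does not overflow. The function name is only illustrative.

```python
def kmer_space(k: int) -> int:
    """Number of distinct DNA k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


if __name__ == "__main__":
    for k in (15, 21, 32):
        # The "_" format option groups digits in threes for readability.
        print(f"4^{k} = {kmer_space(k):_}")
```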
            If anybody has experienced this sort of thing and/or has a recommendation on how to fix it that would be great.
            Well, again, my tests on the datasets from http://solidsoftwaretools.com/ indicated that the error rate of the SOLiD technology is not friendly with de novo assembly with Ray.

            Let us hope that 'Exact Call Chemistry' will fix that.






            @pallo Yesterday, 12:29 AM


            @seb567: You are right, trying to run on a bigger set of nodes (20x8cores) as well as a smaller set of larger memory cores made no difference - the jobs hang within 24hrs of startup and do so using 100% cpu load until killed. If I can provide any further debug info, let me know...
            Can you try with Ray 1.0.0 as it is compliant with the standard MPI 2.2 ?

            I replaced MPI_Send with MPI_Isend, and I carefully added some sort of busy-waiting before sending additional messages. Note that I say 'some sort' because an MPI rank can still receive MPI messages while waiting.

            Also, I removed calls to MPI_Iprobe, and I replaced them with a ring of 128 bins of MPI requests that are MPI_Recv_init'ed & MPI_Start'ed at the start of computation.

            Credit for this idea goes to George Bosilca (University of Tennessee & MPI/Open-MPI researcher/scientist).
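
For readers who have not seen the pattern, the ring of pre-posted receives can be pictured with a toy model in plain Python. This is only an analogy: it makes no MPI calls, the class and method names are invented for illustration, and Ray's actual implementation is not reproduced here. The size of 128 is the one mentioned above.

```python
class ReceiveRing:
    """Toy model of a fixed ring of pre-posted receive slots (no real MPI)."""

    def __init__(self, size: int = 128):
        self.slots = [None] * size  # every slot starts "posted" and empty
        self.head = 0               # next slot the application will check
        self.tail = 0               # next slot an incoming message fills
        self.pending = 0            # completed receives not yet consumed

    def deliver(self, message) -> bool:
        """Simulate the library completing the next posted receive.

        Returns False when every slot is busy, i.e. the sender has to wait.
        """
        if self.pending == len(self.slots):
            return False
        self.slots[self.tail] = message
        self.tail = (self.tail + 1) % len(self.slots)
        self.pending += 1
        return True

    def poll(self):
        """Return the oldest completed message, or None if nothing arrived.

        Clearing the slot stands in for re-arming the persistent request.
        """
        if self.pending == 0:
            return None
        message = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.pending -= 1
        return message


if __name__ == "__main__":
    ring = ReceiveRing(size=128)
    ring.deliver("hello from rank 3")
    print(ring.poll())
```

The point of the design, as far as this toy shows it, is that the receiver only has to check one known slot at a time instead of probing for arbitrary incoming messages.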




            @PHSchi: I'm using Open-MPI 1.4.2. For approx. 500M 100bp paired Illumina reads, random checking on nodes suggests that the total memory usage was 400-600GB, but that's just a finger-in-the-wind estimate, and the jobs got stuck so I can't say for sure. The target genome is estimated at around 1.3Gbases.
            Given the genome size and the presence of errors, I must agree with your estimate.

            If each MPI rank provides you with 3 gigabytes of memory, then you need around 200 MPI ranks.

            Code:
            600 / 3 = 200
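
The same back-of-the-envelope estimate as a small Python helper, rounding up so that a fractional rank counts as a whole one. The function name is hypothetical, and the numbers are the estimates from this thread, not measurements.

```python
import math


def ranks_needed(total_memory_gb: float, memory_per_rank_gb: float) -> int:
    """Smallest number of MPI ranks whose combined memory covers the total."""
    if memory_per_rank_gb <= 0:
        raise ValueError("memory per rank must be positive")
    return math.ceil(total_memory_gb / memory_per_rank_gb)


if __name__ == "__main__":
    # Lower and upper ends of the 400-600 GB estimate, at 3 GB per rank.
    for total in (400, 600):
        print(f"{total} GB at 3 GB per rank: {ranks_needed(total, 3)} ranks")
```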
            Contrary to ABySS, which uses google-sparsehash to store data on disk (at least that was true the last time I checked), Ray stores everything in memory.



            So a very linear guess is that you need at least half of this, that is 200-300GB

            cheers
            pallo

            For sure you can't buy that if you work in a laboratory.

            However, in the United States of America, the National Center for Computational Sciences provides resources to scientists.




            In Canada, Compute Canada/Calcul Canada (on parle français et anglais !) provides compute resources to scientists.



            Acknowledgment for Ray 1.0.0

            Élénie Godzaridis (Institut de biologie intégrative et des systèmes de l'Université Laval) for suggesting using End of transmission to pack sequences & suggesting using enum for constants.

            George Bosilca (University of Tennessee) for MPI_Recv_init/MPI_Start and for pointing out that MPI_Send can block even below the eager threshold.

            Jeff Squyres (Cisco) for pointing out that MPI_Send to self is not safe and that MPI_Request_free on an active request is evil.

            Eugene Loh (Oracle) for the correct eager threshold (4000 bytes, not 4096 bytes).

            René Paradis (Centre de recherche du CHUL) for giving me a good-old Sun Blade 100 (SPARC V9, TI UltraSparc IIe (Hummingbird)) & for maintaining my testing boxes.

            Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting that Ray should load interleaved files and GZIP-compressed files.

            Frédéric Lefebvre (CLUMEQ - Université Laval) for installing software on the mighty colosse. http://www.top500.org/system/10195

            The Canadian Institutes of Health Research for my scholarship.


            ChangeLog for Ray 1.0.0

            v. 1.0.0

            r4038 | 2010-11-25

            * Made a lot of changes to make Ray compliant with the standard MPI 2.2
            * Added master and slave modes.
            * Added an array of master methods (pointers): selecting the master method
            with the master mode is done in O(1).
            * Added an array of slave methods (pointers): selecting the slave method
            with the slave mode is done in O(1).
            * Added an array of message handlers (pointers): selecting the message handler method
            with the message tag is done in O(1).
            * Replaced MPI_Send by MPI_Isend. Thanks to Open-MPI developers for their
            support and explanation on the eagerness of Open-MPI: George Bosilca (University of Tennessee), Jeff Squyres (Cisco), Eugene Loh (Oracle)
            * Moved some code for the extension of seeds.
            * Grouped messages for library updates.
            * Added support for paired-end interleaved sequencing reads (-i option)
            Thanks to Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting the feature !
            * Moved detectDistances & updateDistances in their own C++ file.
            * Updated the Wiki.
            * Decided that the next release was 1.0.0.
            * Added support for .fasta.gz and .fastq.gz files, using libz (GZIP).
            Thanks to Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting the feature !
            * Tested with k=17: it uses less memory, but is less precise.
            * Fixed a memory allocation bug when the code runs on 512 cores and more.
            * Added configure script using automake & autoconf.
            Note that if that fails, read the INSTALL file !
            * Moved the code that loads fasta files in FastaLoader.
            * Moved the code that loads fastq files in FastqLoader.
            * Regulated the communication in the MPI 'tribe'.
            * Added an assertion to verify the message buffer length before sending it.
            * Modified bits so that if a message is more than 4096 bytes, split it in
            chunks.
            * Used a sentinel to remove two messages, coupled with TAG_REQUEST_READS.
            * Stress-tested with MPICH2.
            * Implemented a ring allocator for inboxes and outboxes.
            * Changed flushing so that all use <flush> & <flushAll> in BufferedData.
            * Changed the maximum message size from 4096 to 4000 to send messages eagerly
            more often (if it happens). Thanks to Open-MPI developers for their support and explanation on the eagerness of Open-MPI: Eugene Loh (Oracle), George Bosilca (University of Tennessee), Jeff Squyres (Cisco).
            * Changed the way sequencing reads are indexed: before the master was
            reloading (again !) files to do so, now no files are loaded and every MPI rank participates in the task.
            * Modified the way sequences are distributed. These are now appended to fill the buffer, and
            the sentinel called 'End of transmission' is used. Thanks to Élénie Godzaridis for pointing out that '\0' is not a valid sentinel for strings !
            * Optimized the flushing in BufferedData: flush is now destination-specific.
            O(1) instead of O(n) where n is the number of MPI ranks.
            * Optimized the extension: paired information is appended in the buffer in
            which the sequence itself is.
            * Added support for .fasta.bz2 & .fastq.bz2. This needs LIBBZ2 (-lbz2)
            * Added instructions in the INSTALL file for manually compiling the source in
            case the configure script gets tricky (cat INSTALL).
            * Added a received messages file. This is pretty useless unless you want to
            see if the received messages are uniform !
            * Added bits to write the fragment length distribution of each library.
            * Changed the definition of MPI tags: they are now defined with an enum.
            Thanks to Élénie Godzaridis for the suggestion.
            * Changed the definition of slave modes: they are now defined with an enum.
            Thanks to Élénie Godzaridis for the suggestion.
            * Changed the definition of master modes: they are now defined with an enum.
            Thanks to Élénie Godzaridis for the suggestion.
            * Optimized finishFusions: complexity changed from O(N*M) to O(N log M).
            * Designed a beautiful logo with Inkscape.
            * Added a script for regression tests.
            * Changed bits so that a paired read is not updated if it does not need it
            * Changed the meaning of the -o parameter: it is now a prefix.
            * Added examples with MPICH2, Open-MPI, and Open-MPI/SunGridEngine.
            * Changed DEBUG for ASSERT as it activates assertions.
            * Updated the citation in the standard output.
            * Corrected the interleave-fastq python script.
            * Changed the license file from LICENSE to COPYING.
            * Removed the trimming of reads if they are not read from a file.
            * Increased the verbosity of the extension step.
            * Added gnuplot scripts.
            * Changed the file name for changes: from NEWS to ChangeLog.
            * Optimized the MPI layer: replaced MPI_Iprobe by MPI_Recv_init+MPI_Start.
            see MessagesHandler.cpp ! (Thanks to George Bosilca (University of Tennessee) for the suggestion !)
            * Compiled and tested on architecture SPARC V9 (sparc64).
            * Compiled and tested on architecture Intel Itanium (ia64).
            * Compiled and tested on architecture Intel64 (x86_64).
            * Compiled and tested on architecture AMD64 (x86_64).
            * Compiled and tested on Intel architecture (x86/ia32).
            * Evaluated regression tests.
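
The O(1) selection of a handler by message tag mentioned in the changelog can be pictured as an array indexed by an enum. The Python sketch below is an illustration only: Ray itself is written in C++, and apart from the name TAG_REQUEST_READS, which appears in the changelog, the tags, values and handler functions here are hypothetical.

```python
from enum import IntEnum


class Tag(IntEnum):
    # Hypothetical tag values; only REQUEST_READS echoes a name from the changelog.
    REQUEST_READS = 0
    SEND_SEQUENCE = 1
    END_OF_TRANSMISSION = 2


def handle_request_reads(payload):
    return f"request reads: {payload}"


def handle_send_sequence(payload):
    return f"sequence received: {payload}"


def handle_end_of_transmission(payload):
    return "end of transmission"


# One slot per tag: looking up the handler is a single list index, O(1),
# instead of a chain of comparisons that grows with the number of tags.
HANDLERS = [None] * len(Tag)
HANDLERS[Tag.REQUEST_READS] = handle_request_reads
HANDLERS[Tag.SEND_SEQUENCE] = handle_send_sequence
HANDLERS[Tag.END_OF_TRANSMISSION] = handle_end_of_transmission


def dispatch(tag: Tag, payload):
    """Call the handler registered for this tag."""
    return HANDLERS[tag](payload)


if __name__ == "__main__":
    print(dispatch(Tag.SEND_SEQUENCE, "ACGT"))
```

The same idea would apply to the arrays of master and slave methods, indexed by the current mode instead of the message tag.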

            Comment


            • #51
              Ray 1.0.0 doesn't load reads

              Hi Seb,

              I compiled Ray version 1.0.0 today (openMPI 1.4.3, gcc 4.5.1, Fedora 14) and when I run the new executable it stops at loading the first of the paired end Solexa reads (Rank 0 loads nameoffile) and exits. When I use the previous Ray version (0.1.0) with the same command line on the same dataset it runs fine. Tried compiling it twice but got the same result.

              Jean-Francois Pombert

              2x Xeon E5506
              96G RAM
              Intel Server Board S5520HC
              Linux kernel 2.6.35.6-48

              Comment


              • #52
                Ray 1.0.0 doesn't load reads
                Hi Seb,

                I compiled Ray version 1.0.0 today (openMPI 1.4.3, gcc 4.5.1, Fedora 14) and when I run the new executable it stops at loading the first of the paired end Solexa reads (Rank 0 loads nameoffile) and exits. When I use the previous Ray version (0.1.0) with the same command line on the same dataset it runs fine. Tried compiling it twice but got the same result.

                Jean-Francois Pombert

                2x Xeon E5506
                96G RAM
                Intel Server Board S5520HC
                Linux kernel 2.6.35.6-48
                Can you provide more details (by email if you wish) ?

                The module for loading sequences from files has not changed much, but the distribution of sequences has.

                However, I have not seen that glitch.

                Comment


                • #53
                  Here is the console log. I'll look again at the compilation. I might have goofed somehow.

                  Thx

                  JF

                  ****************************************
                  [David@bigdaddy Ray]$ mpirun -np 8 Ray -p 100420_s_7_1_seq_GKD-1.txt 100420_s_7_2_seq_GKD-1.txt -s FQH37LX05.sff -s FQH37LX06.sff -s FTX7HMM01.sff -s FU6LJ3H01.sff -s FWZEL0L06.sff -o test.txt
                  Bienvenue !

                  Rank 0: Ray 1.0.05.sff -s FQH37LX06.sff -s FTX7HMM01.sff -s FU6LJ3H01.sff -s FWZEL0L06.sffRank 0: compiled with Open-MPI 1.4.3
                  seq_GKD-1.txt -s FQH37LX05.sff -s FQH37LX06.sff -s FTX7HMM01.sff -s FU6LJ3H01.sff
                  Rank 0 reports the elapsed time, Thu Nov 25 17:48:37 2010HMM01.sff -s FU6LJ3H01. ---> Step: Beginning of computation
                  Elapsed time: 1 seconds
                  Since beginning: 1 seconds


                  **************************************************
                  This program comes with ABSOLUTELY NO WARRANTY.
                  This is free software, and you are welcome to redistribute it
                  under certain conditions; see "COPYING" for details.
                  **************************************************

                  Ray Copyright (C) 2010 Sébastien Boisvert, Jacques Corbeil, François Laviolette
                  Centre de recherche en infectiologie de l'Université Laval
                  Project funded by the Canadian Institutes of Health Research (Doctoral award 200902CGM-204212-172830 to S.B.)


                  Reference to cite:

                  Sébastien Boisvert, François Laviolette & Jacques Corbeil.
                  Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies.
                  Journal of Computational Biology (Mary Ann Liebert, Inc. publishers, New York, U.S.A.).
                  November 2010, Volume 17, Issue 11, Pages 1519-1533.
                  doi:10.1089/cmb.2009.0238


                  Rank 0 welcomes you to the MPI_COMM_WORLD
                  Rank 0 is running as UNIX process 18016 on bigdaddy
                  Rank 2 is running as UNIX process 18018 on bigdaddy
                  Rank 3 is running as UNIX process 18019 on bigdaddy
                  Rank 5 is running as UNIX process 18021 on bigdaddy
                  Rank 1 is running as UNIX process 18017 on bigdaddy
                  Rank 4 is running as UNIX process 18020 on bigdaddy
                  Rank 7 is running as UNIX process 18023 on bigdaddy
                  Rank 0: I am the master among 8 ranks in the MPI_COMM_WORLD.

                  Ray command:

                  Ray \
                  -p \
                  100420_s_7_1_seq_GKD-1.txt \
                  100420_s_7_2_seq_GKD-1.txt \
                  -s \
                  FQH37LX05.sff \
                  -s \
                  FQH37LX06.sff \
                  -s \
                  FTX7HMM01.sff \
                  -s \
                  FU6LJ3H01.sff \
                  -s \
                  FWZEL0L06.sff \
                  -o \
                  test.txt

                  -p (paired-end sequences)
                  Left sequences: 100420_s_7_1_seq_GKD-1.txt
                  Right sequences: 100420_s_7_2_seq_GKD-1.txt
                  Average length: auto
                  Standard deviation: auto

                  -s (single sequences)
                  Sequences: FQH37LX05.sff

                  -s (single sequences)
                  Sequences: FQH37LX06.sff

                  -s (single sequences)
                  Sequences: FTX7HMM01.sff

                  -s (single sequences)
                  Sequences: FU6LJ3H01.sff

                  -s (single sequences)
                  Sequences: FWZEL0L06.sff

                  k-mer size: 21
                  --> Number of k-mers of size 21: 4398046511104
                  *** Note: A lower k-mer size bounds the memory usage. ***

                  Rank 0 is loading 100420_s_7_1_seq_GKD-1.txt
                  Rank 6 is running as UNIX process 18022 on bigdaddy
                  [David@bigdaddy Ray]$
                  ********************************************************

                  Comment


                  • #54
                    Très cher Jean-Francois Pombert,


                    Thank you for your timely answer.

                    In Ray 0.1.0 and before, fasta and fastq were detected using the first line in the file.

                    In Ray 1.0.0, I solely use the file extension to select the appropriate loader.

                    Ray \
                    -p \
                    100420_s_7_1_seq_GKD-1.txt \
                    100420_s_7_2_seq_GKD-1.txt \
                    -s \
                    FQH37LX05.sff \
                    -s \
                    FQH37LX06.sff \
                    -s \
                    FTX7HMM01.sff \
                    -s \
                    FU6LJ3H01.sff \
                    -s \
                    FWZEL0L06.sff \
                    -o \
                    test.txt
                    So, Ray does not know what to do with .txt files and just stops.

                    Usage:

                    Supported sequences file format:
                    .fasta
                    .fasta.gz
                    .fasta.bz2
                    .fastq
                    .fastq.gz
                    .fastq.bz2
                    .sff (paired reads must be extracted manually)


                    Parameters:

                    Single-end reads
                    -s <sequencesFile>

                    Paired-end reads:
                    -p <leftSequencesFile> <rightSequencesFile> [ <fragmentLength> <standardDeviation> ]

                    Paired-end reads:
                    -i <interleavedFile> [ <fragmentLength> <standardDeviation> ]

                    Output (default: Ray-Contigs.fasta)
                    -o <outputFile>

                    AMOS output
                    -a

                    k-mer size (default: 21)
                    -k <kmerSize>


                    I will add a specific message to alert the user when a file extension is not supported.
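
To illustrate the idea of selecting a loader from the file extension and complaining loudly otherwise, here is a minimal Python sketch. It is not Ray's code (Ray is C++): the function name and the error message are hypothetical, and only the list of extensions comes from the usage text above.

```python
SUPPORTED_EXTENSIONS = (
    ".fasta", ".fasta.gz", ".fasta.bz2",
    ".fastq", ".fastq.gz", ".fastq.bz2",
    ".sff",
)


def detect_format(file_name: str) -> str:
    """Return the supported extension that file_name ends with.

    Raises ValueError with an explicit message for anything else,
    instead of stopping silently.
    """
    for extension in SUPPORTED_EXTENSIONS:
        if file_name.endswith(extension):
            return extension
    raise ValueError(
        f"{file_name}: unsupported extension, expected one of "
        + ", ".join(SUPPORTED_EXTENSIONS)
    )


if __name__ == "__main__":
    for name in ("FQH37LX05.sff", "100420_s_7_1_seq_GKD-1.txt"):
        try:
            print(name, "->", detect_format(name))
        except ValueError as error:
            print("Error:", error)
```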

                    Thank you for your interest in Ray !

                    Comment


                    • #55
                      @jfpombert

                      I forgot to provide a fix.

                      quick fix:

                      ln -s 100420_s_7_1_seq_GKD-1.txt 100420_s_7_1_seq_GKD-1.txt.fastq

                      ln -s 100420_s_7_2_seq_GKD-1.txt 100420_s_7_2_seq_GKD-1.txt.fastq

                      mpirun -np 8 \
                      Ray \
                      -p \
                      100420_s_7_1_seq_GKD-1.txt.fastq \
                      100420_s_7_2_seq_GKD-1.txt.fastq \
                      -s \
                      FQH37LX05.sff \
                      -s \
                      FQH37LX06.sff \
                      -s \
                      FTX7HMM01.sff \
                      -s \
                      FU6LJ3H01.sff \
                      -s \
                      FWZEL0L06.sff \
                      -o \
                      test.txt
                      or using bzip2, you will save precious space:

                      bzip2<100420_s_7_1_seq_GKD-1.txt>100420_s_7_1_seq_GKD-1.txt.fastq.bz2

                      bzip2<100420_s_7_2_seq_GKD-1.txt>100420_s_7_2_seq_GKD-1.txt.fastq.bz2

                      Ray \
                      -p \
                      100420_s_7_1_seq_GKD-1.txt.fastq.bz2 \
                      100420_s_7_2_seq_GKD-1.txt.fastq.bz2 \
                      -s \
                      FQH37LX05.sff \
                      -s \
                      FQH37LX06.sff \
                      -s \
                      FTX7HMM01.sff \
                      -s \
                      FU6LJ3H01.sff \
                      -s \
                      FWZEL0L06.sff \
                      -o \
                      test.txt

                      Thank you for providing a detailed report of what you did.

                      Comment


                      • #56
                        Ok, great, I'll just change the extensions.

                        Un gros merci!

                        JF

                        Comment


                        • #57
                          processes aborted

                          Really excited to try out Ray! I first tried to grab the example datasets, but the links are dead.. getting a 550 error, no such file..

                          So then I went ahead and tried to assemble my own data, a very homozygous (>96%) mammalian genome sequenced on Illumina with paired 105bp reads. Ray is failing and I do not understand why.

                          Ray ran for just over 2 hours on 256 cores before dying. Here are my commands:

                          Code:
                          use intel-openmpi-1.4.2
                          use Ray-0.1.0
                          
                          mpirun -np 256 Ray -p $wd\Lunde_1.fq $wd\Lunde_2.fq -o Lunde-contigs
                          And the output I get:

                          Code:
                          Rank 0 welcomes you to the MPI_COMM_WORLD.
                          Rank 0: website -> http://denovoassembler.sf.net/
                          Rank 0: using Open-MPI 1.2.7
                          Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)
                          .
                          .
                          .
                          Rank 243 is running as UNIX process 4231 on s26-4.local (MPI version 2.0)
                          Rank 0: I am the master among 256 ranks in the MPI_COMM_WORLD.
                          
                          Rank 0: Ray 0.1.0 is running
                          Rank 0: operating system is Linux (during compilation)
                          
                          LoadPairedEndReads
                           Left sequences: /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_1.fq
                           Right sequences: /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_2.fq
                           Average length: auto
                           Standard deviation: auto
                          
                          k-mer size: 21
                           --> Number of k-mers of size 21: 4398046511104
                            *** Note: A lower k-mer size bounds the memory usage. ***
                          
                          
                          Rank 0 loads /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_1.fq.
                          Rank 0 has 140174250 sequences to distribute.
                          Rank 0 distributes sequences, 1/140174250
                          mpirun noticed that job rank 1 with PID 4194 on node s28-2 exited on signal 15 (Terminated). 
                          254 additional processes aborted (not shown)
                          1 process killed (possibly by Open MPI)
                          Rank 0 welcomes you to the MPI_COMM_WORLD.
                          Rank 0: website -> http://denovoassembler.sf.net/
                          Rank 0: using Open-MPI 1.2.7
                          Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)
                          Is there something wrong with my configuration?



                          • #58
                            Really excited to try out Ray! I first tried to grab the example datasets, but the links are dead.. getting a 550 error, no such file..
                            NCBI moved their infrastructure from .fastq to .sra files.

                            My favorite toy dataset is SRA001125, Illumina data of E. coli K-12 MG1655.

                            Search for SRA001125 and you'll find it.

                            So then I went ahead and tried to assemble my own data: a highly homozygous (>96%) mammalian genome sequenced on Illumina with paired 105 bp reads. Ray is failing and I do not understand why.
                            Many reasons can explain that.



                            Ray ran for just over 2 hours on 256 cores before dying. Here are my commands:



                            use intel-openmpi-1.4.2
                            use Ray-0.1.0
                            You are using Ray 0.1.0! Try Ray 1.0.0; I assure you it includes many fixes.

                            v. 1.0.0 is the release with the most changes to date.

                            Download Ray: scalable assembly for free. Ray: Parallel genome assemblies for parallel DNA sequencing. De novo genome assembly is now a challenge because of the overwhelming amount of data produced by sequencers. Ray assembles reads obtained with new sequencing technologies (Illumina, 454, SOLiD) using MPI 2.2, a message-passing interface standard.
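Since the start-up banner prints the Ray version, a log can be checked for an outdated build before a long run is queued. This is only a minimal sketch, assuming the banner format shown in this thread (`Rank 0: Ray 0.1.0 is running`). The function names `ray_version_from_log` and `version_lt` are made up for illustration and are not part of Ray, and `sort -V` assumes GNU coreutils or a compatible `sort`.

```shell
# Hypothetical helper: pull the Ray version out of a banner line such as
# "Rank 0: Ray 0.1.0 is running" (format taken from the log quoted in this thread).
ray_version_from_log() {
    sed -n 's/^Rank 0: Ray \([0-9][0-9.]*\) is running.*/\1/p' | head -n 1
}

# Succeed when version $1 is strictly older than version $2.
# Assumption: `sort -V` (version sort) is available.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

found=$(printf 'Rank 0: Ray 0.1.0 is running\n' | ray_version_from_log)
if version_lt "$found" "1.0.0"; then
    echo "Ray $found is older than 1.0.0; ask for an upgrade"
fi
```

In practice the `printf` would be replaced by the saved standard output of the job.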


                            mpirun -np 256 Ray -p $wd\Lunde_1.fq $wd\Lunde_2.fq -o Lunde-contigs
                            Do you have access to an SMP machine with 256 processor cores?!

                            If so, I envy you.

                            And the output I get:

                            Code:

                            Rank 0 welcomes you to the MPI_COMM_WORLD.
                            Rank 0: website -> http://denovoassembler.sf.net/
                            Rank 0: using Open-MPI 1.2.7
                            So basically, you are using a bad mix of software: you load intel-openmpi-1.4.2 but run a Ray executable compiled against Open-MPI 1.2.7.

                            This will surely fail!
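The mismatch can be spotted mechanically by comparing the MPI version Ray prints at start-up with the one named in the loaded environment. What follows is a rough sketch, assuming the log line format quoted above and that the environment name ends in the MPI version (as `intel-openmpi-1.4.2` does). Both helper names are hypothetical.

```shell
# Hypothetical helper: read the MPI version that the Ray binary reports at
# start-up, e.g. "Rank 0: using Open-MPI 1.2.7" (line format from the log above).
mpi_version_in_ray_log() {
    sed -n 's/^Rank 0: using Open-MPI \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: keep what follows the last "-" in an environment name
# such as "intel-openmpi-1.4.2" (assumes the name ends in the version).
mpi_version_in_module_name() {
    printf '%s\n' "${1##*-}"
}

built=$(printf 'Rank 0: using Open-MPI 1.2.7\n' | mpi_version_in_ray_log)
loaded=$(mpi_version_in_module_name "intel-openmpi-1.4.2")

if [ "$built" != "$loaded" ]; then
    echo "mismatch: Ray reports Open-MPI $built, environment provides $loaded"
fi
```

On the cluster itself, `which mpirun` shows which launcher comes first on the PATH and, if Ray was linked dynamically, `ldd "$(which Ray)"` lists the MPI libraries it will load.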


                            Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)
                            The latest standard is MPI 2.2, from 2009. MPICH2 and Open-MPI 1.4.3 comply with MPI 2.2.

                            Ray works with MPI 2.0 too, I guess.

                            Rank 0: Ray 0.1.0 is running
                            As I said, 0.1.0 is defunct. Embrace the new 1.0.0.

                            The next release is coming soon.

                            Ray for large genomes is on its way!

                            My last test on human chromosome 1 (the largest) with one library of length 200 and another of length 400 shows great success:


                            Rank 0: 69173 contigs/205904915 nucleotides

                            Rank 0 reports the elapsed time, Sun Nov 28 20:38:22 2010
                            ---> Step: Collection of fusions
                            Elapsed time: 1 minutes, 16 seconds
                            Since beginning: 8 hours, 22 minutes, 4 seconds

                            Elapsed time for each step, Sun Nov 28 20:38:22 2010

                            Beginning of computation: 3 seconds
                            Distribution of sequence reads: 25 minutes, 3 seconds
                            Distribution of vertices: 1 minutes, 16 seconds
                            Calculation of coverage distribution: 1 seconds
                            Distribution of edges: 1 minutes, 30 seconds
                            Indexing of sequence reads: 2 seconds
                            Computation of seeds: 10 minutes, 39 seconds
                            Computation of library sizes: 4 minutes, 51 seconds
                            Extension of seeds: 7 hours, 33 minutes, 36 seconds
                            Computation of fusions: 3 minutes, 47 seconds
                            Collection of fusions: 1 minutes, 16 seconds
                            Completion of the assembly: 8 hours, 22 minutes, 4 seconds

                            Rank 0 wrote r4068-human.CoverageDistribution.txt
                            Rank 0 wrote r4068-human.Library0.txt
                            Rank 0 wrote r4068-human.Library1.txt
                            Rank 0 wrote r4068-human.fasta
                            Rank 0 wrote r4068-human.ReceivedMessages.txt
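To put the timing table in perspective, the durations can be converted to seconds and compared. This is a small sketch using two figures from the log above. `to_seconds` is a hypothetical helper, not something Ray provides.

```shell
# Convert a duration as printed in the log, e.g. "7 hours, 33 minutes, 36 seconds",
# into seconds. Units that are absent simply contribute nothing.
to_seconds() {
    local total=0 value unit
    # Drop the commas, then walk the "<number> <unit>" pairs.
    set -- $(printf '%s' "$1" | tr -d ',')
    while [ "$#" -ge 2 ]; do
        value=$1
        unit=$2
        shift 2
        case "$unit" in
            hour*)   total=$(( total + value * 3600 )) ;;
            minute*) total=$(( total + value * 60 )) ;;
            second*) total=$(( total + value )) ;;
        esac
    done
    echo "$total"
}

extension=$(to_seconds "7 hours, 33 minutes, 36 seconds")
whole_run=$(to_seconds "8 hours, 22 minutes, 4 seconds")
# Integer arithmetic: the percentage is rounded down.
echo "Extension of seeds: $extension s of $whole_run s ($(( extension * 100 / whole_run ))%)"
```

With these numbers the extension of seeds accounts for 27216 s of 30124 s, roughly 90% of the run, so that step dominates the wall-clock time.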


                            Is there something wrong with my configuration?
                            Your configuration is wrong in two independent ways.

                            1. You are using Ray 0.1.0, not Ray 1.0.0.

                            2. You are running an executable compiled against Open-MPI 1.2.7 with, I believe, Open-MPI 1.4.2.


                            Thank you for your interest in Ray!

                            "The Ray of light is coming to life, and the Ray of darkness is fading away."

                            -Seb



                            • #59
                              Many thanks seb567!

                              I had to have my IT department compile and install Ray on our cluster and did not notice they used the old version of Ray. Thanks for pointing this out. I have requested the 1.0.0 version to be installed, along with questions about the MPI version available.

                              Any idea when the new version will be available for mammalian genomes? Looking forward to it!

                              I will keep you posted on my progress once I get the new version up and running. Thanks again!



                              • #60
                                I had to have my IT department compile and install Ray on our cluster and did not notice they used the old version of Ray. Thanks for pointing this out. I have requested the 1.0.0 version to be installed, along with questions about the MPI version available.
                                I think you are fine with Open-MPI 1.4.2 compiled with the Intel compiler (use intel-openmpi-1.4.2).


                                Any idea when the new version will be available for mammalian genomes? Looking forward to it!
                                Before Friday, for sure.

                                I will keep you posted on my progress once I get the new version up and running.
                                Thank you for your updates!

