
  • gsRunProcessor: error with v.2.0.00.22

    This question is more along the lines of "has anyone else seen this behavior, or am I just going crazy?" If you have seen the following, please let me know. In a day or two I will dig further into the problem and send a bug report to Roche, but it would be handy to know that I am not alone.

    -------------------

    We have the 454/Roche Newbler software on our "custom compute cluster", i.e., a box that we did not purchase from Roche. It is a 16-CPU system with lots of memory that is capable of running MPI, although I have yet to get it to do so. Instead we have been using gsRunProcessor with "GS_LAUNCH_MODE=MULTI" and "GS_NUM_PROCESSORS=16". This has worked great with version 2.0.00.20 of the software.
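
    For context, the launch setup amounts to the following shell snippet (a sketch; set in the shell before invoking the pipeline):

    export GS_LAUNCH_MODE=MULTI     # multi-process launch on a single box, no cluster MPI
    export GS_NUM_PROCESSORS=16     # one worker per CPU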

    Recently we installed the 2.0.00.22 patch, which was released on 1-26-2009. Unfortunately this has broken our setup. Using the same MULTI mode that we have been using in the past, the .22 software bombs out (after 6 hours of running) with MPI (!) errors. This apparently causes other programs to crash, e.g., the program 'compute_FlowHist1_4'. The error reports items such as:

    application called MPI_Abort(MPI_COMM_WORLD, 1)
    Fatal error in MPI_Barrier: Other MPI error, error stack:
    MPI_Barrier(406)
    MPI_Barrier(comm=0x84000000) failed
    MPIR_Barrier(77)
    MPIC_Sendrecv(126)

    And so on. A really ugly error log, especially considering I am not requesting MPI mode but rather the MULTI launch mode.


    As I mentioned, I will try troubleshooting this more in a couple of days, after my current runs are processed. In the meantime, any words of advice or sympathy?

  • #2
    I know there have been some changes; I saw a lot of differences in speed between the two versions (using MULTI also).

    I think, though, that MULTI does use MPI; the name just indicates multicore MPI vs. multi-CPU. I have no idea what they do differently, though I can think of a ton of things that would make sense to do differently for core-to-core communication vs. CPU-to-CPU over gigabit Ethernet.

    I would diagnose and get MPI running properly; that might solve it.



    • #3
      failure to load CWF files/runAnalysisPipe

      I've been successful with runAnalysisPipe on some 1/4 Ti signal processing, and now I'm trying a full Ti (333 Mb) but end up with the following errors in gsRunProcessor, both on and off the rig (note: below is --verbose output):

      [Wed Feb 25 09:02:21 2009][Debug][] Logging configured.
      [Wed Feb 25 09:02:21 2009][Information][] gsRunProcessor 2.0.00.22 (Build 184) Starting
      [Wed Feb 25 09:02:21 2009][Debug][] Parsing pipeline: /etc/gsRunProcessor/signalProcessing.xml
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step NukeSignalStrengthBalancer (pass 1)
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step BlowByCorrector
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step CafieCorrector
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step NukeSignalStrengthBalancer (pass 2)
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step IndividualWellScaler
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step MostLikelyErrorSubtractor
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step WellScreener (pass 1)
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step MetricsGenerator
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step QualityFilter
      [Wed Feb 25 09:02:21 2009][Debug][ProcessingEngine] Adding step BaseCaller
      [Wed Feb 25 09:02:29 2009][Information][] Detected processor speed: 2129 MHz.
      [Wed Feb 25 09:02:39 2009][Notice][ProcessingEngine] Starting job eb6dd82e-0344-11de-ad6c-0010182f8ff4.
      [Wed Feb 25 09:02:39 2009][Debug][ProcessingEngine] Creating 1 processing group(s).
      [Wed Feb 25 09:02:39 2009][Information][ProcessingEngine] Using memory-only storage for flowgrams.
      [Wed Feb 25 09:02:39 2009][Notice][ProcessingEngine] Processing Group 0 : Loading data.
      [Wed Feb 25 09:02:39 2009][Information][ProcessingEngine] Opening file /home/kc/Desktop/D_2009_02_20_12_47_04_FLX02070135_imageProcessingOnly/regions/2.cwf
      [Wed Feb 25 09:02:39 2009][Debug][ProcessingEngine] Region 2 : Process 0 is loading 2049,1 4095,4095
      -------
      Any ideas? Roche has recommended a total reinstall of OS and software.



      • #4
        That looks like pretty normal output. The only thing I see is that it looks like only one CPU is being used; I get more like:
        [Thu Mar 05 23:04:08 2009][Information][] gsRunProcessor 2.0.00.22 (Build 184) Starting
        [Thu Mar 05 23:04:08 2009][Information][] gsRunProcessor 2.0.00.22 (Build 184) Starting
        [Thu Mar 05 23:04:08 2009][Information][] gsRunProcessor 2.0.00.22 (Build 184) Starting
        [Thu Mar 05 23:04:08 2009][Information][] gsRunProcessor 2.0.00.22 (Build 184) Starting
        [Thu Mar 05 23:04:08 2009][Information][] gsRunProcessor 2.0.00.22 (Build 184) Starting
        [Thu Mar 05 23:04:08 2009][Information][] gsRunProcessor 2.0.00.22 (Build 184) Starting
        [Thu Mar 05 23:04:08 2009][Information][] gsRunProcessor 2.0.00.22 (Build 184) Starting
        [Thu Mar 05 23:04:11 2009][Information][] Detected processor speed: 2826 MHz.
        [Thu Mar 05 23:04:21 2009][Notice][ProcessingEngine] Starting job 3945dede-0a0c-11de-8556-001d0933401b.
        [Thu Mar 05 23:04:21 2009][Information][ProcessingEngine] Using memory-only storage for flowgrams.


        You might show us the output of 'env | grep GS'
        and give a little detail on what hardware you have.
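
        For reference, a working MULTI setup from earlier in this thread would show something like this (values are examples, not prescriptions):

        env | grep GS
        # GS_LAUNCH_MODE=MULTI
        # GS_NUM_PROCESSORS=16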



        • #5
          I agree with Tom. Looks normal to me.

          BTW: As a follow-up to the original message in this thread (mine), I did get MPI to work on my computers, which enabled MULTI to work as well. I'm still a bit irritated that MULTI requires MPI, but as long as I can get it to work, hey, that is good enough.



          • #6
            Many thanks, everyone. I heard from GS Support, and this is the solution:
            "edit the ~/.bash_profile so the following environmental variables are set:

            export GS_LAUNCH_MODE=GSRPM
            export GS_CACHEDIR=/data


            On the data rig it should read:

            export GS_LAUNCH_MODE=MULTI
            export GS_CACHEDIR=/data

            The rig state I have of your sequencing machine does not show that these environment variables are set. "
            With regard to hardware, the log is from our FLX instrument (circa Feb '07).



            • #7
              Originally posted by engencore:
              export GS_CACHEDIR=/data
              If it is not obvious, GS_CACHEDIR should be set to some place where you have lots of temporary (or scratch) space. On my off-rig machine this is not "/data" but rather "/scratch/westerm", so I have 'export GS_CACHEDIR=/scratch/westerm' in my profile.
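
              A quick sanity check before pointing the cache somewhere (a sketch; the path is just my example):

              df -h /scratch/westerm               # confirm plenty of free space first
              export GS_CACHEDIR=/scratch/westerm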



              • #8
                runAnalysisPipe crash

                I've reinstalled the 2.0.00.20 software as root, since runAnalysisPipe was not running properly. I have MULTI in the bash_profile and got the following:
                -----------------
                [root@engencorelinux R_2009_02_20_12_45_31_FLX02070135_adminrig_Project6-Sample1018]# runAnalysisPipe --verbose D_2009_02_20_12_47_04_FLX02070135_imageProcessingOnly/
                Output files will appear in /root/Desktop/2009_02_20/R_2009_02_20_12_45_31_FLX02070135_adminrig_Project6-Sample1018/D_2009_03_18_12_26_27_localhost_signalProcessing
                [Debug] Root interfaces: 127.0.0.1:4540|172.20.73.210:4540|
                [Debug] Logging configured.
                [Information] gsRunProcessor 2.0.00.20 (Build 91) Starting
                [Debug] Parsing pipeline: /etc/gsRunProcessor/signalProcessing.xml
                peer[Debug] Trying UDP log connection to root.
                peer[Debug] Confirming UDP log connection to root.
                [Debug] Logging configured.
                [Information] gsRunProcessor 2.0.00.20 (Build 91) Starting
                [Debug] Parsing pipeline: /etc/gsRunProcessor/signalProcessing.xml
                ProcessingEngine[Debug] Adding step NukeSignalStrengthBalancer
                ProcessingEngine[Debug] Adding step BlowByCorrector
                ProcessingEngine[Debug] Adding step CafieCorrector
                ProcessingEngine[Debug] Adding step NukeSignalStrengthBalancer
                ProcessingEngine[Debug] Adding step IndividualWellScaler
                ProcessingEngine[Debug] Adding step MostLikelyErrorSubtractor
                ProcessingEngine[Debug] Adding step WellScreener
                ProcessingEngine[Debug] Adding step MetricsGenerator
                ProcessingEngine[Debug] Adding step QualityFilter
                ProcessingEngine[Debug] Adding step BaseCaller
                ProcessingEngine[Debug] Adding step NukeSignalStrengthBalancer
                ProcessingEngine[Debug] Adding step BlowByCorrector
                ProcessingEngine[Debug] Adding step CafieCorrector
                ProcessingEngine[Debug] Adding step NukeSignalStrengthBalancer
                ProcessingEngine[Debug] Adding step IndividualWellScaler
                ProcessingEngine[Debug] Adding step MostLikelyErrorSubtractor
                ProcessingEngine[Debug] Adding step WellScreener
                ProcessingEngine[Debug] Adding step MetricsGenerator
                ProcessingEngine[Debug] Adding step QualityFilter
                ProcessingEngine[Debug] Adding step BaseCaller
                [Information] Detected processor speed: 2128 MHz.
                ProcessingEngine[Notice] Starting job 880c6af2-13d9-11de-ac7b-0010182f8ff4.
                ProcessingEngine[Debug] Creating 2 processing group(s).
                ProcessingEngine[Debug] Creating 2 processing group(s).
                ProcessingEngine[Information] Using memory-only storage for flowgrams.
                ProcessingEngine[Debug] Rank 0 is member 0 of group 0.
                ProcessingEngine[Notice] Processing Group 0 : Loading data.
                ProcessingEngine[Debug] Rank 1 is member 0 of group 1.
                ProcessingEngine[Notice] Processing Group 1 : Loading data.
                ProcessingEngine[Debug] Opening file /root/Desktop/2009_02_20/R_2009_02_20_12_45_31_FLX02070135_adminrig_Project6-Sample1018/D_2009_02_20_12_47_04_FLX02070135_imageProcessingOnly/regions/2.cwf
                ProcessingEngine[Debug] Opening file /root/Desktop/2009_02_20/R_2009_02_20_12_45_31_FLX02070135_adminrig_Project6-Sample1018/D_2009_02_20_12_47_04_FLX02070135_imageProcessingOnly/regions/1.cwf
                application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1[cli_1]: aborting job:
                application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
                [Fatal] Processing aborted via SIGINT.
                [Information] Deleting partial results files.
                [0]0:Return code = 0, signaled with Interrupt
                [0]1:Return code = 1


                --------------
                I changed MULTI to SINGLE and it appears to run fine now. Is the problem the eth0 on the data rig?
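
                For the record, the change is just the one environment-variable line in ~/.bash_profile:

                export GS_LAUNCH_MODE=SINGLE
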
                thanks,
                joe



                • #9
                  If I'm running multi-host MPI (8 threads, 4 on each of 2 machines): should GS_CACHEDIR be a shared directory visible to both of them, or could it be space that is truly local to each machine (i.e., on a local disk)?

                  The latter would be a great way to reduce both network traffic and load on the shared fileserver, provided that it worked right.



                  • #10
                    In regards to GS_CACHEDIR being local or shared, I am not sure, and the manual does not seem to address the question. I would suspect that any cache directory could be local. Perhaps the best bet is to try it both ways and see what happens. Unfortunately I do not have a good way to test this out on my machines. Please report back.



                    • #11
                      If I specify GS_CACHEDIR then I see one 8 GB file appear in that directory for each thread that is running. It doesn't appear to affect runtime or results whether the cache files are local to the nodes or on a shared filesystem. However, I haven't seen the systems lock up since I started using GS_CACHEDIR, so perhaps it has other benefits in terms of memory usage or something.
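
                      (You can watch the cache files appear during a run; a trivial check, assuming GS_CACHEDIR is exported in the current shell:

                      ls -lh "$GS_CACHEDIR"

                      shows one ~8 GB file per running thread.)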

                      I find this sort of systems archeology pretty frustrating: "Hey look! This button does this other thing!" It's also a bit worrisome that we don't know whether these options affect scientific results. Has anyone received more detailed assistance from Roche on best practices for running these tools on a cluster? I feel like these questions might best be answered in an advanced version of the user manual.



                      • #12
                        The "Genome Sequencer System Site Preparation Guide, October 2008"
                        says (p.39ff):

                        "GS_CACHEDIR
                        This should be set to the location of a fast local disk.
                        Up to 8GB of temporary files per process could be generated.
                        The default is ‘/data’ if a ‘/data’ directory exists,
                        otherwise /tmp is used."
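
                        In shell terms, the documented default resolution is simply (a sketch of the quoted behavior, not Roche's actual code):

                        if [ -d /data ]; then
                            export GS_CACHEDIR=/data     # fast local disk, if present
                        else
                            export GS_CACHEDIR=/tmp      # fallback, easy to overflow
                        fi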

                        The Roche manuals are not too bad, at least as long as there are no problems. In my experience, many vendors do not supply much background information (in terms of software implementation) about their machines.

                        We are running signal processing on a 32-core system using GS_LAUNCH_MODE=MULTI. I'd say the cache should always be local to the machines the jobs run on.

                        just my 2p,
                        Sven



                        • #13
                          That's quite interesting, especially since in some cases overflowing /tmp will bring the system to a screeching halt (the hard lock-up we observed above). If the processes are writing up to 8 GB apiece there by default, they could easily run out of space. Also, since most systems clear /tmp at boot time, the evidence would be gone by the time the system came back up clean.
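
                          (A cheap guard before launching a big job: check that the cache filesystem has headroom for 8 GB per process, here assuming /tmp as the fallback default:

                          df -h "${GS_CACHEDIR:-/tmp}"

                          and point GS_CACHEDIR elsewhere if it doesn't.)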

                          Thank you for the tip. I'll dig around and see if one of those manuals got left at the site.



                          • #14
                            I was not aware of GS_CACHEDIR until I filled up /tmp and our server refused to continue working ;-)

                            Roche support pointed me to the Site Preparation Guide, which I think is the wrong place to put this essential information.



                            • #15
                              engencore,

                              That error looks like an MPI problem, I think. You may want to make sure OpenMPI is installed and happy.
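
                              A quick way to check that (a sketch; 'hostname' is just a harmless test program):

                              which mpirun              # is an MPI runtime on the PATH?
                              mpirun -np 2 hostname     # can it actually launch processes?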

                              Tom

