
  • GATK Queue not running jobs in parallel

    Hi all, I've just started experimenting with using GATK-Queue to run through a GATK pipeline with some scatter/gather parallelism. I'm none too familiar with Java/Scala, but I think the script I've written should work fine.

    When I run this script with Queue, it seems to work fine (I've only run it on tiny test BAMs so far), but the jobs don't run in parallel. For all the tasks where I've specified a scatterCount, the task is correctly split into four jobs, but those jobs are then executed one after the other rather than in parallel.

    Has anyone encountered anything like this in GATK-Queue before? Is there some command-line switch or line of code I need to input into the script to make it run jobs in parallel? I've looked through the docs and the -help and everything seemed to suggest parallelization should just work out of the box.

    EDIT: Really should have looked at the outputs more closely before I posted this - it seems only some of the sections aren't running in parallel. I'll fiddle with the script and report back if I can figure out what's going on.

    Command line:
    Code:
    java -Djava.io.tmpdir=/scratch/queuetmp -jar /resources/Sting/dist/Queue.jar -S /apps/pipeline/script.queue -I /scratch/tmp/testsample.dedup.bam -G /resources/genome/human_g1k_v37.fa -run
    Script:
    Code:
    package org.broadinstitute.sting.queue.qscripts.examples
    
    import org.broadinstitute.sting.queue.QScript
    import org.broadinstitute.sting.queue.extensions.gatk._
    
    
    class MyPipeline extends QScript {
    	
    	qscript =>
    
    	@Input(doc="Reference genome.", shortName="G")
    	var referenceFile: File = _
    
    	@Input(doc="Bam file to process.", shortName="I")
    	var bamFile: File = _
    
    
    	trait MemoryLimitAndReference extends CommandLineGATK {
    		this.reference_sequence = qscript.referenceFile
    		this.memoryLimit = 2
    	}
    	
    	class SamtoolsIndex extends CommandLineFunction {
    		@Input(doc="input file")
    		var inputFile: File = _
    		def commandLine = "samtools index " + inputFile
    	}
    
    	def script() {
    	 val realignertargetcreator  = new RealignerTargetCreator with MemoryLimitAndReference
    	 val indelrealigner = new IndelRealigner with MemoryLimitAndReference
    	 val countcovariates = new CountCovariates with MemoryLimitAndReference
    	 val countpostrecalcovariates = new CountCovariates with MemoryLimitAndReference
    	 val tablerecalibration = new TableRecalibration with MemoryLimitAndReference
    	 val samtoolsindex = new SamtoolsIndex
    
    	
    	 realignertargetcreator.input_file :+= qscript.bamFile
    	 realignertargetcreator.scatterCount = 4
    	 realignertargetcreator.known = List("/resources/rods/1000G_biallelic.indels.b37.vcf")
    	 realignertargetcreator.out = swapExt(qscript.bamFile, "bam", "realignment.intervals")
    	
    	 indelrealigner.input_file :+= qscript.bamFile
    	 indelrealigner.targetIntervals = realignertargetcreator.out
    	 indelrealigner.scatterCount = 4
    	 indelrealigner.known = realignertargetcreator.known
    	 indelrealigner.consensusDeterminationModel = org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.ConsensusDeterminationModel.USE_SW
    	 indelrealigner.out = swapExt(qscript.bamFile, "bam", "realigned.bam")
    	 
    	 samtoolsindex.inputFile = indelrealigner.out
    	 
    	 countcovariates.scatterCount = 4
    	 countcovariates.input_file :+= indelrealigner.out
    	 countcovariates.recal_file = swapExt(indelrealigner.out, "bam", "covariatecount.csv")
    	 countcovariates.standard_covs = true
    	 countcovariates.knownSites = List("/resources/rods/dbsnp_132.b37.vcf")
    	
    	 tablerecalibration.scatterCount = 4
    	 tablerecalibration.input_file :+= indelrealigner.out
    	 tablerecalibration.recal_file = countcovariates.recal_file
    	 tablerecalibration.out = swapExt(indelrealigner.out, "bam", "recal.bam")
    
    	 countpostrecalcovariates.scatterCount = 4
    	 countpostrecalcovariates.input_file :+= tablerecalibration.out
    	 countpostrecalcovariates.recal_file = swapExt(indelrealigner.out, "bam", "postrecalcounts.csv")
    	 countpostrecalcovariates.standard_covs = true
    	 countpostrecalcovariates.knownSites = List("/resources/rods/dbsnp_132.b37.vcf")
    	
    	 add(realignertargetcreator,indelrealigner, samtoolsindex, countcovariates, tablerecalibration, countpostrecalcovariates)
    	}
    }
    Last edited by Rocketknight; 03-01-2012, 09:34 AM.

  • #2
    Okay, after more investigation I'm still clueless. The critical function that isn't parallelising properly is RealignerTargetCreator. The scatter is done correctly, but then the individual jobs are only done one at a time. Any advice?



    • #3
      Have you tried the -jobRunner option and a supported batch queuing system?



      "The job runner to dispatch jobs. Setting to Lsf706, GridEngine, or Drmaa will dispatch jobs to LSF or Grid Engine using the job settings (see below). Defaults to Shell which runs jobs on a local shell one at a time."

      You might want to try the GridEngine support:



      You don't actually need a multi-node grid to install SGE and Drmaa (and they should be standard packages in Linux distro repositories) - GATK Queue will run with them on a single box with multiple cores. SGE is a bit complicated to configure, though!
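As a minimal sketch of the suggestion above, the command line from the first post can be extended with the -jobRunner option named in the quoted documentation. The helper name `queue_command` is hypothetical, the paths are simply the ones from the first post, and `GridEngine` is one of the values the quote lists; whether it actually dispatches jobs depends on having a working SGE and DRMAA install.

```python
import shlex


def queue_command(job_runner=None):
    """Build the Queue command line from the first post.

    job_runner is passed to -jobRunner if given; leaving it as None keeps
    the default, which the quoted docs describe as the Shell runner.
    """
    cmd = [
        "java", "-Djava.io.tmpdir=/scratch/queuetmp",
        "-jar", "/resources/Sting/dist/Queue.jar",
        "-S", "/apps/pipeline/script.queue",
        "-I", "/scratch/tmp/testsample.dedup.bam",
        "-G", "/resources/genome/human_g1k_v37.fa",
        "-run",
    ]
    if job_runner is not None:
        cmd += ["-jobRunner", job_runner]
    return cmd


# Print a copy-pasteable command that asks Queue to dispatch via Grid Engine.
print(shlex.join(queue_command("GridEngine")))
```

Building the command as a list rather than a string keeps it safe to hand to `subprocess.run` later without shell quoting problems.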



      • #4
        Aha, good find. That seems to suggest that not setting a job runner means jobs will only get executed one at a time in the shell, which would make sense except some of the tasks I set ran in parallel just fine.

        Ah well, I'll give it a go! Though at this point I'm kind of tempted just to hack together a pipeline script in Python and ignore GATK-Queue entirely.



        • #5
          I'm guessing that the tasks that did run in parallel were using walkers that support GATK's 'shared memory parallelism' - i.e., the same ones that support the '-nt' thread option on the command line. I don't think RealignerTargetCreator is compatible with this mode, which leaves you with a serial bottleneck. Broad's solution to this is scatter/gather parallelism, but GATK Queue requires a working job scheduler to make this happen.
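To make the scatter/gather idea concrete, here is a minimal Python sketch that does not involve Queue or a job scheduler at all: split the work into chunks, run one process per chunk with a bounded number running at once, and collect the exit codes. The function names `scatter` and `run_parallel` are hypothetical, and the commands in the usage lines are stand-ins; a real pipeline would launch one GATK process per chunk, restricted to that chunk's intervals.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def scatter(items, n):
    """Split items into n contiguous, near-equal chunks (some may be empty)."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_parallel(commands, max_workers=4):
    """Run each command as a subprocess, at most max_workers at a time.

    Returns the exit codes in the same order as the input commands.
    """
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Usage: scatter chromosomes into 4 chunks, then run one stand-in job per chunk.
chromosomes = [str(c) for c in range(1, 23)] + ["X", "Y"]
chunks = scatter(chromosomes, 4)
jobs = [[sys.executable, "-c", "pass"] for _ in chunks]  # placeholder commands
exit_codes = run_parallel(jobs)
```

Threads are enough here because each worker only waits on an external process; the gather step (merging the per-chunk outputs) is tool-specific and is not shown.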



          I have successfully run the DataProcessingPipeline GATK Queue script with a couple of exome bam files on a local multi-core box with SGE and DRMAA installed (but started with the GATK Queue for SGE 'Hello World' script first to debug the installation). I've run other stuff directly on SGE using qsub, so it's a useful thing to have even if you don't use GATK Queue. Someone has some tips here:

          I installed the torque queuing system on my local machine (quad core, so if I give it 3 slots I can still work normally with one core reserved for firefox, a media player, a text editor, a compiler…


          Once you have a working SGE queue (and DRMAA) set up, GATK Queue should handle communication with it automatically. You need to be careful to have SGE queue and host configuration limits that won't allow GridEngine to consume all your RAM and swap (my config still needs some tweaking to do real work without bringing the machine to a grinding halt!).

          Since you mention Python, this also might be of interest:

          Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments


          (I haven't tried it, and it also seems to require SGE).



          • #6
            It's the other way around, actually. RealignerTargetCreator accepts the -nt thread option (but shows a slowdown rather than a speedup for me when I actually try to use it), but IndelRealigner doesn't. I don't think GATK-Queue ever uses the shared memory system unless you explicitly specify it; it always seems to go for scatter/gather.

            Ah well, I'll just throw a job scheduler on there and see what happens.



            • #7
              Oh well, there goes that theory. It would be interesting to hear how you get on with this - I don't think that many people outside Broad are using GATK-Queue.



              • #8
                Yeah, I'm starting to get the feeling that it's a little bit overkill without a massive compute farm.



                • #9
                  Installing SGE on Debian 6 is turning out to be an enormous amount of bizarrely font-related stress (it's demanding a lot of X11 fonts that aren't installed by default, and no amount of non-free font packages seems to appease it).

                  I'm trying scatter-gathering manually in Python instead at the moment, which is working fine up to the CountCovariates step of base quality score recalibration, because the way they calculate empirical quality in the output CSV file is kinda weird. I'll need to recalculate it when I'm combining multiple scattered outputs, and I want my output to be as similar as possible to the output I'd get from running GATK serially.

                  Trying -10*log10(mismatches/total observations) gives me the right answer about 75% of the time, but their algorithm behaves weirdly in a lot of cases, for reasons beyond minor log algorithm/rounding differences, and I can't work out from their source why that's the case.

                  The worst bit is that parallelizing this step will only shave 20 minutes off the pipeline run. I should really just run it serially, but that would be quitting.

                  If you're curious, here's their source:
                  Code:
                      //---------------------------------------------------------------------------------------------------------------
                      //
                      // methods to derive empirical quality score
                      //
                      //---------------------------------------------------------------------------------------------------------------
                  
                      public final double empiricalQualDouble( final int smoothing, final double maxQual ) {
                          final double doubleMismatches = (double) ( numMismatches + smoothing );
                          final double doubleObservations = (double) ( numObservations + smoothing );
                          double empiricalQual = -10 * Math.log10(doubleMismatches / doubleObservations);
                          if (empiricalQual > maxQual) { empiricalQual = maxQual; }
                          return empiricalQual;
                      }
                      public final double empiricalQualDouble() { return empiricalQualDouble( 0, QualityUtils.MAX_REASONABLE_Q_SCORE ); } // 'default' behavior is to use smoothing value of zero
                  
                      public final byte empiricalQualByte( final int smoothing ) {
                          final double doubleMismatches = (double) ( numMismatches + smoothing );
                          final double doubleObservations = (double) ( numObservations + smoothing );
                          return QualityUtils.probToQual( 1.0 - doubleMismatches / doubleObservations ); // This is capped at Q40
                      }
                      public final byte empiricalQualByte() { return empiricalQualByte( 0 ); } // 'default' behavior is to use smoothing value of zero
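For the gather step, a minimal Python sketch under stated assumptions: each scattered output is taken to be a mapping from a covariate key to an (observations, mismatches) pair, the counts are summed across shards, and the quality is then recomputed from the summed counts following `empiricalQualDouble` in the Java above. The names `merge_counts` and `empirical_qual`, the dictionary layout, and the default cap of 40 are assumptions made for illustration; this does not reproduce whatever rounding `QualityUtils.probToQual` applies in the byte version.

```python
import math


def merge_counts(shards):
    """Sum (observations, mismatches) per covariate key across scattered outputs."""
    merged = {}
    for shard in shards:
        for key, (observations, mismatches) in shard.items():
            total_obs, total_mm = merged.get(key, (0, 0))
            merged[key] = (total_obs + observations, total_mm + mismatches)
    return merged


def empirical_qual(mismatches, observations, smoothing=0, max_qual=40.0):
    """-10 * log10 of the smoothed mismatch rate, capped at max_qual.

    max_qual=40.0 is an assumed default, not a value read from the GATK source.
    """
    mm = mismatches + smoothing
    obs = observations + smoothing
    if obs == 0:
        return float("nan")  # 0/0 is undefined; the Java double math also yields NaN
    if mm == 0:
        return max_qual      # log10(0) is -infinity, so the result hits the cap
    return min(-10.0 * math.log10(mm / obs), max_qual)


# Two shards reporting the same covariate key (keys here are made up).
shard_a = {("rg1", 30): (1000, 10)}  # Q20 on its own
shard_b = {("rg1", 30): (9000, 9)}   # Q30 on its own
obs, mm = merge_counts([shard_a, shard_b])[("rg1", 30)]
print(round(empirical_qual(mm, obs), 2))  # 27.21
```

The point of the example is that the merged quality (27.21) is not the average of the per-shard qualities (25), so the empirical quality column has to be recalculated from the summed counts rather than combined directly.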



                  • #10
                    Originally posted by Rocketknight:
                    Installing SGE on Debian 6 is turning out to be an enormous amount of bizarrely font-related stress (it's demanding a lot of X11 fonts that aren't installed by default, and no amount of non-free font packages seems to appease it).
                    Yes, what is it with SGE and fonts?! The witchcraft that eventually worked for me (with a recent Ubuntu) was installing the set of font packages listed in section 8.4.1 of this blog post (then rebooting):



                    Don't know if this is the same issue with Debian (or if the packages are equivalent).



                    • #11
                      I found that! I tracked down those packages in the end but it still wouldn't launch. I suppose I should keep tinkering with that rather than attempting to reinvent the wheel with my Python gather algorithms.
