Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to use Celera assembler with HGAP

    Dear all,

    I am trying to run HGAP 2 over my dataset using locally installed smrtanalysis package with following modules:

    P_Fetch
    P_Filter
    P_FilterReports
    P_PreAssemblerDagcon (HGAP.2)
    --------------------
    * P_CeleraAssembler
    * P_Mapping
    * P_AssemblyPolishing

    I cannot figure out how to run Celera Assembler. I am running computation on only one node with enough RAM and space.

    Using <module id="P_CeleraAssembler" label="CeleraAssembler v1" editableInJob="true">

    with corresponding parameters yields into:

    [ERROR] 2014-01-12 15:39:24,088 [pbpy.smrtpipe.SmrtPipeMain _loadModules 180] Could not determine a libraryName. No jobId supplied and no 'libraryName' setting found

    Could you please help me or point me to the complete example of HGAP pipeline with listed all the modules needed?

    Thanks a million.

  • #2
    How are you generating your xml? The easiest way to generate a valid xml is to use one from a job generated via the SMRT Portal interface.
    Example of a valid HGAP.2 xml settings file, SMRT Analysis 2.1:
    Code:
      <?xml version="1.0" encoding="UTF-8" standalone="yes" ?> 
      <smrtpipeSettings>
      <protocol version="2.1.0" id="RS_HGAP_Assembly.2" editable="false">
      <param name="name" label="Protocol Name">
      <value>RS_HGAP_Assembly</value> 
      <input type="text" /> 
      <rule required="true" /> 
      </param>
      <param name="description">
      <value>DAGCON-based HGAP (Hierarchical Genome Assembly Process) performs high quality de novo assembly using a single PacBio library prep. HGAP consists of pre-assembly, de novo assembly with Celera Assembler, and assembly polishing with Quiver. This workflow uses a much faster DAG-based consensus approach for pre-assembly.</value> 
      <textarea /> 
      </param>
      <param name="version" hidden="true">
      <value>2</value> 
      <input type="text" /> 
      <rule type="digits" required="true" min="1.0" /> 
      </param>
      <param name="state" hidden="true">
      <value>active</value> 
      <input value="active" type="radio" /> 
      <input value="inactive" type="radio" /> 
      </param>
      <param name="otfReference" hidden="true">
      <value>reference</value> 
      </param>
      <param name="deferRefCheck" hidden="true">
      <value>True</value> 
      </param>
      <param name="fetch" hidden="true">
      <value>common/protocols/preprocessing/Fetch.1.xml</value> 
      </param>
      <param name="filtering">
      <value>common/protocols/filtering/PreAssemblerSFilter.1.xml</value> 
      <select multiple="true">
      <import extension="xml" contentType="text/directory">common/protocols/filtering</import> 
      </select>
      </param>
      <param name="assembly">
      <value>common/protocols/assembly/PreAssemblerHGAP.2.xml</value> 
      <value>common/protocols/assembly/CeleraAssemblerHGAP.2.xml</value> 
      <select multiple="true">
      <import extension="xml" contentType="text/directory">common/protocols/assembly</import> 
      </select>
      </param>
      <param name="referenceUploader" hidden="true">
      <value>common/protocols/referenceuploader/ReferenceUploaderHGAP.1.xml</value> 
      </param>
      <param name="mapping">
      <value>common/protocols/mapping/BLASR.1.xml</value> 
      <select multiple="true">
      <import extension="xml" contentType="text/directory">common/protocols/mapping</import> 
      </select>
      </param>
      <param name="consensus">
      <value>common/protocols/consensus/AssemblyPolishing.1.xml</value> 
      <select multiple="true">
      <import extension="xml" contentType="text/directory">common/protocols/consensus</import> 
      </select>
      </param>
      </protocol>
      <moduleStage name="fetch" editable="true">
      <module label="Fetch v1" id="P_Fetch" editableInJob="true">
      <description>Sets up inputs</description> 
      </module>
      </moduleStage>
      <moduleStage name="filtering" editable="true">
      <module label="PreAssembler Filter v1" id="P_Filter" editableInJob="true">
      <description>Filter reads for use in the pre-assembly step of HGAP, the hierarchical genome assembly process.</description> 
      <param name="minSubReadLength" label="Minimum Subread Length">
      <title>The minimum subread length. Shorter subreads will be filtered and excluded from further analysis.</title> 
      <value>500</value> 
      <input type="text" size="3" /> 
      <rule type="number" min="0.0" message="Value must be a positive integer" /> 
      </param>
      <param name="readScore" label="Minimum Polymerase Read Quality">
      <title>The minimum polymerase read quality determines the quality cutoff. Polymerase reads with lower quality will be filtered and excluded from further analysis.</title> 
      <value>0.80</value> 
      <input type="text" size="3" /> 
      <rule type="number" min="0.0" message="Value must be between 0 and 1" max="1.0" /> 
      </param>
      <param name="minLength" label="Minimum Polymerase Read Length">
      <title>The minimum polymerase read length. Shorter polymerase reads will be excluded from further analysis.</title> 
      <value>100</value> 
      <input type="text" size="3" /> 
      <rule type="number" min="0.0" message="Value must be a positive integer" /> 
      </param>
      </module>
      <module label="PreAssemblerSFilter Reports v1" id="P_FilterReports" editableInJob="false" /> 
      </moduleStage>
      <moduleStage name="assembly" editable="true">
      <module label="PreAssembler v2" id="P_PreAssemblerDagcon" editableInJob="true">
      <title>Using DAG-based consensus algorithm, pre-assemble long reads as the first step of the Hierarchical Genome Assembly process (HGAP). Version 2 is a stepping stone for scaling to much larger genomes.</title> 
      <param name="computeLengthCutoff" label="Compute Minimum Seed Read Length" editable="true">
      <title>Compute "Minimum Seed Read Length"</title> 
      <value>True</value> 
      <input type="checkbox" /> 
      </param>
      <param name="minLongReadLength" label="Minimum Seed Read Length">
      <title>Minimum length of reads to use as seeds for pre-assembly</title> 
      <value>3000</value> 
      <input type="text" /> 
      <rule type="digits" min="1.0" message="Value must be an integer between 1 and 100000" max="100000.0" /> 
      </param>
      <param name="targetChunks" label="Split Target into Chunks">
      <value>1</value> 
      </param>
      <param name="splitBestn" label="Alignment Candidates Per Chunk">
      <value>24</value> 
      </param>
      <param name="totalBestn">
      <value>24</value> 
      </param>
      <param name="blasrOpts" label="BLASR Options (Advanced)">
      <title>The -bestn and -nCandidates options should be approximately equal to the expected seed read coverage</title> 
      <value>-noSplitSubreads -minReadLength 200 -maxScore -1000 -maxLCPLength 16</value> 
      <input type="text" /> 
      </param>
      </module>
      <module label="CeleraAssembler v1" id="P_CeleraAssembler" editableInJob="true">
      <description>This module wraps the Celera Assembler v7.0</description> 
      <param name="runCA" hidden="true">
      <value>False</value> 
      </param>
      <param name="asm2afg" hidden="true">
      <value>False</value> 
      </param>
      <param name="castats" hidden="true">
      <value>False</value> 
      </param>
      <param name="afg2bank" hidden="true">
      <value>False</value> 
      </param>
      <param name="runBank2CmpH5" hidden="true">
      <value>False</value> 
      </param>
      <param name="assemblyBnkReport" hidden="true">
      <value>False</value> 
      </param>
      <param name="sortCmpH5" hidden="true">
      <value>False</value> 
      </param>
      <param name="gzipGff" hidden="true">
      <value>False</value> 
      </param>
      <param name="genomeSize" label="Genome Size (bp)">
      <title>Approximate genome size in base pairs</title> 
      <value>6000</value> 
      <input type="text" /> 
      <rule type="digits" required="true" min="1.0" message="Must be a value between 1 and 150000000" max="1.5E8" /> 
      </param>
      <param name="libraryName" hidden="true">
      <value>pacbioReads</value> 
      </param>
      <param name="genFrgFile" hidden="true">
      <value>True</value> 
      </param>
      <param name="defaultFrgMinLen" hidden="true">
      <value>500</value> 
      <input type="text" /> 
      </param>
      <param name="asmWatchTime" hidden="true">
      <title>Seconds to wait for runCA outputs to be copied into job dir.</title> 
      <value>600</value> 
      </param>
      <param name="xCoverage" label="Target Coverage">
      <title>Fold coverage to target when picking frgMinLen for assembly. Typically 15 to 25.</title> 
      <value>15</value> 
      <input type="text" /> 
      <rule type="digits" min="10.0" message="Value must be an integer between 10 and 30, inclusive" max="30.0" /> 
      </param>
      <param name="ovlErrorRate" label="Overlapper error rate">
      <title>Overlapper error rate</title> 
      <value>0.06</value> 
      <input type="text" /> 
      <rule type="number" message="Value must be numeric" /> 
      </param>
      <param name="ovlMinLen" label="Overlapper min length">
      <title>Overlaps shorter than this length are not computed.</title> 
      <value>40</value> 
      <input type="text" /> 
      <rule type="digits" message="Value must be an integer" /> 
      </param>
      <param name="merSize" label="Overlapper k-mer">
      <title>Sets the length of the seeds used by the seed and extend algorithm.</title> 
      <value>14</value> 
      <input type="text" /> 
      <rule type="digits" message="Value must be an integer" /> 
      </param>
      <param name="specInRunCA" label="Pre-defined spec file">
      <title>Enter the server path to an existing spec file</title> 
      <input type="text" /> 
      <rule required="false" remote="api/protocols/resource-exists?paramName=specInRunCA" message="File does not exist" /> 
      </param>
      <param name="maxSlotPerc" hidden="true">
      <value>1</value> 
      </param>
      </module>
      </moduleStage>
      <moduleStage name="referenceUploader" editable="false">
      <module id="P_ReferenceUploader" editableInJob="false">
      <param name="runUploaderHgap">
      <value>True</value> 
      </param>
      <param name="runUploader">
      <value>False</value> 
      </param>
      <param name="name">
      <value>reference</value> 
      </param>
      <param name="organism" /> 
      <param name="ploidy" /> 
      <param name="user" /> 
      <param name="sawriter">
      <value>sawriter -blt 8 -welter</value> 
      </param>
      <param name="gatkDict">
      <value>createSequenceDictionary</value> 
      </param>
      <param name="samIdx">
      <value>samtools faidx</value> 
      </param>
      </module>
      </moduleStage>
      <moduleStage name="mapping" editable="true">
      <module label="BLASR v1" id="P_Mapping" editableInJob="true">
      <description>BLASR maps reads to genomes by finding the highest scoring local alignment or set of local alignments between the read and the genome. The first set of alignments is found by querying an index of the reference genome, and then refining until only high scoring alignments are retained. Additional pulse metrics are loaded into the resulting cmp.h5 file to enable downstream use of the Quiver algorithm.</description> 
      <param name="maxHits" label="Maximum number of hits per read" hidden="true">
      <title>The maximum number of matches of each read to the reference sequence that will be evaluated. maxHits should be greater than the expected number of repeats if you want to spread hits out on the genome.</title> 
      <value>10</value> 
      <input type="text" /> 
      <rule type="digits" message="Value must be an integer between 0 and 1000" /> 
      </param>
      <param name="maxDivergence" label="Maximum divergence (%)">
      <title>The maximum allowed divergence of a read from the reference sequence.</title> 
      <value>30</value> 
      <input type="text" /> 
      <rule type="digits" message="Value must be an integer between 0 and 100" /> 
      </param>
      <param name="minAnchorSize" label="Minimum anchor size">
      <title>The minimum anchor size defines the length of the read that must match against the reference sequence.</title> 
      <value>12</value> 
      <input type="text" /> 
      <rule type="digits" message="Value must be an integer between 8 and 30" /> 
      </param>
      <param name="samBam" label="Write output as a BAM file">
      <value>True</value> 
      <input type="checkbox" /> 
      </param>
      <param name="gff2Bed" label="Write BED coverage file">
      <value>True</value> 
      <input type="checkbox" /> 
      </param>
      <param name="placeRepeatsRandomly" label="Place repeats randomly">
      <value>True</value> 
      <input type="checkbox" /> 
      </param>
      <param name="align_opts" hidden="true">
      <value>--seed=1 --minAccuracy=0.75 --minLength=50 --useQuality</value> 
      </param>
      <param name="pulseMetrics" hidden="true">
      <value>DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag</value> 
      </param>
      <param name="loadPulsesOpts" hidden="true">
      <title>The default option of loadPulses is 'byread'. Option 'bymetric' is desined to sacrifice memory for increased speed, especially for jobs of which the number of reference contigs is large.</title> 
      <value>bymetric</value> 
      </param>
      </module>
      <module label="BLASR Reports v1" id="P_MappingReports" editableInJob="false" /> 
      </moduleStage>
      <moduleStage name="consensus" editable="true">
      <module label="AssemblyPolishing v1 (Quiver)" id="P_AssemblyPolishing" editableInJob="true">
      <description>Polish a pure-PacBio assembly for maximum accuracy using the Quiver algorithm.</description> 
      <param name="enableMapQVFilter" label="Use only unambiguously mapped reads">
      <title>Filter out reads with Map QV less than 10. Coverage in repeat regions shorter than the read length will be reduced.</title> 
      <value>True</value> 
      <input type="checkbox" /> 
      </param>
      </module>
      </moduleStage>
      <fileName>RS_HGAP_Assembly.2.xml</fileName> 
      </smrtpipeSettings>

    Comment


    • #3
      settings.xml when not using portal

      Thank you very much! I generated my settings.xml for HGAP by hand following these steps:

      * P_PreAssemblerDagcon (HGAP.2)
      * P_CeleraAssembler
      * P_Mapping
      * P_AssemblyPolishing
      based on this manual: https://github.com/PacificBioscience...nce-Guide-v2.1

      I don't have portal installed, just command-line tools. If this is the case, can I use settings.xml generated by portal? I used the one you posted but it did not work

      [CRITICAL] 2014-01-14 11:30:54,067 hard exiting smrtpipe v1.79.127046 with returncode -1 in 0.01 sec (0.00 min).

      log says:
      [ERROR] 2014-01-14 11:10:25,117 [pbpy.smrtpipe.SmrtPipeSettings load 107] Error while processing --params file: settings.xml
      [ERROR] 2014-01-14 11:10:25,117 [pbpy.smrtpipe.SmrtPipeMain run 581] XML or text declaration not at start of entity: line 1, column 2

      What's the proper way of composing settings.xml when not using portal?

      Also, I would like to run HGAP with demo data and demo settings.xml so that I could be sure that my installation is working and possible errors are because of characters of my input data and/or wrong user settings. Is there something like this? I have seen some e.coli dataset released but not corresponding settings file.

      Thank you very much for the answers.

      Comment


      • #4
        Sorry, that was a problem with my pasting of the xml, not sure where the ' ' at the beginning of every line came from, the correct xml:
        Code:
        <?xml version="1.0" encoding="UTF-8" standalone="yes" ?> 
        <smrtpipeSettings>
        <protocol version="2.1.0" id="RS_HGAP_Assembly.2" editable="false">
        <param name="name" label="Protocol Name">
        <value>RS_HGAP_Assembly</value> 
        <input type="text" /> 
        <rule required="true" /> 
        </param>
        <param name="description">
        <value>DAGCON-based HGAP (Hierarchical Genome Assembly Process) performs high quality de novo assembly using a single PacBio library prep. HGAP consists of pre-assembly, de novo assembly with Celera Assembler, and assembly polishing with Quiver. This workflow uses a much faster DAG-based consensus approach for pre-assembly.</value> 
        <textarea /> 
        </param>
        <param name="version" hidden="true">
        <value>2</value> 
        <input type="text" /> 
        <rule type="digits" required="true" min="1.0" /> 
        </param>
        <param name="state" hidden="true">
        <value>active</value> 
        <input value="active" type="radio" /> 
        <input value="inactive" type="radio" /> 
        </param>
        <param name="otfReference" hidden="true">
        <value>reference</value> 
        </param>
        <param name="deferRefCheck" hidden="true">
        <value>True</value> 
        </param>
        <param name="fetch" hidden="true">
        <value>common/protocols/preprocessing/Fetch.1.xml</value> 
        </param>
        <param name="filtering">
        <value>common/protocols/filtering/PreAssemblerSFilter.1.xml</value> 
        <select multiple="true">
        <import extension="xml" contentType="text/directory">common/protocols/filtering</import> 
        </select>
        </param>
        <param name="assembly">
        <value>common/protocols/assembly/PreAssemblerHGAP.2.xml</value> 
        <value>common/protocols/assembly/CeleraAssemblerHGAP.2.xml</value> 
        <select multiple="true">
        <import extension="xml" contentType="text/directory">common/protocols/assembly</import> 
        </select>
        </param>
        <param name="referenceUploader" hidden="true">
        <value>common/protocols/referenceuploader/ReferenceUploaderHGAP.1.xml</value> 
        </param>
        <param name="mapping">
        <value>common/protocols/mapping/BLASR.1.xml</value> 
        <select multiple="true">
        <import extension="xml" contentType="text/directory">common/protocols/mapping</import> 
        </select>
        </param>
        <param name="consensus">
        <value>common/protocols/consensus/AssemblyPolishing.1.xml</value> 
        <select multiple="true">
        <import extension="xml" contentType="text/directory">common/protocols/consensus</import> 
        </select>
        </param>
        </protocol>
        <moduleStage name="fetch" editable="true">
        <module label="Fetch v1" id="P_Fetch" editableInJob="true">
        <description>Sets up inputs</description> 
        </module>
        </moduleStage>
        <moduleStage name="filtering" editable="true">
        <module label="PreAssembler Filter v1" id="P_Filter" editableInJob="true">
        <description>Filter reads for use in the pre-assembly step of HGAP, the hierarchical genome assembly process.</description> 
        <param name="minSubReadLength" label="Minimum Subread Length">
        <title>The minimum subread length. Shorter subreads will be filtered and excluded from further analysis.</title> 
        <value>500</value> 
        <input type="text" size="3" /> 
        <rule type="number" min="0.0" message="Value must be a positive integer" /> 
        </param>
        <param name="readScore" label="Minimum Polymerase Read Quality">
        <title>The minimum polymerase read quality determines the quality cutoff. Polymerase reads with lower quality will be filtered and excluded from further analysis.</title> 
        <value>0.80</value> 
        <input type="text" size="3" /> 
        <rule type="number" min="0.0" message="Value must be between 0 and 1" max="1.0" /> 
        </param>
        <param name="minLength" label="Minimum Polymerase Read Length">
        <title>The minimum polymerase read length. Shorter polymerase reads will be excluded from further analysis.</title> 
        <value>100</value> 
        <input type="text" size="3" /> 
        <rule type="number" min="0.0" message="Value must be a positive integer" /> 
        </param>
        </module>
        <module label="PreAssemblerSFilter Reports v1" id="P_FilterReports" editableInJob="false" /> 
        </moduleStage>
        <moduleStage name="assembly" editable="true">
        <module label="PreAssembler v2" id="P_PreAssemblerDagcon" editableInJob="true">
        <title>Using DAG-based consensus algorithm, pre-assemble long reads as the first step of the Hierarchical Genome Assembly process (HGAP). Version 2 is a stepping stone for scaling to much larger genomes.</title> 
        <param name="computeLengthCutoff" label="Compute Minimum Seed Read Length" editable="true">
        <title>Compute "Minimum Seed Read Length"</title> 
        <value>True</value> 
        <input type="checkbox" /> 
        </param>
        <param name="minLongReadLength" label="Minimum Seed Read Length">
        <title>Minimum length of reads to use as seeds for pre-assembly</title> 
        <value>3000</value> 
        <input type="text" /> 
        <rule type="digits" min="1.0" message="Value must be an integer between 1 and 100000" max="100000.0" /> 
        </param>
        <param name="targetChunks" label="Split Target into Chunks">
        <value>1</value> 
        </param>
        <param name="splitBestn" label="Alignment Candidates Per Chunk">
        <value>24</value> 
        </param>
        <param name="totalBestn">
        <value>24</value> 
        </param>
        <param name="blasrOpts" label="BLASR Options (Advanced)">
        <title>The -bestn and -nCandidates options should be approximately equal to the expected seed read coverage</title> 
        <value>-noSplitSubreads -minReadLength 200 -maxScore -1000 -maxLCPLength 16</value> 
        <input type="text" /> 
        </param>
        </module>
        <module label="CeleraAssembler v1" id="P_CeleraAssembler" editableInJob="true">
        <description>This module wraps the Celera Assembler v7.0</description> 
        <param name="runCA" hidden="true">
        <value>False</value> 
        </param>
        <param name="asm2afg" hidden="true">
        <value>False</value> 
        </param>
        <param name="castats" hidden="true">
        <value>False</value> 
        </param>
        <param name="afg2bank" hidden="true">
        <value>False</value> 
        </param>
        <param name="runBank2CmpH5" hidden="true">
        <value>False</value> 
        </param>
        <param name="assemblyBnkReport" hidden="true">
        <value>False</value> 
        </param>
        <param name="sortCmpH5" hidden="true">
        <value>False</value> 
        </param>
        <param name="gzipGff" hidden="true">
        <value>False</value> 
        </param>
        <param name="genomeSize" label="Genome Size (bp)">
        <title>Approximate genome size in base pairs</title> 
        <value>4600000</value> 
        <input type="text" /> 
        <rule type="digits" required="true" min="1.0" message="Must be a value between 1 and 150000000" max="1.5E8" /> 
        </param>
        <param name="libraryName" hidden="true">
        <value>pacbioReads</value> 
        </param>
        <param name="genFrgFile" hidden="true">
        <value>True</value> 
        </param>
        <param name="defaultFrgMinLen" hidden="true">
        <value>500</value> 
        <input type="text" /> 
        </param>
        <param name="asmWatchTime" hidden="true">
        <title>Seconds to wait for runCA outputs to be copied into job dir.</title> 
        <value>600</value> 
        </param>
        <param name="xCoverage" label="Target Coverage">
        <title>Fold coverage to target when picking frgMinLen for assembly. Typically 15 to 25.</title> 
        <value>15</value> 
        <input type="text" /> 
        <rule type="digits" min="10.0" message="Value must be an integer between 10 and 30, inclusive" max="30.0" /> 
        </param>
        <param name="ovlErrorRate" label="Overlapper error rate">
        <title>Overlapper error rate</title> 
        <value>0.06</value> 
        <input type="text" /> 
        <rule type="number" message="Value must be numeric" /> 
        </param>
        <param name="ovlMinLen" label="Overlapper min length">
        <title>Overlaps shorter than this length are not computed.</title> 
        <value>40</value> 
        <input type="text" /> 
        <rule type="digits" message="Value must be an integer" /> 
        </param>
        <param name="merSize" label="Overlapper k-mer">
        <title>Sets the length of the seeds used by the seed and extend algorithm.</title> 
        <value>14</value> 
        <input type="text" /> 
        <rule type="digits" message="Value must be an integer" /> 
        </param>
        <param name="specInRunCA" label="Pre-defined spec file">
        <title>Enter the server path to an existing spec file</title> 
        <input type="text" /> 
        <rule required="false" remote="api/protocols/resource-exists?paramName=specInRunCA" message="File does not exist" /> 
        </param>
        <param name="maxSlotPerc" hidden="true">
        <value>1</value> 
        </param>
        </module>
        </moduleStage>
        <moduleStage name="referenceUploader" editable="false">
        <module id="P_ReferenceUploader" editableInJob="false">
        <param name="runUploaderHgap">
        <value>True</value> 
        </param>
        <param name="runUploader">
        <value>False</value> 
        </param>
        <param name="name">
        <value>reference</value> 
        </param>
        <param name="organism" /> 
        <param name="ploidy" /> 
        <param name="user" /> 
        <param name="sawriter">
        <value>sawriter -blt 8 -welter</value> 
        </param>
        <param name="gatkDict">
        <value>createSequenceDictionary</value> 
        </param>
        <param name="samIdx">
        <value>samtools faidx</value> 
        </param>
        </module>
        </moduleStage>
        <moduleStage name="mapping" editable="true">
        <module label="BLASR v1" id="P_Mapping" editableInJob="true">
        <description>BLASR maps reads to genomes by finding the highest scoring local alignment or set of local alignments between the read and the genome. The first set of alignments is found by querying an index of the reference genome, and then refining until only high scoring alignments are retained. Additional pulse metrics are loaded into the resulting cmp.h5 file to enable downstream use of the Quiver algorithm.</description> 
        <param name="maxHits" label="Maximum number of hits per read" hidden="true">
        <title>The maximum number of matches of each read to the reference sequence that will be evaluated. maxHits should be greater than the expected number of repeats if you want to spread hits out on the genome.</title> 
        <value>10</value> 
        <input type="text" /> 
        <rule type="digits" message="Value must be an integer between 0 and 1000" /> 
        </param>
        <param name="maxDivergence" label="Maximum divergence (%)">
        <title>The maximum allowed divergence of a read from the reference sequence.</title> 
        <value>30</value> 
        <input type="text" /> 
        <rule type="digits" message="Value must be an integer between 0 and 100" /> 
        </param>
        <param name="minAnchorSize" label="Minimum anchor size">
        <title>The minimum anchor size defines the length of the read that must match against the reference sequence.</title> 
        <value>12</value> 
        <input type="text" /> 
        <rule type="digits" message="Value must be an integer between 8 and 30" /> 
        </param>
        <param name="samBam" label="Write output as a BAM file">
        <value>True</value> 
        <input type="checkbox" /> 
        </param>
        <param name="gff2Bed" label="Write BED coverage file">
        <value>True</value> 
        <input type="checkbox" /> 
        </param>
        <param name="placeRepeatsRandomly" label="Place repeats randomly">
        <value>True</value> 
        <input type="checkbox" /> 
        </param>
        <param name="align_opts" hidden="true">
        <value>--seed=1 --minAccuracy=0.75 --minLength=50 --useQuality</value> 
        </param>
        <param name="pulseMetrics" hidden="true">
        <value>DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag</value> 
        </param>
        <param name="loadPulsesOpts" hidden="true">
        <title>The default option of loadPulses is 'byread'. Option 'bymetric' is desined to sacrifice memory for increased speed, especially for jobs of which the number of reference contigs is large.</title> 
        <value>bymetric</value> 
        </param>
        </module>
        <module label="BLASR Reports v1" id="P_MappingReports" editableInJob="false" /> 
        </moduleStage>
        <moduleStage name="consensus" editable="true">
        <module label="AssemblyPolishing v1 (Quiver)" id="P_AssemblyPolishing" editableInJob="true">
        <description>Polish a pure-PacBio assembly for maximum accuracy using the Quiver algorithm.</description> 
        <param name="enableMapQVFilter" label="Use only unambiguously mapped reads">
        <title>Filter out reads with Map QV less than 10. Coverage in repeat regions shorter than the read length will be reduced.</title> 
        <value>True</value> 
        <input type="checkbox" /> 
        </param>
        </module>
        </moduleStage>
        <fileName>RS_HGAP_Assembly.2.xml</fileName> 
        </smrtpipeSettings>
        this xml should be sufficient for processing the test ecoli dataset. Unfortunately it is not easy to generate settings files outside of SMRT Portal, while it is possible via the documentation that you reference, as you found out it isn't always easy. Once you have a working settings file it should be easy to change parameters and alter the workflow. When you get a job to run, take a look at <job folder>/workflow/Workflow.summary.html it shows a nice graph of the executed tasks, each of which is a separate shell command that can be found in <job folder>/workflow/<module>/<task>.sh all of which have corresponding log output in <job folder>/log/<module>/<task>.log.

        Comment


        • #5
          Thanks a million, your code worked!! I did get a nice summary about e.coli dataset.

          Comment


          • #6
            Nice work, rhall.

            But I'd make this observation: Monika and every other user still have the problem that the protocol files supplied with smrtanalysis can't be passed directly to smrtpipe: They need to be expanded by SMRTportal. That makes it difficult to run secondary analysis jobs from the command line or from scripts.

            It would be useful if the XML-expanding logic could be run as a standalone step.

            --TS

            Comment


            • #7
              Originally posted by SillyPoint View Post
              Nice work, rhall.

              But I'd make this observation: Monika and every other user still have the problem that the protocol files supplied with smrtanalysis can't be passed directly to smrtpipe: They need to be expanded by SMRTportal. That makes it difficult to run secondary analysis jobs from the command line or from scripts.

              It would be useful if the XML-expanding logic could be run as a standalone step.

              --TS
              Exactly. Now when 2.2.0 is out, I was wondering where to grab the settings.xml code for the newest HGAP.

              Comment


              • #8
                No problem,
                Code:
                <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
                <smrtpipeSettings>
                    <protocol version="2.2.0" id="RS_HGAP_Assembly_3.3" editable="false">
                        <param name="name" label="Protocol Name">
                            <value>RS_HGAP_Assembly_3</value>
                            <input type="text"/>
                            <rule required="true"/>
                        </param>
                        <param name="description">
                            <value>(BETA) HGAP version 3. PacBio de novo assembler optimized for speed.</value>
                            <textarea></textarea>
                        </param>
                        <param name="version" hidden="true">
                            <value>3</value>
                            <input type="text"/>
                            <rule type="digits" required="true" min="1.0"/>
                        </param>
                        <param name="state" hidden="true">
                            <value>active</value>
                            <input value="active" type="radio"/>
                            <input value="inactive" type="radio"/>
                        </param>
                        <param name="otfReference" hidden="true">
                            <value>reference</value>
                        </param>
                        <param name="deferRefCheck" hidden="true">
                            <value>True</value>
                        </param>
                        <param name="control" hidden="true">
                            <value></value>
                        </param>
                        <param name="fetch" hidden="true">
                            <value>common/protocols/preprocessing/Fetch.1.xml</value>
                        </param>
                        <param name="filtering">
                            <value>common/protocols/filtering/PreAssemblerSFilter.1.xml</value>
                            <select multiple="true">
                                <import extension="xml" contentType="text/directory">common/protocols/filtering</import>
                            </select>
                        </param>
                        <param name="Control Filtering" editableInJob="true">
                            <value>common/protocols/control/KeepControlReads.1.xml</value>
                            <select multiple="false">
                                <import extension="xml" contentType="text/directory">common/protocols/control</import>
                            </select>
                        </param>
                        <param name="assembly">
                            <value>common/protocols/assembly/PreAssemblerHGAP.3.xml</value>
                            <value>common/protocols/assembly/AssembleUnitig.1.xml</value>
                            <select multiple="true">
                                <import extension="xml" contentType="text/directory">common/protocols/assembly</import>
                            </select>
                        </param>
                        <param name="referenceUploader" hidden="true">
                            <value>common/protocols/referenceuploader/ReferenceUploaderUnitig.1.xml</value>
                        </param>
                        <param name="mapping">
                            <value>common/protocols/mapping/BLASR.1.xml</value>
                            <select multiple="true">
                                <import extension="xml" contentType="text/directory">common/protocols/mapping</import>
                            </select>
                        </param>
                        <param name="consensus">
                            <value>common/protocols/consensus/AssemblyPolishing.1.xml</value>
                            <select multiple="true">
                                <import extension="xml" contentType="text/directory">common/protocols/consensus</import>
                            </select>
                        </param>
                    </protocol>
                    <moduleStage name="fetch" editable="true">
                        <module label="Fetch v1" id="P_Fetch" editableInJob="true">
                            <description>Sets up inputs</description>
                        </module>
                    </moduleStage>
                    <moduleStage name="filtering" editable="true">
                        <module label="PreAssembler Filter v1" id="P_Filter" editableInJob="true">
                            <description>Filter reads for use in the pre-assembly step of HGAP, the hierarchical genome assembly process.</description>
                            <param name="minSubReadLength" label="Minimum Subread Length">
                                <value>500</value>
                                <title>Subreads shorter than this value (in base pairs) are filtered out and excluded from analysis.</title>
                                <input type="text" size="3"/>
                                <rule type="number" min="0.0" message="Value must be a positive integer"/>
                            </param>
                            <param name="readScore" label="Minimum Polymerase Read Quality">
                                <value>0.80</value>
                                <title>Polymerase reads with lower quality than this value are filtered out and excluded from analysis.</title>
                                <input type="text" size="3"/>
                                <rule type="number" min="0.0" message="Value must be between 0 and 1" max="1.0"/>
                            </param>
                            <param name="minLength" label="Minimum Polymerase Read Length">
                                <value>100</value>
                                <title>Polymerase reads shorter than this value (in base pairs) are filtered out and excluded from analysis.</title>
                                <input type="text" size="3"/>
                                <rule type="number" min="0.0" message="Value must be a positive integer"/>
                            </param>
                        </module>
                        <module label="PreAssemblerSFilter Reports v1" id="P_FilterReports" editableInJob="false"/>
                    </moduleStage>
                    <moduleStage name="Control Filtering" editable="true"/>
                    <moduleStage name="assembly" editable="true">
                        <module label="PreAssembler v2" id="P_PreAssemblerDagcon" editableInJob="true">
                            <title>Using DAG-based consensus algorithm, pre-assemble long reads as the first step of the Hierarchical Genome Assembly process (HGAP). Version 2 is a stepping stone for scaling to much larger genomes.</title>
                            <param name="computeLengthCutoff" label="Compute Minimum Seed Read Length" editable="true">
                                <value>True</value>
                                <title>Specify whether or not to compute the minimum seed read length that results in at least 30X target genome coverage, by the longest subreads. This is based on the genome size you specified.</title>
                                <input type="checkbox"/>
                            </param>
                            <param name="minLongReadLength" label="Minimum Seed Read Length">
                                <value>6000</value>
                                <title>The minimum length of reads (in base pairs) to use as seeds for pre-assembly.</title>
                                <input type="text"/>
                                <rule type="digits" required="true" min="1.0" message="Value must be an integer between 1 and 100000" max="100000.0"/>
                            </param>
                            <param name="targetChunks" label="Number of Seed Read Chunks">
                                <value>6</value>
                                <title>The number of pieces to split the data files into while running PreAssembler.</title>
                            </param>
                            <param name="splitBestn" label="Alignment Candidates Per Chunk">
                                <value>10</value>
                                <title>The number of alignments to consider for each read for a particular chunk.</title>
                            </param>
                            <param name="totalBestn" label="Total Alignment Candidates">
                                <value>24</value>
                                <title>The number of potential alignments BLASR should consider across all chunks for a particular read.</title>
                            </param>
                            <param name="minCorCov" label="Minimum Coverage for Correction">
                                <value>6</value>
                                <title>The minimum coverage to maintain correction for a read.  If the coverage falls below that threshold, the read will be broken at that juntion.</title>
                            </param>
                            <param name="blasrOpts" label="BLASR Options (Advanced)">
                                <value> -noSplitSubreads -minReadLength 200 -maxScore -1000 -maxLCPLength 16 </value>
                                <title>The -bestn and -nCandidates options should be approximately equal to the expected seed read coverage</title>
                                <input type="text"/>
                            </param>
                        </module>
                        <module label="AssembleUnitig v1" id="P_AssembleUnitig" editableInJob="true">
                            <description>This module runs Celera Assembler v8.1 to the unitig step, then finishes with our custom unitig consensus caller</description>
                            <param name="genomeSize" label="Genome Size (bp)">
                                <value>25000000</value>
                                <title>The approximate genome size, in base pairs.</title>
                                <input type="text"/>
                                <rule type="digits" required="true" min="1.0" message="Must be a value between 1 and 150000000" max="1.5E8"/>
                            </param>
                            <param name="libraryName" hidden="true">
                                <value>pacbioReads</value>
                            </param>
                            <param name="defaultFrgMinLen" hidden="true">
                                <value>500</value>
                                <input type="text"/>
                            </param>
                            <param name="xCoverage" label="Target Coverage">
                                <value>25</value>
                                <title>Fold coverage to target for when picking the minimum fragment length for assembly; typically 15 to 25.
                			</title>
                                <input type="text"/>
                                <rule type="digits" min="10.0" message="Value must be an integer between 10 and 30, inclusive" max="30.0"/>
                            </param>
                            <param name="ovlErrorRate" label="Overlapper error rate">
                                <value>0.06</value>
                                <title>Trimming and assembly overlaps above this error limit won't be detected.</title>
                                <input type="text"/>
                                <rule type="number" message="Value must be numeric"/>
                            </param>
                            <param name="ovlMinLen" label="Overlapper min length">
                                <value>40</value>
                                <title>Overlaps shorter than this length (in base pairs) are not computed.</title>
                                <input type="text"/>
                                <rule type="digits" message="Value must be an integer"/>
                            </param>
                            <param name="merSize" label="Overlapper k-mer">
                                <value>14</value>
                                <title>The length of the seeds (in base pairs) used by the seed-and-extend algorithm.</title>
                                <input type="text"/>
                                <rule type="digits" message="Value must be an integer"/>
                            </param>
                            <param name="specInRunCA" label="Pre-defined spec file">
                                <title>The path to an existing specification file used to run the assembly program.</title>
                                <input type="text"/>
                                <rule required="false" remote="api/protocols/resource-exists?paramName=specInRunCA" message="File does not exist"/>
                            </param>
                            <param name="maxSlotPerc" hidden="true">
                                <value>1</value>
                            </param>
                            <param name="specTmpl" hidden="true">
                                <value>analysis/etc/celeraAssembler/unitig.spec</value>
                            </param>
                        </module>
                    </moduleStage>
                    <moduleStage name="referenceUploader" editable="false">
                        <module id="P_ReferenceUploader" editableInJob="false">
                            <param name="runUploaderHgap">
                                <value>False</value>
                            </param>
                            <param name="runUploader">
                                <value>False</value>
                            </param>
                            <param name="runUploaderUnitig">
                                <value>True</value>
                            </param>
                            <param name="name">
                                <value>reference</value>
                            </param>
                            <param name="organism"/>
                            <param name="ploidy"/>
                            <param name="user"/>
                            <param name="sawriter">
                                <value>sawriter -blt 8 -welter</value>
                            </param>
                            <param name="samIdx">
                                <value>samtools faidx</value>
                            </param>
                        </module>
                    </moduleStage>
                    <moduleStage name="mapping" editable="true">
                        <module label="BLASR v1" id="P_Mapping" editableInJob="true">
                            <description>
                BLASR maps reads to genomes by finding the highest scoring local alignment or set of local alignments between the read and the genome. The first set of alignments is found by querying an index of the reference genome, and then refining until only high scoring alignments are retained.  Additional pulse metrics are loaded into the resulting cmp.h5 file to enable downstream use of the Quiver algorithm.
                    </description>
                            <param name="maxHits" label="Maximum number of hits per read" hidden="true">
                                <value>10</value>
                                <title>
                        The maximum number of matches of each read to the reference
                        sequence that will be evaluated. maxHits should be greater
                        than the expected number of repeats if you want to spread hits
                        out on the genome.
                      </title>
                                <input type="text"/>
                                <rule type="digits" message="Value must be an integer between 0 and 1000"/>
                            </param>
                            <param name="maxDivergence" label="Maximum divergence (%)">
                                <value>30</value>
                                <title>The maximum allowed divergence (in %) of a read from the reference sequence.</title>
                                <input type="text"/>
                                <rule type="digits" message="Value must be an integer between 0 and 100"/>
                            </param>
                            <param name="minAnchorSize" label="Minimum anchor size">
                                <value>12</value>
                                <title>The minimum size of the read (in base pairs) that must match against the reference.</title>
                                <input type="text"/>
                                <rule type="digits" message="Value must be an integer between 8 and 30"/>
                            </param>
                            <param name="samBam" label="Write output as a BAM file">
                                <value>True</value>
                                <title>Specify whether or not to output a BAM representation of the cmp.h5 file.</title>
                                <input type="checkbox"/>
                            </param>
                            <param name="gff2Bed" label="Write BED coverage file">
                                <value>True</value>
                                <title>Specify whether or not to output a BED representation of the depth of coverage summary.</title>
                                <input type="checkbox"/>
                            </param>
                            <param name="placeRepeatsRandomly" label="Place repeats randomly">
                                <value>True</value>
                                <title>Specify that if BLASR maps a read to more than one location with equal probability, then it randomly selects which location it chooses as the best location. If not set, defaults to the first on the list of matches.</title>
                                <input type="checkbox"/>
                            </param>
                            <param name="pbalign_opts" hidden="true">
                                <value>--seed=1 --minAccuracy=0.75 --minLength=50 --algorithmOptions="-useQuality"</value>
                            </param>
                            <param name="pulseMetrics" hidden="true">
                                <value>DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag</value>
                            </param>
                            <param name="loadPulsesOpts" hidden="true">
                                <value>bymetric</value>
                                <title>The default option of loadPulses is 'byread'. Option 'bymetric' 
                               is desined to sacrifice memory for increased speed, especially 
                               for jobs of which the number of reference contigs is large. </title>
                            </param>
                        </module>
                        <module label="BLASR Reports v1" id="P_MappingReports" editableInJob="false"/>
                    </moduleStage>
                    <moduleStage name="consensus" editable="true">
                        <module label="AssemblyPolishing v1 (Quiver)" id="P_AssemblyPolishing" editableInJob="true">
                            <description>Polish a pure-PacBio assembly for maximum accuracy using the Quiver algorithm.</description>
                            <param name="enableMapQVFilter" label="Use only unambiguously mapped reads">
                                <value>True</value>
                                <title>Specify whether or not to filter out reads where Map QV is less than 10. Reduces coverage in repeat regions that are shorter than the read length.</title>
                                <input type="checkbox"/>
                            </param>
                        </module>
                    </moduleStage>
                    <fileName>RS_HGAP_Assembly.3.xml</fileName>
                </smrtpipeSettings>
                Hopefully for the 2.3 release this problem will be fixed, at the very least SMRT Analysis will come with a set of example xml protocol files.

                Comment


                • #9
                  Originally posted by rhall View Post
                  Hopefully for the 2.3 release this problem will be fixed, at the very least SMRT Analysis will come with a set of example xml protocol files.
                  Can we hold you to that promise

                  Comment


                  • #10
                    I'm not a developer, but I have a bug / feature request number and it has been assigned to 2.3, I'm crossing my fingers

                    Comment


                    • #11
                      It has been pointed out to me that this has been partially implemented already:
                      Code:
                      smrtpipe.py --examples

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin


                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                        Yesterday, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      39 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      41 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      35 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      55 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X