RS_HGAP_Assembly
HGAP version 3. PacBio de novo assembler optimized for speed.
3 active reference True /PHShome/ry077/bin/smrtanalysis/userdata/references/2kb_control
common/protocols/preprocessing/Fetch.1.xml
common/protocols/filtering/PreAssemblerSFilter.1.xml
common/protocols/control/KeepControlReads.1.xml
common/protocols/assembly/PreAssemblerHGAP.3.xml
common/protocols/assembly/AssembleUnitig.1.xml
common/protocols/referenceuploader/ReferenceUploaderUnitig.1.xml
common/protocols/mapping/BLASR_Resequencing.1.xml
common/protocols/consensus/AssemblyPolishing.1.xml
Sets up inputs
Filter reads for use in the pre-assembly step of HGAP, the hierarchical genome assembly process.
3000 Subreads shorter than this value (in base pairs) are filtered out and excluded from analysis.
0.80 Polymerase reads with lower quality than this value are filtered out and excluded from analysis.
100 Polymerase reads shorter than this value (in base pairs) are filtered out and excluded from analysis.
/data/talkowski/Samples/XDP/PacBioBAC/BACassembly/blasr/whitelist.txt
Using a DAG-based consensus algorithm, pre-assemble long reads as the first step of the Hierarchical Genome Assembly process (HGAP). Version 2 is a stepping stone for scaling to much larger genomes.
False Specify whether or not to compute the minimum seed read length that results in at least 30X target genome coverage by the longest subreads. This is based on the genome size you specified.
2000 The minimum length of reads (in base pairs) to use as seeds for pre-assembly.
6 The number of pieces to split the data files into while running PreAssembler.
10 The number of alignments to consider for each read for a particular chunk.
24 The number of potential alignments BLASR should consider across all chunks for a particular read.
6 The minimum coverage to maintain correction for a read. If the coverage falls below that threshold, the read will be broken at that junction.
-noSplitSubreads -minReadLength 200 -maxScore -1000 -maxLCPLength 16 The -bestn and -nCandidates options should be approximately equal to the expected seed read coverage.
This module runs Celera Assembler v8.1 to the unitig step, then finishes with our custom unitig consensus caller.
124000 The approximate genome size, in base pairs.
pacbioReads 500
25 Fold coverage to target for when picking the minimum fragment length for assembly; typically 15 to 25.
0.06 Trimming and assembly overlaps above this error limit won't be detected.
40 Overlaps shorter than this length (in base pairs) are not computed.
14 The length of the seeds (in base pairs) used by the seed-and-extend algorithm.
The path to an existing specification file used to run the assembly program.
1 analysis/etc/celeraAssembler/unitig.spec False False True reference sawriter -blt 8 -welter samtools faidx
BLASR maps reads to genomes by finding the highest scoring local alignment or set of local alignments between the read and the genome. The first set of alignments is found by querying an index of the reference genome, and then refining until only high scoring alignments are retained. Additional pulse metrics are loaded into the resulting cmp.h5 file to enable downstream use of the Quiver algorithm.
10 The maximum number of matches of each read to the reference sequence that will be evaluated. maxHits should be greater than the expected number of repeats if you want to spread hits out on the genome.
30 The maximum allowed divergence (in %) of a read from the reference sequence.
12 The minimum size of the read (in base pairs) that must match against the reference.
True Specify whether or not to output a BAM representation of the cmp.h5 file.
True Specify whether or not to output a BED representation of the depth of coverage summary.
True Specify that if BLASR maps a read to more than one location with equal probability, it randomly selects which location to use as the best location. If not set, defaults to the first on the list of matches.
--seed=1 --minAccuracy=0.75 --minLength=50 --concordant --algorithmOptions="-useQuality" Specify additional Pbalign options. For advanced users only.
DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag
bymetric The default option of loadPulses is 'byread'. Option 'bymetric' is designed to sacrifice memory for increased speed, especially for jobs with a large number of reference contigs.
Polish a pure-PacBio assembly for maximum accuracy using the Quiver algorithm.
True Specify whether or not to filter out reads where Map QV is less than 10. Reduces coverage in repeat regions that are shorter than the read length.
settings.xml
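
The pre-assembly description above mentions computing the minimum seed read length that gives at least 30X target genome coverage from the longest subreads. The following is a minimal sketch of one way such a cutoff could be derived. It is not the SMRT Analysis implementation; the function name min_seed_length and its arguments are hypothetical and chosen only for illustration.

```python
def min_seed_length(subread_lengths, genome_size, target_coverage=30):
    """Return a seed length cutoff (hypothetical helper, not a SMRT Analysis API).

    Walks the subreads from longest to shortest and returns the length at
    which the accumulated bases first reach target_coverage * genome_size.
    Returns None if the data cannot reach that coverage at all.
    """
    needed_bases = target_coverage * genome_size
    accumulated = 0
    for length in sorted(subread_lengths, reverse=True):
        accumulated += length
        if accumulated >= needed_bases:
            # Every subread at least this long is counted in `accumulated`,
            # so using this length as the cutoff meets the coverage target.
            return length
    return None


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not taken from any real run.
    toy_lengths = [5000, 4000, 3000, 2000, 1000]
    print(min_seed_length(toy_lengths, genome_size=100, target_coverage=90))
```

In the settings listed here the automatic computation is set to False, so the fixed 2000 bp minimum seed length would presumably apply; a calculation like the sketch would only matter if that option were enabled together with the 124000 bp genome size.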