HGAP_Assembly_Advanced
HGAP (Hierarchical Genome Assembly Process) performs high quality de novo assembly using a single PacBio library prep. HGAP consists of pre-assembly, de novo assembly with Celera Assembler, and assembly polishing with Quiver.
1 active reference True
common/protocols/preprocessing/Fetch.1.xml
common/protocols/filtering/PreAssemblerSFilter.1.xml
common/protocols/assembly/PreAssemblerHGAP.1.xml --overlapTolerance 100 --trimHit 50
common/protocols/assembly/CeleraAssemblerHGAP.1.xml
common/protocols/referenceuploader/ReferenceUploaderHGAP.1.xml
common/protocols/mapping/BLASR.1.xml
common/protocols/consensus/AssemblyPolishing.1.xml
Sets up inputs
Filter reads for use in the pre-assembly step of HGAP, the hierarchical genome assembly process.
whitelist.txt
The minimum subread length. Shorter subreads will be filtered and excluded from further analysis. 500
The minimum polymerase read quality determines the quality cutoff. Polymerase reads with lower quality will be filtered and excluded from further analysis. 0.80
The minimum polymerase read length. Shorter polymerase reads will be excluded from further analysis. 500
Pre-assemble long reads as the first step of the Hierarchical Genome Assembly Process (HGAP).
False False False False
Compute "Minimum Seed Read Length" True
Minimum length of reads to use as seeds for pre-assembly 500
The -bestn and -nCandidates options should be approximately equal to the expected seed read coverage -minReadLength 200 -maxScore -1000 -bestn 24 -maxLCPLength 16 -nCandidates 100
-L
--overlapTolerance 100 --trimHit 50
60
Allows partially aligned reads to participate in pre-assembled read consensus. False
Trims the low-quality regions from the FASTQ sequence entries. True --qvCut=50 --minSeqLen=500
Assemble with CCS reads instead of subreads. In most cases assembling with subreads will be preferred.
False
This module wraps the Celera Assembler v7.0
False False False False False False False False
Approximate genome size in base pairs 320000
pacbioReads True 500
Seconds to wait for runCA outputs to be copied into job dir. 600
Fold coverage to target when picking frgMinLen for assembly. Typically 15 to 25. 15
Overlapper error rate 0.06
Overlaps shorter than this length are not computed. 40
Sets the length of the seeds used by the seed and extend algorithm. 14
Enter the server path to an existing spec file
True False reference
sawriter -blt 8 -welter
createSequenceDictionary
samtools faidx
BLASR maps reads to genomes by finding the highest scoring local alignment or set of local alignments between the read and the genome. The first set of alignments is found by querying an index of the reference genome, and then refining until only high scoring alignments are retained. Additional pulse metrics are loaded into the resulting cmp.h5 file to enable downstream use of the Quiver algorithm.
The maximum number of matches of each read to the reference sequence that will be evaluated. maxHits should be greater than the expected number of repeats if you want to spread hits out on the genome. 10
The maximum allowed divergence of a read from the reference sequence. 30
The minimum anchor size defines the length of the read that must match against the reference sequence. 12
True True True
--seed=1 --minAccuracy=0.75 --minLength=50 --useQuality
DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag
The default option of loadPulses is 'byread'. Option 'bymetric' is designed to sacrifice memory for increased speed, especially for jobs with a large number of reference contigs. bymetric
Polish a pure-PacBio assembly for maximum accuracy using the Quiver algorithm.
Filter out reads with Map QV less than 10. Coverage in repeat regions shorter than the read length will be reduced. True
HGAP_Assembly_Advanced.1.xml
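The "fold coverage to target when picking frgMinLen" setting, together with the approximate genome size, suggests a length cutoff chosen so that the retained reads reach a target number of bases. The Python sketch below illustrates that idea only. It is not the code used by the Celera Assembler module; the function name pick_min_length and the longest-reads-first selection rule are assumptions made for illustration. The genome size (320000) and coverage (15) are the values listed above.

```python
# Illustrative sketch only: choose a minimum read length so that reads at or
# above the cutoff supply roughly genome_size * target_coverage bases.
# pick_min_length is a hypothetical helper, not part of SMRT Analysis or
# Celera Assembler.

GENOME_SIZE = 320000      # "Approximate genome size in base pairs"
TARGET_COVERAGE = 15      # "Fold coverage to target ... Typically 15 to 25."


def pick_min_length(read_lengths, genome_size, target_coverage):
    """Return a length cutoff for the given reads.

    Reads are taken longest first until their summed length reaches
    genome_size * target_coverage. The length of the last read taken is the
    cutoff. If the reads never reach the target, the shortest read length is
    returned (all reads are kept). An empty input returns 0.
    """
    target_bases = genome_size * target_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return min(read_lengths) if read_lengths else 0


if __name__ == "__main__":
    # Toy read lengths, invented purely for demonstration.
    toy_reads = [1000, 800, 600, 400, 200]
    print(pick_min_length(toy_reads, genome_size=1000, target_coverage=2))
```

With the values listed above, the target would be 320000 * 15 = 4,800,000 bases. Raising the coverage target lowers the cutoff and admits shorter reads.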