
oakeley 01-18-2014 02:41 AM

FALCON assembler
I am trying to figure out the new diploid assembler (FALCON) from PacBio. I have a really silly question. The first step parameters (according to devnet) are:

python queries.fofn targets.fofn m4.fofn 72 0 16 8 64 50 50 | > p-reads-0.fa

There are three "file of files" requested but it is unclear which smrtcell files need to be in them. I guess that one file should have the "bax.h5" files, another the "bas.h5" (perhaps) but after that I am a bit stuck...

If anyone has got this to work could you post an example of which files should be linked in the queries, targets and m4 files?



phenotype 01-21-2014 05:56 PM

The developer says he is working on a step-by-step tutorial. The short answer is that the three fofn files are generated by the script from the HBAR-DTK repo on GitHub, so you can check it out to see what it is doing.

rhall 01-22-2014 09:58 AM

I'm not aware of anyone other than the developer who has run this; you are definitely on the bleeding edge. I'm going to give it a try myself and will post my experiences. As the previous poster pointed out, the first step is to generate the overlap/alignment information for the raw reads using

curious.genome 02-14-2014 06:58 AM

Do we have any updates on this? Has the OP figured out how to get FALCON working?

rhall 02-14-2014 11:30 AM

I have been using FALCON and it is relatively straightforward. My notes:

Install HBAR-DTK into a virtual env-

Then install FALCON. I had to correct the installed versions of pyparsing and rdflib:

pip install pyparsing==1.5.7
pip install rdflib==4.0.1
pip install git+


cp <SMRT_analysis>/analysis/bin/sawriter <virtual env>/bin/
Then run using the following cfg file. Note that a lot of the options are not required for FALCON, but I've left them in:

# list of files of the initial bas.h5 files
input_fofn = input.fofn

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 6000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 6000

# The read quality cutoff used for seed reads
RQ_threshold = 0.75

# SGE job option for distributed mapping
sge_option_dm = -pe smp 8 -q secondary

# SGE job option for m4 filtering
sge_option_mf = -pe smp 4 -q secondary

# SGE job option for pre-assembly
sge_option_pa = -pe smp 16 -q secondary

# SGE job option for CA
sge_option_ca = -pe smp 4 -q secondary

# SGE job option for Quiver
sge_option_qv = -pe smp 16 -q secondary

# SGE job option for "qsub -sync y" to sync jobs in the different stages
sge_option_ck = -pe smp 1 -q secondary

sge_option_qf = -pe smp 8 -q secondary

# blasr for initial read-read mapping for each chunk (do not specify the "-out" option).
# One might need to tune the bestn parameter to match the number of distributed chunks to get more optimized results
blasr_opt = -nCandidates 50 -minMatch 12 -maxLCPLength 15 -bestn 24 -minPctIdentity 70.0 -maxScore -1000 -nproc 8

#This is used for running quiver, not required for FALCON
SEYMOUR_HOME = <SMRT Analysis install>

#The number of best alignment hits used for pre-assembly
#It should be about the same as the final PLR coverage, slightly higher might be OK.
bestn = 36

# target choices are "pre_assembly", "draft_assembly", "all"
# "pre_assembly" : generate pre_assembly for any long read assembler to use
# "draft_assembly": automatic submit CA assembly job when pre-assembly is done
# "all" : submit job for using Quiver to do final polish
target = mapping

# number of chunks for distributed mapping
preassembly_num_chunk = 8

# number of chunks for pre-assembly.
# One might want to use bigger chunk data sizes (smaller dist_map_num_chunk) to
# take the advantage of the suffix array index used by blasr
dist_map_num_chunk = 2

# "tmpdir" is for preassembly. A lot of small files are created and deleted during this process.
# It would be great to use a ramdisk for this. Setting tmpdir to an NFS mount will probably give very bad performance.
tmpdir = /tmp

# "big_tmpdir" is for quiver, better in a big disk
big_tmpdir = /tmp

# various trimming parameters
min_cov = 8
max_cov = 64
trim_align = 50
trim_plr = 50

# number of processes used by blasr during the preassembly process
q_nproc = 16
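
The input_fofn setting in the cfg points at a plain-text "file of file names": one bas.h5 path per line. A minimal sketch in Python for building it, assuming all of your bas.h5 files sit somewhere under one directory. The helper name write_input_fofn and the use of absolute paths are my own choices, not something HBAR-DTK requires as far as I know:

```python
import os


def write_input_fofn(data_dir, fofn_path="input.fofn"):
    """Write one absolute *.bas.h5 path per line; return the paths written."""
    paths = []
    # Walk the whole tree so per-SMRT-cell subdirectories are picked up too.
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith(".bas.h5"):
                paths.append(os.path.abspath(os.path.join(root, name)))
    paths.sort()
    with open(fofn_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

Calling write_input_fofn("/path/to/smrtcells") from the working directory would leave an input.fofn next to HBAR.cfg; the directory shown here is a placeholder.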


python <virtual env>/bin/ HBAR.cfg
You should now have the m4 file for input into FALCON.

To run on a single node as separate consecutive jobs (note this can be distributed using a queuing system):

for i in {0..15}; do
    python <virtual env>/bin/ ./0-fasta_files/queries.fofn ./0-fasta_files/targets.fofn ./2-preads-falcon/m4_files.fofn 72 ${i} 16 8 64 50 50 > p-reads-${i}.fasta
done
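
The loop runs the 16 chunks one after another. If the node has spare cores, the same commands could be launched a few at a time. A rough sketch in Python, under these assumptions: "wrapper" is a placeholder for the script path used in the loop, the "16" is the total chunk count (the 0..15 index range suggests so), and all other positional arguments are copied unchanged. The function names are hypothetical:

```python
import subprocess
from multiprocessing.pool import ThreadPool

# fofn paths copied verbatim from the shell loop
FOFNS = [
    "./0-fasta_files/queries.fofn",
    "./0-fasta_files/targets.fofn",
    "./2-preads-falcon/m4_files.fofn",
]


def chunk_command(wrapper, index, n_chunks=16):
    """Build the argument list for one chunk; only the chunk index varies."""
    return (["python", wrapper] + FOFNS +
            ["72", str(index), str(n_chunks), "8", "64", "50", "50"])


def run_chunk(job):
    """Run one chunk, redirecting stdout to p-reads-<index>.fasta."""
    wrapper, index, n_chunks = job
    with open("p-reads-%d.fasta" % index, "w") as out:
        return subprocess.call(chunk_command(wrapper, index, n_chunks), stdout=out)


def run_all(wrapper, n_chunks=16, workers=4):
    """Run all chunks, at most `workers` at a time; return the exit codes."""
    pool = ThreadPool(workers)
    try:
        jobs = [(wrapper, i, n_chunks) for i in range(n_chunks)]
        return pool.map(run_chunk, jobs)
    finally:
        pool.close()
        pool.join()
```

Threads are enough here because each worker only waits on an external process. Keep workers small if each chunk itself uses several cores; any non-zero value in the returned list means that chunk should be rerun.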

Join all the preassembled reads:

cat p-reads-*.fasta > preads.fa
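
Before moving on, it may be worth checking that nothing was lost in the join. A small sketch in Python that counts FASTA header lines in the per-chunk files and in the joined file; the two numbers should match. The helper names are made up for this example, and the paths are whatever you used in the commands above:

```python
import glob


def count_fasta_records(path):
    """Count lines starting with '>' (one per FASTA record)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def check_join(chunk_glob, joined):
    """Return (records summed over chunk files, records in the joined file)."""
    per_chunk = sum(count_fasta_records(p) for p in glob.glob(chunk_glob))
    return per_chunk, count_fasta_records(joined)
```

For example, check_join("p-reads-*.fasta", joined) with joined set to the output of the cat command. A mismatch usually means a chunk was still being written or the glob picked up a stale file.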
Generate overlaps:
Code: --min_len 4000 --n_core 24 --d_core 3 preads.fa > preads.ovlp
Code: preads.ovlp  preads.fa
Hopefully this will allow people to get started with FALCON, a better howto is in the works.
