I have been trying to assemble some sequences using MaSuRCA. It runs to completion, but I recently noticed something amiss.
For input, I have three paired-end Illumina libraries, three single-end Illumina libraries, and a couple hundred individual FASTA sequences.
The FASTA sequences were converted to *.FRG format. The individual sequences range in length from ~150bp to over 40kbp.
The FASTA sequences, almost as a rule, do not seem to be passed through to the final output assemblies. I would understand (although be a bit surprised) if MaSuRCA could not extend the longer FASTA sequences with additional reads, but I'm concerned that known, large, contiguous blocks of sequence are not being passed through to my final assemblies.
Has anyone else run into this issue? I've seen it with v2.0.1.4 and v2.2.1. Attempts to contact the program authors haven't worked out to date.
An example config file:
Code:
DATA
PE= D1 533 105 /ifs/bulk/rdouglas/bowtie1_Bseqs_unaligned_perfect/run01_1_paired_notB73perfect.fastq /ifs/bulk/rdouglas/bowtie1_Bseqs_unaligned_perfect/run01_2_paired_notB73perfect.fastq
PE= D2 747 363 /ifs/bulk/rdouglas/bowtie1_Bseqs_unaligned_perfect/run02_1_paired_notB73perfect.fastq /ifs/bulk/rdouglas/bowtie1_Bseqs_unaligned_perfect/run02_2_paired_notB73perfect.fastq
PE= D3 550 83 /ifs/bulk/rdouglas/Masurca/paired_R1_B_repeat.fastq.gz /ifs/bulk/rdouglas/Masurca/paired_R2_B_repeat.fastq.gz
PE= D4 95 15 /ifs/bulk/rdouglas/bowtie1_Bseqs_unaligned_perfect/unpaired_notB73perfect.fastq
PE= D5 290 20 /ifs/bulk/rdouglas/seq_496/unpaired_output_R1_Q20.fastq
PE= D6 290 20 /ifs/bulk/rdouglas/seq_496/unpaired_output_R2_Q20.fastq
OTHER= /ifs/bulk/rdouglas/Masurca/B_GSS.frg
OTHER= /ifs/bulk/rdouglas/Masurca/nate_ellis_seqs.frg
OTHER= /ifs/bulk/rdouglas/Masurca/theuri_seqs.frg
END
PARAMETERS
#this is k-mer size for deBruijn graph values between 25 and 101 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE=auto
#set this to 1 for Illumina-only assemblies and to 0 if you have 2x or more long (Sanger, 454) reads
USE_LINKING_MATES=0
#this parameter is useful if you have too many jumping library mates. Typically set it to 60 for bacteria and something large (300) for mammals
LIMIT_JUMP_COVERAGE = 150
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically. for mammals do not set cgwErrorRate above 0.15!!!
CA_PARAMETERS = ovlMerSize=30 cgwErrorRate=0.25 ovlMemory=4GB
#minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if coverage >100
KMER_COUNT_THRESHOLD = 1
#auto-detected number of cpus to use
NUM_THREADS=24
#this is mandatory jellyfish hash size
JF_SIZE=900000000
#this specifies if we do (1) or do not (0) want to trim long runs of homopolymers (e.g. GGGGGGGG) from 3' read ends, use it for high GC genomes
DO_HOMOPOLYMER_TRIM=1
END
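
For anyone who wants to reproduce the pass-through check, below is a minimal sketch of one way it could be done. It assumes the original FASTA sequences and the final assembly are both available as plain FASTA files, and it only tests for exact containment on either strand. The function names and the command-line usage are hypothetical, not part of MaSuRCA. Because an assembler can change individual bases when it builds a consensus, an exact-match test is strict; sequences reported as missing here may still be present with small differences, which an aligner such as BLAST or nucmer would be better suited to detect.

```python
import sys

# Translation table for complementing unambiguous DNA bases (plus N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def missing_sequences(known, contigs):
    """Return names of known sequences not contained exactly in any contig.

    Both arguments are iterables of (name, sequence) pairs. Each known
    sequence is searched for on both strands.
    """
    contig_seqs = [seq for _, seq in contigs]
    missing = []
    for name, seq in known:
        rc = reverse_complement(seq)
        if not any(seq in contig or rc in contig for contig in contig_seqs):
            missing.append(name)
    return missing


# Hypothetical usage: python check_passthrough.py known.fasta assembly.fasta
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as known_fh, open(sys.argv[2]) as contig_fh:
        known = list(read_fasta(known_fh))
        contigs = list(read_fasta(contig_fh))
    for name in missing_sequences(known, contigs):
        print(name)
```

This is a brute-force substring search, which should be acceptable for a couple hundred input sequences but would be slow for much larger sets.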