I'm trying to assemble a ~70 Mb genome from a single Illumina paired-end library (100 nt reads, 160M reads per end, insert size = 300 bp).
I've tried velvet, ABySS, and SOAPdenovo with K=75. The largest N50 I get is 2 kb, with at least 77k contigs.
My questions are:
1. Generally, does one need at least two libraries (one short-insert and one long-insert) to get a good assembly?
2. With only one short-read library like this, how can I get the best possible assembly? (A rough question, I know. What I'm after is any tips on parameter tuning or preprocessing.)
3. SOAPdenovo always gives me the worst results, with a very small N50, yet publications suggest it is a good assembler for large genomes. I don't know if I did something wrong.
4. The reported coverage is about 50X, with the genome size estimated at ~70 Mb. But if I calculate coverage from my reads, it would be 160M x 2 x 100 / 70M ≈ 457x. How can it drop to 50x? FastQC did show a high duplication level in the raw reads (>70%) — could that be the reason?
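On question 4, the arithmetic can be checked directly. One hedged note: velvet's reported coverage (and its `-exp_cov`) is k-mer coverage, not per-base coverage; the velvet manual gives Ck = C × (L − k + 1) / L, so a reported 50X at k=75 on 100 nt reads corresponds to a much higher base coverage, which together with the >70% duplication may account for much of the gap. A quick sketch of both calculations, using only the numbers quoted above:

```shell
# Raw per-base coverage implied by the numbers above:
# 160M reads per end x 2 ends x 100 nt / 70 Mb genome
awk 'BEGIN { printf "raw coverage: %.0fx\n", 160e6 * 2 * 100 / 70e6 }'

# Velvet reports k-mer coverage Ck = C * (L - k + 1) / L,
# so a reported Ck = 50X at k=75, L=100 implies this base coverage:
awk 'BEGIN { printf "base coverage for Ck=50: %.0fx\n", 50 * 100 / (100 - 75 + 1) }'
```

The first command prints ~457x; the second shows that 50X k-mer coverage at k=75 corresponds to roughly 192X base coverage, so the two figures are not directly comparable.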
Here are my commands and results from each assembler:

1. velvet: (N50=2 kb, #contigs=77k, reported coverage=50X)

Code:
velveth Sample_name 75 -shortPaired -fastq Sample_R1R2_rmdp_trimmed_SOAPec.fastq
velvetg Sample_name -cov_cutoff auto -ins_length 300 -exp_cov auto

2. ABySS: (contig N50=2 kb, #contigs=460k; scaffold N50=4 kb, #scaffolds=400k)

Code:
abyss-pe n=10 name=Sample_k75 k=75 j=8 in='Sample_R1_rmdp_trimmed.fastq.cor.pair_1.fq Sample_R2_rmdp_trimmed.fastq.cor.pair_2.fq'

3. SOAPdenovo: (contig N50=170 bp; scaffold N50=300 bp, #scaffolds=47k)

Code:
SOAPdenovo-127mer all -s SOAPdenovo.config -K 75 -R -f -p 8 -F -V -o Sample_k75

Before assembly, I did the following preprocessing:

1. fastuniq to remove duplicates
2. fastq-mcf to remove adapter sequences and trim low-quality ends
3. SOAPec to error-correct the reads:

Code:
KmerFreq_HA -k 27 -f 1 -t 10 -L 101 -l fastqlistforSOAPec.lst -p Sample_k27
Corrector_HA -k 27 -l 2 -e 1 -w 1 -q 30 -r 45 -t 10 -j 1 -Q 33 -o 1 Sample_k27.freq.gz fastqlistforSOAPec.lst

I know assembly is quite case-dependent and these are open questions, but any suggestions would be highly appreciated. Thanks!
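On question 2, one common tuning step is to sweep k rather than fixing it at 75: a read of length L contributes L − k + 1 k-mers, so at k=75 each 100 nt read yields only 26, and lower k values often assemble better, especially after error correction. A minimal sketch of such a sweep, reusing the velvet flags from the run above (the `asm_k$k` output directories are placeholder names):

```shell
#!/bin/sh
# Sketch: try several k values with velvet, then compare the resulting N50s.
# Flags are the same ones used in the velvet run above; asm_k$k dirs are placeholders.
for k in 31 41 51 61 71; do
  if command -v velveth >/dev/null 2>&1; then
    velveth "asm_k$k" "$k" -shortPaired -fastq Sample_R1R2_rmdp_trimmed_SOAPec.fastq
    velvetg "asm_k$k" -cov_cutoff auto -ins_length 300 -exp_cov auto
  else
    echo "velvet not installed; skipping k=$k"
  fi
done
```

Odd k values are used because velvet requires an odd hash length; picking the k with the best N50/contig-count trade-off is then a matter of inspecting each run's stats.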