Hi,
I'm working on de novo assembly of an insect genome. Our paired-end libraries were made from 10 ng of sheared DNA using a Rubicon ThruPLEX kit with 9 cycles of PCR. The bioanalyzer trace shows a mean insert size of 476 bp, and we sequenced around 120 million 125 base read-pairs from this library. We're expecting a genome size of around 500 Mbp (but that's a pretty rough guess).
I'm having trouble getting contiguous assemblies. Among others, I've tried edena (L50 = 287 bp as reported by BBTools stats.sh) and velvet (L50 = 482 bp). I'm new to de novo assembly, but 9 cycles of PCR sounds like a lot to me, and I'm wondering if the library complexity is too low. Also, fastqc reports a GC content of around 30 %, which I know can exacerbate PCR bias.
To troubleshoot, I'm looking at the k-mer frequency histograms generated by BBNorm before and after normalisation (below). I can't see a peak in either histogram, but I'm not sure what that means. Can anyone help me interpret these plots or suggest further troubleshooting steps?
In case it's relevant, the processing I did before assembly is: quality trimming (Q < 30 at 3´ end) and adaptor trimming (TruSeq indexed adaptor and TruSeq universal adaptor) with cutadapt; contaminant filtering using PhiX, sequencing_artifacts and adapters_no_transposase references (BBDuk); normalisation and error correction to target k-mer coverage of 57 (BBNorm).
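For context, here is a rough back-of-envelope estimate of the coverage implied by the numbers above. It is only a sketch under several assumptions: the 500 Mbp genome size guess is right, every read is still 125 bases long (it ignores the bases lost to trimming and filtering), and k = 31, which I believe is the BBNorm default but is not a value stated above. The function and variable names are just for illustration.

```python
def base_coverage(read_pairs, read_length, genome_size):
    """Expected per-base depth: total sequenced bases divided by genome size."""
    return 2 * read_pairs * read_length / genome_size


def kmer_coverage(base_cov, read_length, k):
    """Expected k-mer depth: each read of length L contains L - k + 1 k-mers."""
    return base_cov * (read_length - k + 1) / read_length


# Figures from the post; GENOME_SIZE is a rough guess and K is an assumption.
READ_PAIRS = 120_000_000
READ_LENGTH = 125
GENOME_SIZE = 500_000_000
K = 31

cov = base_coverage(READ_PAIRS, READ_LENGTH, GENOME_SIZE)
kcov = kmer_coverage(cov, READ_LENGTH, K)

print(f"Estimated base coverage:  {cov:.1f}x")
print(f"Estimated k-mer coverage: {kcov:.1f}x (k = {K})")
```

Under these assumptions the raw data gives roughly 60x base coverage and about 46x k-mer coverage, which is already below the normalisation target of 57. If the genome is larger than 500 Mbp, or trimming removed a lot of sequence, the real depth would be lower still.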
Please let me know if any more information would help.
Thanks for reading,
Tom