Dear all,
I'm trying to assemble the genome of a Drosophila species, but I'm having serious problems.
I have 4 Illumina libraries: 1 paired-end (PE) library (100 bp reads, 300 bp insert size) and 3 mate-pair (MP) libraries (100 bp reads; insert sizes 1.5 kb, 4.5 kb and 9 kb).
I started with two assemblers: SOAPdenovo2 with different k-mer sizes, and IDBA-UD. In parallel I'm also trying MaSuRCA and CABOG (Celera Assembler), but I'm still waiting for those results. I'm using PE or PE + 1.5 kb MP for contigs and all MPs for scaffolding.
From SOAPdenovo2 and IDBA-UD I get assemblies with a very small N50 (~160 bp). Changing the k-mer size or the libraries used for contig construction does not change the results.
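For reference, N50 is meant here in the usual sense: the contig length at which the contigs of that length or longer cover at least half of the total assembly. A minimal Python sketch of that definition follows; the function name `n50` is just illustrative, and the assemblers' own reports may filter short contigs before computing it.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")
```

With an N50 around 160 bp and 100 bp reads, most contigs are barely longer than a single read.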
So I went back to the libraries and plotted the k-mer spectrum for all four. I used Jellyfish + KAT to plot the spectra (PDF attached). The plots look quite bad: the characteristic peak is absent or masked by a very high peak of rare k-mers. Also, according to FastQC, ~30% of reads in the PE library are duplicated, versus 20%, 20% and 40% for the 1.5 kb, 4.5 kb and 9 kb MP libraries respectively. I'm guessing there is PCR bias.
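To make clear what the spectrum plots show: for each multiplicity (how many times a k-mer occurs in the reads), the number of distinct k-mers with that multiplicity. The sketch below is a toy pure-Python version of that idea, not how Jellyfish or KAT work internally. The function name `kmer_spectrum` is illustrative, and unlike a canonical count it does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that often.

    Only suitable for small inputs; k-mers containing N are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    # Histogram of the counts: multiplicity -> number of distinct k-mers.
    return Counter(counts.values())


spectrum = kmer_spectrum(["ACGTACGT", "ACGTA"], k=4)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

In a clean library the histogram normally has a clear peak near the sequencing coverage, with sequencing errors piling up at multiplicity 1 or 2. In my plots that low-multiplicity pile dominates everything else.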
Any advice?
Thanks a lot.