Hi all,
I sequenced genomic DNA (Illumina GAII, 75 bp paired-end reads, average insert size 250 bp) from a sample that contains both a host's and its parasite's DNA and is enriched for the parasite's DNA.
I'm only interested in assembling the parasite's genome, and there is no reference genome for it, so I'm doing de novo assembly.
As a filtering step I removed all reads that mapped to the host's EST data (there's no reference genome for the host either), but that probably only removed protein-coding DNA.
So I'm still stuck with non-coding DNA from the host, which I would like to remove. My thought was that since the reads are enriched with the parasite's DNA, contigs originating from the host should have low coverage.
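To make the idea concrete, here is a rough sketch of the coverage filter I have in mind. It assumes I already have, for each contig, its length and the total number of read bases aligned to it. The contig names, the numbers, and the `min_cov` cutoff are all made up for illustration; the right cutoff would have to come from the actual coverage distribution.

```python
def mean_coverage(contig_lengths, aligned_bases):
    """Mean depth per contig = aligned read bases / contig length."""
    return {
        name: aligned_bases.get(name, 0) / length
        for name, length in contig_lengths.items()
        if length > 0
    }


def split_by_coverage(coverage, min_cov):
    """Split contig names into (keep, drop) around a coverage cutoff."""
    keep = sorted(n for n, c in coverage.items() if c >= min_cov)
    drop = sorted(n for n, c in coverage.items() if c < min_cov)
    return keep, drop


# Hypothetical example values, not real data.
lengths = {"contig_1": 1000, "contig_2": 500, "contig_3": 100}
bases = {"contig_1": 30000, "contig_2": 1500}  # contig_3 has no mapped reads

cov = mean_coverage(lengths, bases)
keep, drop = split_by_coverage(cov, min_cov=10.0)  # placeholder cutoff
print("keep:", keep)
print("drop:", drop)
```

With a filter like this, contigs with no mapped reads end up in the drop list together with the genuinely low-coverage ones, so it cannot tell the two cases apart.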
The problem is that the assembly, produced with SOAPdenovo, builds contigs from k-mers, and as a result 80% of the contigs are as short as 100 bp and no reads map to them, so they have 0 coverage.
So my question is how do I work around this problem?
Should I use a larger k-mer size? Should I assume that these short contigs really do represent the host's non-coding DNA and just filter them out?
Thanks