berthenet 02-02-2018 12:05 AM

Low K-mer coverage from a SPAdes assembly
Hi everyone!

I kind of asked the question on my presentation thread, but I think this is a better place to make sure I reach people who know about my issue.

My lab ordered sequencing of a large number of bacterial strains so that I can do gene-by-gene genomics on them. The sequencing was done on a NextSeq 500 (Illumina) and the assembly with SPAdes. For each strain I get the assembled contigs in multi-FASTA format.

Most of my assemblies look fine in terms of number of contigs (<100 contigs) once I filter out the smallest ones (<1000 bp), and their total size matches the expected genome size. However, for some of them the number of contigs remains really high, and when I check the total assembly length, I get more than 2.4 Mb for 3 genomes where I expect approximately 1.65 Mb. For one of these outlier strains, I checked the 30 largest contigs with a blastn search against the NCBI database. I noticed that some of the contigs don't match the species of interest. These contigs have a low k-mer coverage (indicated in the contig name): around 1, versus more than 200 for contigs matching the species of interest.

The cut-off between high coverage and low coverage is extremely clear in all the samples I checked, so I was thinking of simply filtering out every contig that is shorter than 1000 bp or has a coverage below 50. Do you think that is reasonable? If yes, can anyone explain to me what is in those low-coverage contigs? What exactly am I getting rid of? Is it contamination?
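To make the idea concrete, here is a minimal Python sketch of that filter. It assumes the contig names follow the usual SPAdes pattern (`NODE_<n>_length_<bp>_cov_<coverage>`); the function names and the default thresholds of 1000 bp and 50 are only illustrative, not a recommendation.

```python
import re

# Assumption: the k-mer coverage is embedded in the contig name as "_cov_<number>".
COV_RE = re.compile(r"_cov_(\d+(?:\.\d+)?)")

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_contigs(lines, min_length=1000, min_cov=50.0):
    """Keep contigs that are long enough AND have enough k-mer coverage."""
    kept = []
    for header, seq in read_fasta(lines):
        match = COV_RE.search(header)
        if match is None:
            continue  # no coverage in the name: cannot apply the filter, so skip
        if len(seq) >= min_length and float(match.group(1)) >= min_cov:
            kept.append((header, seq))
    return kept
```

It could be used on a file with something like `filter_contigs(open("contigs.fasta"))`, where the file name is a placeholder.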

Many thanks for your help!

seb567 02-02-2018 06:57 AM

This looks like contamination.

You are expecting a 1.65 Mb genome.

The sum of contig length is 2.4 Mb.

When you searched the low-coverage contigs against the NCBI database, did they return any hits?

If they did have good hits, you could align all your contigs against that hit to make a list of everything that aligns to it.
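As a rough sketch of that listing step in Python, assuming the alignment was run with BLAST and saved in tabular format (`-outfmt 6`, where the first three columns are query ID, subject ID and percent identity). The function name and the 90% identity default are hypothetical choices for illustration.

```python
def contigs_matching(blast_lines, subject_id, min_identity=90.0):
    """Return the sorted names of contigs with at least one hit to subject_id.

    blast_lines: iterable of tab-separated BLAST lines (outfmt 6 layout assumed).
    """
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if subject == subject_id and identity >= min_identity:
            hits.add(query)
    return sorted(hits)
```

The returned list is the set of contigs to set aside as coming from that hit; everything else still needs to be looked at separately.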

berthenet 02-04-2018 11:59 PM

Hi Seb,

I see your point, but not all my low-coverage contigs have a hit. The ones that do don't all have the same hit, and the hits are not very good: the largest contig matching another species has 90% identity but over only 63% of its length, and the next one has 78% identity over only 21%.

Of the 30 largest contigs I tested, 10 matched another species (but these were not very good matches, and not always the same species), and 9 did not match anything. All 19 of these contigs had low coverage (around 1). Contigs with high coverage (>200) matched my species of interest with good hits.

