Hi everyone!
I kind of asked the question on my presentation thread, but I think this is a better place to make sure I reach people who know about my issue.
So my lab ordered sequencing for a lot of bacterial strains so that I can do gene-by-gene genomics on them. The sequencing was done on a NextSeq 500 (Illumina), and the assembly with SPAdes. I get each assembly as a multi-FASTA file of contigs.
Most of my assemblies look fine in terms of number of contigs (<100 contigs) once I filter out the smallest ones (<1000 bp), and their total size matches the expected genome size. However, for some of them the number of contigs remains really high, and when I check the total assembly length, I obtain three genomes of more than 2.4 Mb when I expect approximately 1.65 Mb. For one of these outlier strains, I checked the 30 largest contigs with a BLASTN search against the NCBI database. I noticed that some of the contigs don't match the species of interest. These contigs have a low k-mer coverage (indicated in the name of the contig): around 1, against more than 200 for contigs matching the species of interest.
The cut-off between high coverage and low coverage is extremely clear in all the samples I checked, so I was thinking of simply filtering out everything that is shorter than 1000 bp, as well as everything with a coverage below 50. Do you think that is a reasonable approach? If yes, can anyone explain to me what is in those low-coverage contigs? What exactly am I getting rid of? Is it contamination?
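To make the filter concrete, here is a minimal sketch of what I have in mind, not a tested pipeline. It assumes the contig names follow the usual SPAdes pattern `NODE_<n>_length_<len>_cov_<coverage>`; the function names `read_fasta` and `filter_contigs` and the default thresholds are just my own choices for illustration.

```python
import re

# Assumed SPAdes-style contig name, e.g. NODE_1_length_12345_cov_210.5
HEADER_RE = re.compile(r"NODE_\d+_length_(\d+)_cov_([\d.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_len=1000, min_cov=50.0):
    """Keep contigs that are at least min_len bp long AND have k-mer
    coverage of at least min_cov (coverage is read from the contig name)."""
    kept = []
    for header, seq in read_fasta(lines):
        match = HEADER_RE.search(header)
        if match is None:
            continue  # name not in the assumed format: contig is skipped
        coverage = float(match.group(2))
        if len(seq) >= min_len and coverage >= min_cov:
            kept.append((header, seq))
    return kept
```

I would call it on an open file handle, for example `filter_contigs(open("contigs.fasta"))` (the file name is a placeholder), and compare the summed length of the kept contigs with the expected 1.65 Mb.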
Many thanks for your help!