Hi,
I have 250 bp paired-end sequencing data (Illumina MiSeq, k-mer coverage ~18) from four E. coli strains. I want to determine how similar the strains are.
I imagine this could be done on the basis of the raw data alone, i.e. without assembling the individual genomes (at least as a first rough approximation of similarity). Can anyone suggest a good approach for this?
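To make the raw-data idea concrete, here is a minimal sketch of what I have in mind: collect the k-mers of each strain's reads, drop k-mers seen fewer than a few times (most likely sequencing errors), and compare the remaining sets with a Jaccard index. The function names, k=21 and min_count=2 are my own placeholder choices, not recommendations, and in practice the reads would come from the FASTQ files rather than a Python list.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def solid_kmers(reads, k=21, min_count=2):
    """Set of canonical k-mers seen at least min_count times in the reads.

    k and min_count are illustrative assumptions; k-mers below min_count
    are treated as probable sequencing errors and discarded.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or other ambiguous symbols.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return {kmer for kmer, c in counts.items() if c >= min_count}


def jaccard(a, b):
    """Jaccard index of two k-mer sets (1.0 = identical, 0.0 = disjoint)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

With one k-mer set per strain, jaccard() on each pair of strains would give a 4 x 4 similarity matrix. For real MiSeq data a dedicated k-mer counter would be needed for speed and memory; this only illustrates the comparison itself.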
Another option would be to take the largest scaffold currently available for one strain, map the reads of each strain onto it, and compare. The data are all from the same sequencing run, and the strains cannot be told apart by eye on the basis of the FASTQ quality metrics. I think it is reasonable to assume that technical errors are equally distributed across the strains, so after trimming and quality filtering with the same settings, dissimilarities can be assessed. To determine the number of SNPs, though, I would need to take the sequencing error rate into account (0.80%, http://bmcgenomics.biomedcentral.com...71-2164-13-341). Since many sequencing errors are discarded during assembly, I don't know how to disentangle true SNPs from sequencing errors. Any suggestions on how to tackle this issue are appreciated.
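For the SNP versus error question, this is a rough sketch of the kind of reasoning I am considering: at a site with a given read depth, ask how many mismatching reads would be needed before sequencing errors alone become an implausible explanation. The assumptions are all mine: errors are independent with a per-base rate of 0.80%, every error is counted as if it supported the same alternative base (which is conservative), the genome is about 5 Mb, and the cutoff is Bonferroni-corrected over all sites. Note also that read depth is not the same as the k-mer coverage of ~18 quoted above.

```python
from math import comb


def binom_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))


def min_supporting_reads(depth, error_rate=0.008,
                         genome_size=5_000_000, alpha=0.05):
    """Smallest number of mismatching reads unlikely to be errors alone.

    error_rate, genome_size and alpha are illustrative assumptions.
    Returns None if even all reads disagreeing would not pass the cutoff.
    """
    # Bonferroni correction: one test per genome position.
    threshold = alpha / genome_size
    for x in range(1, depth + 1):
        if binom_tail(depth, x, error_rate) < threshold:
            return x
    return None
```

Calling min_supporting_reads(depth) for the depth observed at a site gives a minimum read support under these assumptions. A proper variant caller would additionally use base and mapping qualities, so this is only meant to show the principle.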