The Human Microbiome Project (HMP) has a table of 'most wanted genomes' (MWGs) listing the 2698 organisms for which the community generally thinks a genome sequence would be useful. The table includes the consensus sequence of the V1V3 and/or V3V5 region of the 16S rRNA gene for each of these organisms.
Locally I have a collection of 16S V1V3 region sequences that I would like to map to those MWG 16S regions, but I am having trouble with the mapping. My initial attempt was to use NCBI BLAST to align the 16S regions of the MWGs (as queries) against a database of my own 16S V1V3 regions. However, many of my local 16S sequences show up as the strong top hit for several different MWGs, and I am getting 'strong' (by E-value) hits for almost every single MWG in the HMP collection, which is not what I had expected. Even when I filter the alignments to require what I believe is species-level similarity (at least 97% identity across at least 90% of the query length), a fair number of my local 16S sequences are still the best hit for multiple MWGs. So I think my approach is too simplistic.
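For concreteness, the filtering step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not my exact script: it assumes BLAST+ tabular output produced with `-outfmt "6 qseqid sseqid pident qstart qend qlen evalue bitscore"`, it approximates query coverage as the aligned query span divided by the query length, and the function names and sample rows are made up for the example.

```python
import csv
from collections import defaultdict

def filter_hits(lines, min_pident=97.0, min_qcov=0.90):
    """Yield (query, subject, bitscore) for hits passing both thresholds.

    Assumed column order (hypothetical, set via -outfmt):
    qseqid sseqid pident qstart qend qlen evalue bitscore
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, qstart, qend, qlen, _evalue, bitscore = row[:8]
        # Fraction of the query covered by this single alignment.
        coverage = (abs(int(qend) - int(qstart)) + 1) / int(qlen)
        if float(pident) >= min_pident and coverage >= min_qcov:
            yield qseqid, sseqid, float(bitscore)

def best_hit_per_query(hits):
    """Keep the highest-bitscore subject for each query (each MWG)."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best

def subjects_shared_by_queries(best):
    """Return local sequences that are the best hit for more than one MWG."""
    by_subject = defaultdict(list)
    for query, (subject, _score) in best.items():
        by_subject[subject].append(query)
    return {s: sorted(qs) for s, qs in by_subject.items() if len(qs) > 1}

# Made-up rows for illustration only (MWGs as queries, local sequences as subjects).
SAMPLE = (
    "MWG_1\tlocal_A\t99.1\t1\t480\t500\t0.0\t850\n"
    "MWG_2\tlocal_A\t98.0\t1\t470\t500\t0.0\t820\n"
    "MWG_3\tlocal_B\t95.0\t1\t490\t500\t0.0\t700\n"
    "MWG_4\tlocal_C\t99.0\t1\t300\t500\t1e-150\t540\n"
)

best = best_hit_per_query(filter_hits(SAMPLE.splitlines()))
shared = subjects_shared_by_queries(best)
print(len(best), "MWGs with a passing hit")
print(shared)
```

With real data, `len(best)` would be the number of MWGs with at least one hit passing the thresholds, and `shared` lists the local sequences that end up as the best hit for more than one MWG, which is exactly the ambiguity I am running into.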
Could someone suggest a good way to do this? My goal is to get an idea of how many of these MWGs are represented in my local collection of 16S sequences.
Thanks,
John Martin