I recently completed resequencing of a bacterial strain that has a complete reference genome. The HGAP3 assembly produced a single contig, but several regions (~500 bp each) came back as lowercase bases. SMRTview showed uniform coverage of unambiguously mapped subreads spanning these regions. No big deal, I thought: I'll just convert them to uppercase and run another pass through quiver. But the same regions came back lowercase, suggesting quiver didn't touch them at all.
What's interesting is that if I BLAST one lowercase region with ~200 bp of flanking sequence, 960 of 961 bases match the published reference strain (the single mismatch being a potential deletion in a homopolymer stretch of 8 C's). Most (but not all) of the other lowercase regions seem to have one deletion at a homopolymer relative to the published reference. Could homopolymers alone be throwing off the read mapping and somehow tagging that local region as "suspect"? Either way, I don't think these lowercase regions are worrisome, but I'm still confused about how they arise and how quiver handles them.
My questions are:
1) What causes quiver to produce a consensus with areas of lowercase bases, even if subreads map unambiguously to these regions without dips or spikes in coverage? Did these areas not get polished at all? If so, how can I determine consensus accuracy for the genome when indels might still be present in the lowercase regions?
2) Do lowercase bases in the consensus.fasta output factor into the calculation of consensus accuracy?
3) Does quiver ignore lowercase bases in a reference that is uploaded for the RS_Resequencing pipeline? In other words, should I make sure all bases are uppercase before uploading a reference to SMRTportal?
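
In case it helps anyone reproduce this, here is a minimal sketch of how the lowercase regions can be located in a consensus sequence and pulled out with flanking sequence for BLAST. It is plain Python and not part of quiver or SMRT Analysis; the function names, the `min_len` cutoff, and the 200 bp default flank are my own illustrative choices.

```python
import re

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, preserving letter case."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def find_lowercase_runs(seq, min_len=1):
    """Return 0-based, half-open (start, end) coordinates of lowercase runs."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[a-z]+", seq)
        if m.end() - m.start() >= min_len
    ]

def with_flank(seq, start, end, flank=200):
    """Return the run plus up to `flank` bases on each side, clipped to the contig."""
    return seq[max(0, start - flank):min(len(seq), end + flank)]

# Toy contig standing in for the real consensus sequence.
contig = "ACGT" + "acgtac" + "GGGG" + "tt" + "CC"
runs = find_lowercase_runs(contig)
excerpt = with_flank(contig, *runs[0], flank=2)
```

For a real assembly, each sequence yielded by `read_fasta("consensus.fasta")` would be passed to `find_lowercase_runs`, and each excerpt from `with_flank` written out as a BLAST query.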