11-02-2016, 11:14 AM   #41
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by evanname
Brian, thank you so much for the excellent tools!

Is it possible to say at what level the error correction would be able to distinguish between sequencing errors and heterogeneity in the source sample?

For example, if the source was a 500bp PCR product and 2% of the molecules had a substitution at base 100, would BBnorm flag that as an error? Is there an approximate percent heterogeneity at any particular base that serves as the dividing line between 'error' and 'SNP'?

Thanks!
Sorry for the very late reply, but anyway -

I recommend using Tadpole for error-correction now; it is substantially better than BBNorm because it uses exact kmer counts and algorithms designed to take advantage of the exact counts. I now use BBNorm only for normalization and for plotting kmer-frequency histograms of datasets too big to fit into memory, not for error-correction.

I don't recommend doing error-correction at all on data in which you are hoping to find rare SNPs. That said, by default, BBNorm determines a base to be in error if there is at least a 1:140 ratio of kmer counts between it and the adjacent kmers, so a 2% SNP (roughly a 1:50 ratio) should be safe. Tadpole, on the other hand, defaults to a 1:16 ratio for detecting errors, which is much more aggressive and would wipe out a 2% SNP. Why is it more aggressive? Well... I tried to optimize the parameters for the best SPAdes assemblies, and SPAdes seems to perform best with pretty aggressive error-correction. You can change that threshold, though.
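
To make the ratio arithmetic concrete, here is a rough sketch in Python. It is not how BBNorm or Tadpole work internally. It only assumes that kmers covering a variant appear at roughly the variant's fraction of the depth of adjacent kmers, and then compares that ratio against the two default thresholds quoted above (1:140 and 1:16). The function names are made up for illustration.

```python
def count_ratio(variant_fraction):
    """Approximate N in the 1:N ratio between kmers covering a variant
    and adjacent kmers at full depth (simplifying assumption)."""
    return 1.0 / variant_fraction


def would_be_flagged(variant_fraction, ratio_threshold):
    """True if the variant's kmers are rare enough to meet the 1:ratio_threshold cutoff."""
    return count_ratio(variant_fraction) >= ratio_threshold


def cutoff_fraction(ratio_threshold):
    """Variant fraction at which the ratio exactly equals the threshold.
    Variants at or below this fraction would be treated as errors in this toy model."""
    return 1.0 / ratio_threshold


if __name__ == "__main__":
    snp_fraction = 0.02  # the 2% substitution from the question
    # Thresholds are the defaults quoted in the post, not read from the tools.
    for tool, threshold in (("BBNorm", 140), ("Tadpole", 16)):
        flagged = would_be_flagged(snp_fraction, threshold)
        print(f"{tool}: threshold 1:{threshold}, "
              f"cutoff fraction {cutoff_fraction(threshold):.4f}, "
              f"2% SNP flagged as error: {flagged}")
```

Under this simplification the dividing line sits near 0.7% for the 1:140 threshold and near 6.25% for the 1:16 threshold. Real behavior also depends on coverage depth and on other settings, so treat these numbers as ballpark figures only.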