04-25-2017, 06:13 PM   #19
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Well, it's a very rough estimate, but...

If 70% of reads are unique, then assuming an even distribution, 30% of the start sites are taken, meaning there is one read start for every 1/0.3 ≈ 3.33 bases. For 150bp reads, that indicates coverage of 150bp/3.33 = 45x. But since read 1 and read 2 are tracked independently, I doubled it to 90x. Then, since errors artificially inflate the uniqueness calculated with this method, and given the %-perfect profile, I guessed I should increase it by ~10%, so I arrived at ~100x coverage, but possibly more if the reads were lower-quality than they seemed based on the mapq.
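The arithmetic above can be sketched as a small function. This is just a back-of-the-envelope illustration of the reasoning in this post, not part of any BBTools script; the function name, parameters, and the ~10% error adjustment default are mine.

```python
def estimate_coverage(pct_unique, read_length, paired=True, error_adjust=1.1):
    """Rough coverage estimate from the fraction of reads with novel start sites.

    Assumes start sites are evenly distributed across the genome.
    """
    taken = 1.0 - pct_unique            # fraction of start sites already occupied
    bases_per_read = 1.0 / taken        # one read start every 1/taken bases
    cov = read_length / bases_per_read  # i.e. read_length * taken
    if paired:
        cov *= 2                        # read 1 and read 2 are tracked independently
    return cov * error_adjust           # bump for error-inflated uniqueness

# 70% unique, 150bp paired reads: 45x -> 90x -> ~99x, i.e. roughly 100x
print(round(estimate_coverage(0.70, 150)))
# 70% unique, 100bp paired reads: ~66x
print(round(estimate_coverage(0.70, 100)))
```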

But those estimates were based on 150bp reads; for 100bp reads the estimate would have been 66x+, which is not too far off from 55x. I initially thought this was a metagenome because of the sharp decrease in uniqueness at the very beginning of the file, but perhaps you just have a highly repetitive genome, or lots of duplicate reads. Was this library PCR-amplified? And did you trim adapters and remove phiX (if you spiked it in) prior to running the program? Also, is this a Nextera library, or what method did you use for fragmentation? It's unusual for a PCR-free isolate to have such a sharp decrease in uniqueness at the beginning; that indicates some sequence is extremely abundant in the library. Notably, the drop is not present in the paired uniqueness, which is completely linear. I'm not entirely sure what this means.

At any rate, for an isolate (haploid or diploid), it looks like you've sequenced enough. Sometimes you can get a better assembly with more coverage, though, up to around 100x. And you certainly can't beat longer reads!

Last edited by Brian Bushnell; 04-25-2017 at 06:39 PM.