Old 02-12-2015, 05:44 PM   #9
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Hi Arash,

For each read, two kmers are examined: the first kmer, and a kmer from a random location. Each is looked up in a table to determine whether it has been seen before. There is a separate table for first kmers and for random kmers; if you are using paired reads, there are also separate tables for read 1 and read 2. If the kmer has not been seen before, the read is considered "unique" for that metric and the kmer is stored; otherwise the read is considered "non-unique".

Every 25000 reads (by default) a row is printed showing the unique rate. In cumulative mode (which I personally never use!) the numbers in a row apply to all reads so far, so you can never reach zero; in noncumulative mode, they apply only to the last 25000 reads, so you will reach 0% uniqueness as soon as you get a batch of 25000 consecutive reads that have all been seen before.
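As a rough sketch, the counting scheme above looks something like this toy Python re-implementation (not the actual BBTools code; the tiny interval, the kmer length, and all names are purely illustrative):

```python
import random

K = 25          # kmer length (the tool's default, per the description above)
INTERVAL = 6    # print a row every INTERVAL reads (25000 in the real tool)

def uniqueness_rows(reads, cumulative=False, seed=0):
    """Toy version of the counting scheme: track first kmers and
    randomly-located kmers in separate tables, and emit one
    (first%, rand%) row per INTERVAL reads."""
    rng = random.Random(seed)
    first_seen, rand_seen = set(), set()   # separate table per metric
    first_new = rand_new = n = 0
    rows = []
    for read in reads:
        n += 1
        first = read[:K]                              # first kmer
        i = rng.randrange(len(read) - K + 1)          # random location
        rnd = read[i:i + K]
        if first not in first_seen:
            first_new += 1
            first_seen.add(first)
        if rnd not in rand_seen:
            rand_new += 1
            rand_seen.add(rnd)
        if n % INTERVAL == 0:
            denom = n if cumulative else INTERVAL
            rows.append((100.0 * first_new / denom,
                         100.0 * rand_new / denom))
            if not cumulative:
                first_new = rand_new = 0   # reset per batch
    return rows
```

In noncumulative mode the counters reset each batch, which is why a batch of all-previously-seen reads yields a 0% row, while in cumulative mode the denominator keeps growing and zero is never reached.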

The "First" column is the percent of reads in which the first kmer has never been seen.
The "Rand" column is the percent of reads in which a randomly-selected kmer has never been seen.
The "Pair" column is the percent of pairs that have never been seen; it uses a hash combining a kmer from read 1 and a kmer from read 2, each at a fixed position chosen to minimize the error rate.
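The "pair" metric can be sketched like so (hypothetical Python, not the actual BBTools code; the real tool hashes the combination, while this sketch stores the kmer pair directly, and pos=10 is an arbitrary stand-in for the carefully chosen fixed position):

```python
def pair_key(read1, read2, k=25, pos=10):
    """Combine one fixed-position kmer from each mate into a single key.
    The position (pos=10 here, purely illustrative) would be chosen to
    sit in a low-error-rate part of the read."""
    return (read1[pos:pos + k], read2[pos:pos + k])

pair_seen = set()   # one table for the pair metric

def pair_is_unique(read1, read2):
    """True the first time this combination of mate kmers is observed."""
    key = pair_key(read1, read2)
    if key in pair_seen:
        return False
    pair_seen.add(key)
    return True
```

Because the key combines both mates, a pair counts as "seen" only when that specific read-1/read-2 combination has occurred before.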

I wrote this tool, and I like it, but I designed it largely to other people's specifications, so some of the defaults are a bit odd in my opinion - like the "rand" columns, which I typically ignore!

If you run in noncumulative mode, which I recommend, then you will gain no benefit from additional sequencing once the "pair" column approaches zero (for paired reads) or once the "first" column approaches zero (for single-ended reads). With paired reads, "first" will approach zero way before "pair", and once that happens, you are no longer generating unique reads, just reads that you have seen before but with new insert sizes. In general, there is no reason to sequence further once "first" approaches zero in non-cumulative mode!

However, this tool relies on high data quality. If you have low-quality data with substitution errors, or very short inserts such that adapters occur in the middle of the reads, the tool will overestimate uniqueness and never reach zero. For example, if 30% of your reads have an error in the first K bases (K=25 by default), then rather than asymptotically approaching 0% uniqueness, it will approach 30% uniqueness, because kmers containing errors will usually never have been seen before with that specific error. Mapping-based approaches do not have this problem. So, in practice, this program is ideal for high-quality data, but mapping is better for low-quality data. All the little spikes in the picture I posted above are due to batches of reads that, for whatever reason (like a bubble in the flow cell), had low quality; if the reads were all error-free, the line would be perfectly smooth.
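That error-driven plateau is easy to demonstrate with a toy simulation (illustrative only: it models each sequencing error as producing a never-before-seen kmer, which is a good approximation for k=25, and the genome size, read count, and function name are all made up):

```python
import random

def plateau_uniqueness(genome_len=200, k=25, err=0.30, n_reads=20000, seed=0):
    """Toy simulation: once the genome's kmers are saturated, the observed
    uniqueness plateaus at roughly the per-read error rate instead of 0%."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    seen, new, measured = set(), 0, 0
    for i in range(n_reads):
        start = rng.randrange(genome_len - k + 1)
        kmer = genome[start:start + k]
        if rng.random() < err:
            # Approximation: with k=25, a substitution almost always yields
            # a kmer never seen before, so model it as a unique token.
            kmer = f"err-{i}"
        if i >= n_reads // 2:       # measure only after saturation
            measured += 1
            if kmer not in seen:
                new += 1
        seen.add(kmer)
    return 100.0 * new / measured
```

With a 30% error rate the measured uniqueness settles near 30% rather than 0%, and with no errors it drops all the way to zero once every genomic kmer has been observed.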

In summary:

1) Don't use cumulative mode for determining how much to sequence; it's only for calculating the total number of unique reads in a dataset.
2) Ignore the rand column.
3) This tool only provides useful information for decent-quality data; for very low-quality data (either a high error rate, below about Q15, or very short insert sizes) you need to use mapping.
4) You don't need to sequence more once the "first" column approaches zero. How close it should get depends on your budget and needs; at 50% uniqueness, with even coverage and 100bp reads, you would have around 100x average coverage.

In some situations where the coverage distribution is exponential, like RNA-seq, single-cell, or metagenomes, you will NEVER reach zero.
