12-10-2014, 02:14 PM   #2
Brian Bushnell

I have a program for plotting library uniqueness as you go through the reads. The graphs look like the attached image (uniqueness.png).

It works by pulling kmers from each input read, testing whether each kmer has been seen before, and then storing it in a table.
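To make the test-then-store step concrete, here is a minimal Python sketch. It is not the program's actual code: the names kmer_at and is_new, the kmer length of 8, and the use of a plain Python set as the "table" are all assumptions for illustration.

```python
def kmer_at(read, pos=0, k=8):
    """Return the k bases starting at pos, or None if the read is too short.

    k=8 is only an illustrative value, not the program's setting.
    """
    kmer = read[pos:pos + k]
    return kmer if len(kmer) == k else None


def is_new(kmer, table):
    """Report whether kmer was unseen so far, then record it in the table."""
    unseen = kmer not in table
    table.add(kmer)
    return unseen


table = set()  # assumption: a plain set stands in for the kmer table
reads = ["ACGTACGTAC", "ACGTACGTAC", "TTGCATTGCA"]
flags = [is_new(kmer_at(r), table) for r in reads]
print(flags)  # second read repeats the first, so it is not new
```

The order matters: the lookup happens before the insert, so the first occurrence of a kmer counts as unique and every later occurrence does not.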

The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2).

The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines track other things, like read1- or read2-specific data, and random kmers versus the first kmer; they are generally safe to ignore.
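A small sketch of why "pair" sits above "first": a pair can be novel even when the first kmer of read 1 has already been seen. How the program actually builds the combined kmer is not stated here, so joining the first k bases of each mate (the pair_key helper below) is purely an assumption, as are the names and k=8.

```python
def first_key(read, k=8):
    """First k bases of a single read (illustrative k)."""
    return read[:k]


def pair_key(read1, read2, k=8):
    """Hypothetical combined key: first k bases of each mate, joined."""
    return read1[:k] + "|" + read2[:k]


# Two pairs that share read 1 but differ in read 2.
pairs = [
    ("ACGTACGTAC", "GGCATGCATT"),
    ("ACGTACGTAC", "CCTTAGGATC"),
]

seen_first, seen_pair = set(), set()
first_new, pair_new = [], []
for r1, r2 in pairs:
    fk = first_key(r1)
    pk = pair_key(r1, r2)
    first_new.append(fk not in seen_first)
    pair_new.append(pk not in seen_pair)
    seen_first.add(fk)
    seen_pair.add(pk)

print(first_new)  # the second pair's first kmer is a repeat
print(pair_new)   # but both pairs are novel as pairs
```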

It plots a point every X reads (configurable, default 25000).

In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful, because additional reads are no longer contributing anything new.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.
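The difference between the two modes is just the denominator and whether the counter resets. A rough Python sketch, assuming you already have one True/False "was new" flag per read; the function name uniqueness_points and its parameters are made up for illustration and are not the program's options.

```python
def uniqueness_points(new_flags, interval=25000, cumulative=False):
    """Return one percent-unique value per `interval` reads.

    noncumulative: percent new among the last `interval` reads
    cumulative:    percent new among all reads so far
    """
    points = []
    new_in_window = 0
    new_total = 0
    for i, flag in enumerate(new_flags, start=1):
        new_in_window += flag
        new_total += flag
        if i % interval == 0:
            if cumulative:
                points.append(100.0 * new_total / i)
            else:
                points.append(100.0 * new_in_window / interval)
            new_in_window = 0  # start the next window
    return points


# Tiny example: 8 reads, a point every 4 reads.
flags = [True, True, True, False, False, False, False, False]
windowed = uniqueness_points(flags, interval=4)
running = uniqueness_points(flags, interval=4, cumulative=True)
print(windowed, running)
```

In the example the windowed value drops to zero at the second point while the running value is still well above zero, which is why the noncumulative plot is the one to watch when deciding whether more sequencing will help.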

Sample command line: in=reads.fq out=histogram.txt

Note that the lines are not perfectly smooth; the little peaks are caused by high-error tiles, since sequencing errors produce kmers that have not been seen before. But it's still useful in that it allows assessment of a library that lacks a reference.
Attached Images
File Type: png uniqueness.png (31.6 KB, 356 views)

Last edited by Brian Bushnell; 12-10-2014 at 05:02 PM.