**nucacidhunter** · 08-18-2014, 11:17 AM

I wonder what the read duplication rate and the number of reads are.

**Brian Bushnell** · 08-18-2014, 03:50 PM

The duplication rate appears very low (considering it's only a ~3 Mbp organism). Here's a plot (attached) of read uniqueness for the first 10M read pairs (out of 124M total pairs).

To interpret this: each read is examined for its first 31-mer and a random 31-mer. These are added to a hashtable. If they were already present, the read is considered non-unique; otherwise, it is considered unique. Errors will inflate the apparent uniqueness. The cumulative ratio of unique vs non-unique reads is reported every 25k reads. The more nonuniform the library, the faster the value drops. There are multiple lines because I track "first" and "random" separately, and I also track read 1 and read 2 both separately and combined.
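A minimal sketch of the tracking described above, not the actual tool's code. The function name `uniqueness_curve`, the use of two separate Python sets for the "first" and "random" k-mers, and the fixed random seed are all assumptions made for illustration; reverse-complement handling and the read 1 / read 2 split are left out.

```python
import random

K = 31  # k-mer length mentioned in the post


def uniqueness_curve(reads, k=K, interval=25_000, seed=0):
    """Return (reads_seen, pct_unique_first, pct_unique_random) tuples.

    Hypothetical helper: a read counts as unique for a given k-mer type
    if that k-mer has not been seen before in any earlier read.
    """
    rng = random.Random(seed)
    seen_first, seen_random = set(), set()
    unique_first = unique_random = 0
    processed = 0
    curve = []
    for seq in reads:
        if len(seq) < k:
            continue  # too short to contain a k-mer
        processed += 1
        first = seq[:k]
        start = rng.randrange(len(seq) - k + 1)
        rand = seq[start:start + k]
        if first not in seen_first:
            seen_first.add(first)
            unique_first += 1
        if rand not in seen_random:
            seen_random.add(rand)
            unique_random += 1
        if processed % interval == 0:
            # Cumulative percentages over all reads processed so far
            curve.append((processed,
                          100.0 * unique_first / processed,
                          100.0 * unique_random / processed))
    return curve
```

Because the percentages are cumulative, a library in which start positions are reused heavily shows the "first" curve falling quickly, while sequencing errors create novel k-mers and hold the curve up.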

The waviness here is probably due to some problem with the optics, correlating with individual image frames.

Attached Files

NextSeq_Uniqueness.png (37.6 KB, 116 views)

**nucacidhunter** · 08-18-2014, 04:25 PM

I would suggest first checking for sequencer faults, which the person running the machine should be able to do. If that is ruled out as a possible cause, I would look next at the library prep and its diversity. The waviness in base frequency looks similar to what I have seen with low-diversity mate pair libraries, where a library with fewer than 10M unique fragments was sequenced to hundreds of millions of reads (though the frequency was larger than 3), and also with low-diversity amplicon libraries. Out of curiosity, how could the duplication rate be low? In a 3 Mb genome it is only possible to obtain 3M unique fragments (at least in this case, in the initial 100 bp). If this library is sequenced to a depth of 124M reads, there would be a high level of duplication.

**Brian Bushnell** · 08-18-2014, 08:48 PM

Originally posted by nucacidhunter

Out of curiosity, how could the duplication rate be low? In a 3 Mb genome it is only possible to obtain 3M unique fragments (at least in this case, in the initial 100 bp). If this library is sequenced to a depth of 124M reads, there would be a high level of duplication.

So, this is a 2x151bp library; as expected, after 10M read pairs, the fraction of read 1 with a unique first 31-mer drops to around 35%. This is consistent with high uniqueness: if every starting location on the genome were used, you could only get up to around 31% uniqueness (the genome is actually about 3.09 Mbp, and 3.09M possible start positions divided by 10M reads is about 31%). The fact that some reads have errors pushes it higher, to 35%, but it's still good.
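The ~31% ceiling can be reproduced with a one-line calculation. This is only a sketch of the arithmetic in the post: the function name `max_unique_fraction` is hypothetical, and it assumes error-free reads and counts each genome position as one possible start, as the post's figure does.

```python
def max_unique_fraction(genome_size_bp, num_reads):
    """Upper bound on the fraction of reads whose start position is new.

    Assumes error-free reads and one possible start per genome position
    (the simplification used in the post).
    """
    return min(1.0, genome_size_bp / num_reads)


# 3.09 Mbp genome, first 10M read 1 sequences
print(f"{max_unique_fraction(3_090_000, 10_000_000):.1%}")  # prints 30.9%
```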

But there's also pair uniqueness, for which I use a hash of the middle 31-mer in read 1 and read 2. This represents the fraction of read pairs with a unique start+stop combination, and thus is a much better measure of library duplication rate. By that metric, 99% of the first 10 million read pairs are unique, which indicates the library has a very low duplication rate. Though certainly, if I extended the graph all the way to 124 million pairs, I would expect that to drop a bit.
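A rough sketch of the pair-uniqueness idea, under stated assumptions: the names `middle_kmer` and `pair_uniqueness` are hypothetical, and a tuple of the two middle 31-mers stored in a Python set stands in for whatever hash the real implementation uses.

```python
def middle_kmer(seq, k=31):
    """Return the k-mer centered in seq (hypothetical helper)."""
    start = (len(seq) - k) // 2
    return seq[start:start + k]


def pair_uniqueness(pairs, k=31):
    """Fraction of read pairs whose (read 1, read 2) middle k-mers are new."""
    seen = set()
    unique = total = 0
    for r1, r2 in pairs:
        if len(r1) < k or len(r2) < k:
            continue  # skip pairs too short to yield a k-mer
        total += 1
        key = (middle_kmer(r1, k), middle_kmer(r2, k))
        if key not in seen:
            seen.add(key)
            unique += 1
    return unique / total if total else 0.0
```

Keying on both mates means two pairs that share a read 1 position but differ in read 2 still count as distinct, which is why this tracks fragment duplication more closely than a single-read k-mer does.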

Sawtooth base frequency, wavy insert size histograms.

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News