
**Brian Bushnell** · 05-25-2014, 08:31 PM

It would seem that the answer is 200,000 unique reads, but I don't know for sure.

You could download BBTools and run "dedupe.sh" on the dataset to get an exact number:

dedupe.sh in=reads.fq out=clean.fq

It works on single-ended or paired reads; for paired reads, it only declares a pair a duplicate if both reads match. By default it looks only for exact matches, but it can optionally allow a set number of substitutions or edits.
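To make "exact number of unique reads" concrete, here is a minimal Python sketch of exact-match counting on single-ended reads. It is not how Dedupe is implemented, just an illustration of the idea; the function name `count_unique_reads` and the tiny example reads are made up, and it assumes a plain 4-line-per-record FASTQ that fits in memory as a set of sequences.

```python
import io

def count_unique_reads(fastq_handle):
    """Return (total_reads, unique_sequences) for a 4-line-per-record FASTQ stream.

    Hypothetical helper for illustration: exact string matches only,
    single-ended reads only, no reverse-complement handling.
    """
    seen = set()
    total = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            seen.add(line.strip())
            total += 1
    return total, len(seen)

# Made-up example: three reads, two of them identical.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
total, unique = count_unique_reads(example)
print(total, unique)  # 3 2
```

For a real file you would pass `open("reads.fq")` instead of the in-memory example. For large datasets a dedicated tool is the more practical choice.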

**mastal** · 05-26-2014, 03:45 AM

Interpreting the FastQC duplication level plots is actually quite complicated.

See Simon Andrews' blog posts about this:

A new way to look at duplication in FastQC v0.11:
http://proteo.me.uk/2013/09/a-new-way-to-look-at-duplication-in-fastqc-v0-11/

Interpreting the duplicate sequence plot in FastQC:
http://proteo.me.uk/2011/05/interpreting-the-duplicate-sequence-plot-in-fastqc/

**nucacidhunter** · 05-26-2014, 10:11 PM

Thanks Brian and mastal for your comments and suggestions. I am only interested in knowing the number of unique reads from sequencing my libraries, and I am reluctant to use another tool since FastQC already gives me this along with other useful information. Could someone comment on whether a formula like "(1-Sequence Duplication level%) x total number of reads= # of unique reads" will give the correct answer, or whether some coefficients should be factored into the formula?

**Brian Bushnell** · 05-27-2014, 08:50 AM

What I meant was, you can just run dedupe once on one dataset to confirm that the formula "(1-Sequence Duplication level%) x total number of reads= # of unique reads" is correct, or possibly derive a different formula, then go back to using FastQC. Dedupe prints the exact number (and percent) of duplicates.
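The check described above can be sketched in Python. This only encodes the formula exactly as quoted in the thread and compares it with an exact count from another tool; whether the formula is right for a given FastQC version is the open question here, so treat the result as an estimate. The function names and all numbers below are hypothetical.

```python
def estimated_unique_reads(total_reads, duplication_percent):
    """Apply the thread's formula: (1 - duplication level) x total reads.

    duplication_percent is the FastQC figure given as a percentage (0-100).
    """
    if not 0 <= duplication_percent <= 100:
        raise ValueError("duplication_percent must be between 0 and 100")
    return round(total_reads * (100 - duplication_percent) / 100)

def relative_error(estimate, exact):
    """Fractional difference between the estimate and an exact count."""
    return abs(estimate - exact) / exact

# Hypothetical numbers: 1,000,000 reads, 80% duplication reported by FastQC,
# and 195,000 unique reads reported by an exact deduplication run.
estimate = estimated_unique_reads(1_000_000, 80)
print(estimate)  # 200000
print(relative_error(estimate, 195_000))
```

If the relative error is small on one dataset, the formula is probably good enough for similar libraries; if not, the exact count gives a basis for working out a correction.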


How to calculate number of unique reads from FastQC "sequence duplication level"
