SEQanswers

Old 05-25-2014, 07:25 PM   #1
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,232
How to calculate number of unique reads from FastQC "sequence duplication level"

I am wondering if there is a formula to estimate the number of unique reads from the FastQC Sequence Duplication Level output. For instance, in a set of 1M reads with an 80% sequence duplication level, what would be the estimated minimum and maximum number of unique reads?
Old 05-25-2014, 08:31 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Taken at face value, the answer would be 200,000 unique reads (20% of 1M), but I don't know for sure.

You could download BBTools and run "dedupe.sh" on the dataset to get an exact number:

dedupe.sh in=reads.fq out=clean.fq

It works on single-end or paired reads; for paired reads it only declares them duplicates if both reads match. It also supports a variable number of edits or substitutions if you want, though by default it just looks for exact matches.
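For readers who want to see what "exact match" counting means, here is a minimal Python sketch. It is not how dedupe.sh is implemented; it is only an illustration that assumes single-end reads in plain 4-line FASTQ records (no wrapped sequence lines) and a dataset small enough to hold every distinct sequence in memory. The function name `count_unique_reads` is made up for this example.

```python
from collections import Counter
import io


def count_unique_reads(fastq_handle):
    """Return (total_reads, distinct_sequences) for a single-end FASTQ stream.

    Assumes strict 4-line records: header, sequence, '+', quality.
    Only exact sequence matches count as duplicates.
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # second line of each record is the sequence
            counts[line.strip()] += 1
    total = sum(counts.values())
    return total, len(counts)


# Tiny in-memory example: three reads, two of them identical.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
total, distinct = count_unique_reads(example)
print(total, distinct)  # 3 reads in total, 2 distinct sequences
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("reads.fq")`.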
Old 05-26-2014, 03:45 AM   #3
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

Interpreting the FastQC duplication level plots is actually quite complicated.

See Simon Andrews' blog posts about this:

http://proteo.me.uk/2013/09/a-new-wa...-fastqc-v0-11/

http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/
Old 05-26-2014, 10:11 PM   #4
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,232

Thanks Brian and mastal for your comments and suggestions. I am only interested in knowing the number of unique reads from sequencing my libraries, and I am reluctant to use another tool since FastQC already gives me this along with other useful information. I wonder if someone could comment on whether a formula like this one: "(1 - Sequence Duplication Level %) x total number of reads = # of unique reads" will give the correct answer, or whether some coefficients should be factored into the formula.
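The arithmetic of the proposed formula can be written out as a short Python sketch. It only evaluates the formula as quoted; whether FastQC's reported percentage actually fits this definition is the open question in this thread. The function name `estimate_unique_reads` is made up for this example.

```python
def estimate_unique_reads(total_reads, duplication_percent):
    """Evaluate (1 - duplication level) x total reads.

    duplication_percent is given as a percentage, e.g. 80 for 80%.
    This is the naive formula from the post, not a validated FastQC rule.
    """
    if not 0 <= duplication_percent <= 100:
        raise ValueError("duplication_percent must be between 0 and 100")
    # Work in percent first to avoid floating-point noise from 1 - 0.8.
    return round(total_reads * (100 - duplication_percent) / 100)


# The example from the first post: 1M reads at 80% duplication.
print(estimate_unique_reads(1_000_000, 80))  # 200000
```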
Old 05-27-2014, 08:50 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

What I meant was, you can run Dedupe once on one dataset to confirm that the formula "(1 - Sequence Duplication Level %) x total number of reads = # of unique reads" is correct, or possibly derive a different formula, then go back to using FastQC. Dedupe prints the exact number (and percent) of duplicates.