SEQanswers

Old 12-13-2011, 06:32 AM   #1
gcarbajosa
Junior Member
 
Location: London

Join Date: May 2009
Posts: 4
Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data

Hi all,

I am working with a BS-Seq dataset and I came across this result that puzzles me a bit.

I ran fastqc on the fastq files first and got an estimated duplication level of 36.83% (fastqc plot attached).

Afterwards, I mapped the data using Bismark. Here's the mapping report:

Number of paired-end alignments with a unique best hit: 165375035
Mapping efficiency: 71.3%
Sequences with no alignments under any condition: 52756927
Sequences did not map uniquely: 13328411

The number of sequences that did not map uniquely is less than 10% of the number of uniquely mapped sequences.
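The ratio can be checked directly from the two counts in the report above. A quick sketch in Python (the variable names are only illustrative):

```python
# Counts copied from the Bismark mapping report quoted above.
unique_best_hits = 165375035   # paired-end alignments with a unique best hit
not_unique = 13328411          # sequences that did not map uniquely

# Non-uniquely mapping sequences relative to uniquely mapped ones.
ratio = not_unique / unique_best_hits
print(f"{ratio:.2%}")  # roughly 8%
```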

So I can only think of two possibilities here:

1- Our dataset really contains a high level of polyclonality (in which case we'll have to worry about it and improve the protocol we use to prepare the BS-Seq library). This would imply that >20% of the duplicate reads are not mapped at all, which would explain the difference in duplication levels between fastqc and bismark. Have any bismark users come across something like this before?

2- Could it be that there is something about the way fastqc estimates the duplication levels that artificially boosts the number of duplicates in our dataset? I'm not really sure about this, because I have used fastqc in the past and it always seemed to work really well, but I wonder if there is something about bisulfite-converted reads that could cause this behaviour.

Thanks a lot in advance for your answers!
Attached Images
File Type: png duplication_levels.png (15.6 KB, 33 views)
Old 12-13-2011, 07:37 AM   #2
gcarbajosa
Junior Member
 
Location: London

Join Date: May 2009
Posts: 4

Something more about this. Going through the SEQanswers post related to fastqc I've found a link to this page:

http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/

where Simon Andrews mentions that fastqc only uses the first 50bp of each sequence to search for duplicates. I guess that since the reads in my dataset are 100bp long, the duplication levels can be boosted by only considering the first 50bp when looking for identical reads. So now I'm thinking that the correct answer is the 2nd possibility.
Old 12-13-2011, 08:43 AM   #3
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

Hi gcarbajosa,

As you mentioned, FastQC determines an approximate level of sequence duplication by storing the first 50bp of the first 200,000 different sequences it encounters in a sequencing file. These duplicated sequences may, for example, be adapter contamination (which would not map at all in Bismark), but could also be duplicate reads that were amplified by PCR during library construction. Such reads might align perfectly well and uniquely to the genome even though they are technical duplicates.
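To make the idea concrete, here is a minimal Python sketch of a prefix-based duplication estimate. It is not FastQC's actual code: the function name, the way the percentage is defined, and the toy reads are all assumptions made for illustration. Only the 50bp prefix and the limit of 200,000 tracked sequences come from the description above.

```python
from collections import Counter


def estimate_duplication(reads, prefix_len=50, max_tracked=200_000):
    """Percent of counted reads that repeat an already tracked prefix.

    Hypothetical helper: only the first `prefix_len` bases are compared,
    and only the first `max_tracked` distinct prefixes are followed.
    """
    counts = Counter()
    for read in reads:
        key = read[:prefix_len]
        if key in counts:
            counts[key] += 1
        elif len(counts) < max_tracked:
            counts[key] = 1
        # Prefixes first seen after the tracking limit are ignored.
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total


# Two made-up 100bp reads that share their first 50bp but differ afterwards.
reads = ["A" * 50 + "C" * 50, "A" * 50 + "G" * 50]
truncated = estimate_duplication(reads, prefix_len=50)
full_length = estimate_duplication(reads, prefix_len=100)
```

With these toy reads the truncated comparison treats both reads as the same sequence, while the full-length comparison sees two distinct reads, which is the effect discussed for 100bp data.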

So essentially, the number of reads mapping non-uniquely (which are discarded) and the number of duplicated reads are not the same thing, and Bismark does not specifically output anything regarding duplication levels. I hope this helps?
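As a rough illustration of the distinction, duplicates among the uniquely aligned reads could be counted by mapping position. The sketch below is plain Python on made-up alignment tuples, not something Bismark reports: the function name and the (chromosome, start, strand) representation are assumptions, and for paired-end data one would presumably key on both mates.

```python
from collections import Counter


def positional_duplicate_fraction(alignments):
    """Fraction of uniquely mapped reads sharing a position with an earlier read.

    `alignments` is an iterable of hypothetical (chromosome, start, strand)
    tuples, one per uniquely aligned read.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


# Made-up example: every read maps uniquely, yet two share the same position.
unique_hits = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 75, "+"),
]
fraction = positional_duplicate_fraction(unique_hits)
```

In this toy input none of the reads is a non-unique mapper, but one of the four is still a positional duplicate.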