  • Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data

    Hi all,

    I am working with a BS-Seq dataset and I came across this result that puzzles me a bit.

    I ran fastqc on the fastq files first and got an estimated duplication level of 36.83% (fastqc plot attached).

    Afterwards, I mapped the data using Bismark: Here's the mapping report:

    Number of paired-end alignments with a unique best hit: 165375035
    Mapping efficiency: 71.3%
    Sequences with no alignments under any condition: 52756927
    Sequences did not map uniquely: 13328411

    The number of sequences that did not map uniquely is less than 10% of the number of uniquely mapped sequences.
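As a quick check of the arithmetic, the counts from the report above can be compared directly. This is only a sketch: the variable names are made up for illustration, and the total is assumed to be the sum of the three reported categories, which may differ slightly from the total Bismark uses for its 71.3% mapping efficiency.

```python
# Counts copied from the Bismark mapping report quoted above.
unique_best_hit = 165_375_035   # paired-end alignments with a unique best hit
no_alignment = 52_756_927       # sequences with no alignments under any condition
not_unique = 13_328_411         # sequences that did not map uniquely

# Non-uniquely mapped sequences relative to uniquely mapped ones.
ratio_pct = 100 * not_unique / unique_best_hit

# ASSUMPTION: total = sum of the three categories listed in the report.
total = unique_best_hit + no_alignment + not_unique
unaligned_pct = 100 * no_alignment / total

print(f"non-unique / unique: {ratio_pct:.2f}%")
print(f"no alignment / total: {unaligned_pct:.2f}%")
```

Under that assumption, the non-unique sequences come to roughly 8% of the uniquely mapped ones, while a bit under a quarter of all sequences did not align at all.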

    So I can only think of two possibilities here:

    1- Our dataset really contains a high level of polyclonality (therefore we'll have to worry about it and improve the protocol we use to prepare the BS-Seq library). This would imply that >20% of the duplicate reads are not mapped at all explaining the difference in duplication levels between fastqc and bismark. Have any bismark users come across something like this before?

    2- Could it be that there is something about the way fastqc estimates the duplication levels that artificially boosts the number of duplicates in our dataset? I'm not really sure about this, because I have used fastqc in the past and it always seemed to work really well, but I wonder if there is something about bisulfite-converted reads that could cause this behaviour.

    Thanks a lot in advance for your answers!

  • #2
    Something more about this. Going through the SEQanswers post related to fastqc I've found a link to this page:



    where Simon Andrews mentions that fastqc only uses the first 50bp of each sequence to search for duplicates. I guess that since the reads in my dataset are 100bp long, the duplication levels can be boosted by only considering the first 50bp when looking for identical reads. So now I'm thinking that the correct answer is the 2nd possibility.
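To illustrate the effect described in this post, here is a minimal sketch with toy reads. It is not FastQC's actual algorithm: `duplicate_fraction` is a hypothetical helper that simply counts reads whose key (the full read, or only its first `prefix_len` bases) has already been seen.

```python
from collections import Counter

def duplicate_fraction(reads, prefix_len=None):
    """Fraction of reads that repeat an earlier read.

    If prefix_len is given, only the first prefix_len bases are compared.
    Simplified illustration only, not FastQC's real estimator.
    """
    keys = [r[:prefix_len] if prefix_len else r for r in reads]
    counts = Counter(keys)
    # Every read beyond the first copy of each distinct key counts as a duplicate.
    duplicates = len(keys) - len(counts)
    return duplicates / len(keys)

# Toy 100bp reads: three share their first 52 bases but differ afterwards.
shared_prefix = "ACGT" * 13
reads = [
    shared_prefix + "A" * 48,
    shared_prefix + "C" * 48,
    shared_prefix + "G" * 48,
    "TTGA" * 25,
]

full_length = duplicate_fraction(reads)        # compare whole reads
truncated = duplicate_fraction(reads, 50)      # compare first 50bp only
print(full_length, truncated)
```

With these made-up reads, no duplicates are found over the full length, but half of the reads count as duplicates once only the first 50bp are compared. Whether this accounts for the 36.83% seen in the real data is a separate question.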



    • #3
      Hi gcarbajosa,

      As you mentioned, FastQC determines an approximate level of sequence duplication by storing the first 50bp of the first 200,000 different sequences it encounters in a sequencing file. These duplicated sequences may, for example, be adapter contamination (which would not map at all in Bismark), but they could also be duplicate reads that were amplified by PCR during library construction. Such reads might align perfectly well and uniquely to the genome even though they are technical duplicates.

      So essentially, the number of reads mapping non-uniquely (which are discarded) and the number of duplicated reads are not the same thing, and Bismark does not specifically output anything regarding duplication levels. I hope this helps.
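The distinction can be sketched with a toy example. The `Alignment` record and the `summarise` helper below are hypothetical, made up for illustration, and do not reflect Bismark's output format. The point is only that a read can map uniquely and still be a duplicate of another read at the same position, while a non-uniquely mapping read is counted in a different category altogether.

```python
from collections import Counter
from typing import NamedTuple

class Alignment(NamedTuple):
    # Hypothetical record, NOT Bismark's output format.
    read_id: str
    chrom: str
    start: int
    strand: str
    n_best_hits: int  # number of equally good alignment positions

def summarise(alignments):
    """Count uniquely mapped, non-uniquely mapped and positionally duplicated reads."""
    unique = [a for a in alignments if a.n_best_hits == 1]
    non_unique = len(alignments) - len(unique)
    # Among uniquely mapped reads, count extra copies at the same position and strand.
    position_counts = Counter((a.chrom, a.start, a.strand) for a in unique)
    positional_duplicates = sum(n - 1 for n in position_counts.values())
    return {
        "unique": len(unique),
        "non_unique": non_unique,
        "positional_duplicates": positional_duplicates,
    }

toy = [
    Alignment("r1", "chr1", 100, "+", 1),
    Alignment("r2", "chr1", 100, "+", 1),  # same position as r1
    Alignment("r3", "chr1", 100, "+", 1),  # same position as r1
    Alignment("r4", "chr2", 500, "-", 1),
    Alignment("r5", "chr3", 42, "+", 4),   # maps to several places
]
summary = summarise(toy)
print(summary)
```

In this toy set, four reads map uniquely, yet two of them are extra copies at one position. The single non-uniquely mapping read says nothing about duplication.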
