  • Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data

    Hi all,

    I am working with a BS-Seq dataset and I came across this result that puzzles me a bit.

    I ran fastqc on the fastq files first and got an estimated duplication level of 36.83% (fastqc plot attached).

    Afterwards, I mapped the data using Bismark: Here's the mapping report:

    Number of paired-end alignments with a unique best hit: 165375035
    Mapping efficiency: 71.3%
    Sequences with no alignments under any condition: 52756927
    Sequences did not map uniquely: 13328411

    The number of sequences that did not map uniquely is less than 10% of the number of mapped sequences.
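As a quick sanity check on the figures quoted from the report, the ratio can be computed directly. This is only a minimal sketch: the variable names are made up for illustration, and treating the sum of the three quoted lines as the total input is an assumption, since Bismark's own total may include categories not quoted here.

```python
# Counts copied from the Bismark mapping report quoted above.
unique_best_hit = 165_375_035   # paired-end alignments with a unique best hit
no_alignment = 52_756_927       # no alignments under any condition
not_unique = 13_328_411         # did not map uniquely

# Non-uniquely mapped pairs relative to uniquely mapped pairs.
ratio = not_unique / unique_best_hit

# Assumption: the three quoted lines approximate the total input.
listed_total = unique_best_hit + no_alignment + not_unique
unmapped_share = no_alignment / listed_total

print(f"non-unique / unique: {ratio:.1%}")
print(f"unmapped share of listed pairs: {unmapped_share:.1%}")
```

The first figure comes out at roughly 8%, consistent with the "less than 10%" statement.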

    So I can only think of two possibilities here:

    1- Our dataset really contains a high level of polyclonality (therefore we'll have to worry about it and improve the protocol we use to prepare the BS-Seq library). This would imply that >20% of the duplicate reads are not mapped at all explaining the difference in duplication levels between fastqc and bismark. Have any bismark users come across something like this before?

    2- Could it be that something about the way fastqc estimates duplication levels artificially boosts the number of duplicates in our dataset? I'm not really sure about this, because I have used fastqc in the past and it always seemed to work really well, but I wonder if there is something about bisulfite converted reads that could cause this behaviour.

    Thanks a lot in advance for your answers!

  • #2
    Something more about this. Going through the SEQanswers post related to fastqc I've found a link to this page:



    where Simon Andrews mentions that fastqc only uses the first 50bp of each sequence to search for duplicates. I guess that, since the reads in my dataset are 100bp long, the duplication levels can be boosted by only considering the first 50bp when looking for identical reads. So now I'm thinking that the correct answer is the 2nd possibility.



    • #3
      Hi gcarbajosa,

      As you mentioned, FastQC determines an approximate level of sequence duplication by storing the first 50bp of the first 200,000 different sequences it encounters in a sequencing file. These duplicated sequences may, for example, be adapter contamination (which would not map at all in Bismark), but could also be duplicate reads that were amplified by PCR during the library construction. These reads might align perfectly well and uniquely to the genome even though they might be technical duplicates.
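To make the idea concrete, here is a rough sketch of a prefix-based duplication estimate along the lines described above. It is not FastQC's actual code: the function name `estimate_duplication`, its parameters, and the way the percentage is computed are assumptions made for illustration only.

```python
def estimate_duplication(reads, prefix_len=50, max_tracked=200_000):
    """Percent of counted reads that repeat an already tracked prefix.

    Hypothetical simplification: only the first `prefix_len` bases are
    compared, and only the first `max_tracked` distinct prefixes are tracked.
    """
    counts = {}
    for read in reads:
        key = read[:prefix_len]
        if key in counts:
            counts[key] += 1
        elif len(counts) < max_tracked:
            counts[key] = 1
        # New prefixes seen after the cap is reached are ignored.
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total


# Three 100bp reads: the first two share their first 50 bases only.
reads = ["A" * 50 + "C" * 50, "A" * 50 + "G" * 50, "T" * 100]
dup_50 = estimate_duplication(reads, prefix_len=50)
dup_100 = estimate_duplication(reads, prefix_len=100)
print(dup_50, dup_100)
```

In this toy input the first two reads count as duplicates when only 50 bases are compared, but not when the full 100 bases are used, which is the effect discussed in post #2.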

      So essentially, the number of reads mapping non-uniquely (which are discarded) and the number of duplicated reads are not the same thing, and Bismark does not specifically output anything regarding duplication levels. I hope this helps.
