  • Some basic questions about duplicate removal?

    Hi all,

    I am using the GATK pipeline to preprocess BAM files after alignment with bwa mem. According to samtools flagstat, the original BAM file after alignment has 173,460,757 reads (this is deep-sequencing exome data captured with Agilent SureSelect 50 Mb).

    But after removing duplicates with Picard, I am left with only 14,651,238 reads! That's a mere 20X coverage or so.

    1. Is it normal in exome sequencing to find such a huge number of duplicates? Some threads on other forums say it's not wise to remove duplicates from deep sequencing data. Can anyone offer suggestions on how you proceed in such a scenario?

    2. What is the difference between marking duplicates and removing duplicates? I know marking sets a flag on the read instead of removing it completely. But if duplicate-marked reads are not used in any downstream step (like SNP calling), why is it suggested to mark them rather than remove them?

    3. When calculating coverage, should duplicate reads be counted (the original BAM), or should we use the final BAM file with duplicates removed?

    Thank you.

  • #2
    If you had low library complexity due to insufficient DNA, overamplification, contamination, or highly-biased capture, a lot of duplicates will be present. Sounds like there were problems with your library prep and maybe it should be redone; that level is much higher than I'd expect. But, just run pileup and see if there is enough coverage for whatever you're doing, which depends on the fraction of the area covered to at least X depth rather than the average coverage.
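To illustrate the point about breadth versus average coverage, here is a minimal Python sketch. It assumes you already have per-base depths over the target region as a plain list of integers (for example, parsed from a pileup or depth table); the function names and the toy numbers are hypothetical and only for illustration.

```python
from typing import Sequence


def mean_depth(depths: Sequence[int]) -> float:
    """Average depth across all target positions."""
    return sum(depths) / len(depths) if depths else 0.0


def breadth_at_depth(depths: Sequence[int], min_depth: int) -> float:
    """Fraction of target positions covered to at least min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy data (made up): the mean looks acceptable, but most positions are shallow.
toy_depths = [0, 0, 5, 40, 55]
print("mean depth:", mean_depth(toy_depths))
print("fraction at >= 20X:", breadth_at_depth(toy_depths, 20))
```

In the toy data the mean is 20X, yet only two of the five positions reach 20X, which is why the fraction covered to at least X depth is the more useful number.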

    PCR duplicates should be removed before calling variants. But I would suggest removing only exact duplicate reads, rather than everything that maps to the same location even when the base calls differ. And just to clarify, are these paired reads that you're removing based on both reads mapping to the same location?
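The difference between the two criteria can be sketched as follows. This is a simplified illustration and not how any particular tool decides duplicates: the `Read` record and the `dedup` helper are hypothetical, single-end only, and keep the first read seen in each group.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    chrom: str
    pos: int
    strand: str
    seq: str


def dedup(reads: Iterable[Read], exact: bool) -> List[Read]:
    """Keep the first read per group.

    exact=True  groups by position, strand AND sequence (exact duplicates only).
    exact=False groups by position and strand alone (positional duplicates).
    """
    seen = set()
    kept = []
    for r in reads:
        key = r if exact else (r.chrom, r.pos, r.strand)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


reads = [
    Read("chr1", 100, "+", "ACGT"),
    Read("chr1", 100, "+", "ACGT"),  # identical copy
    Read("chr1", 100, "+", "ACGA"),  # same position, one base differs
    Read("chr1", 200, "-", "TTTT"),
]
print(len(dedup(reads, exact=True)), "reads kept with exact matching")
print(len(dedup(reads, exact=False)), "reads kept with positional matching")
```

With exact matching the read that differs by one base survives; with positional matching it is discarded along with the identical copy.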

    Removing duplicates rather than marking them is more efficient, as downstream programs don't need to process as much data. But you can use marked duplicates to generate a consensus if you want, when reads are low quality.
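For reference, "marked" means the duplicate bit (0x400, decimal 1024) is set in the read's SAM FLAG field, so the read stays in the file and a downstream tool can choose to skip it. A minimal sketch of checking that bit, assuming the flags have already been read into integers (the helper names are hypothetical):

```python
from typing import Iterable, Tuple

DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def is_duplicate(flag: int) -> bool:
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def count_unique_and_duplicate(flags: Iterable[int]) -> Tuple[int, int]:
    """Return (unique, duplicate) read counts for a collection of FLAG values."""
    unique = duplicate = 0
    for flag in flags:
        if is_duplicate(flag):
            duplicate += 1
        else:
            unique += 1
    return unique, duplicate


# Example FLAG values: 99 and 147 are a typical proper pair;
# 1123 and 1171 are the same flags with the duplicate bit added.
print(count_unique_and_duplicate([99, 1123, 147, 1171]))
```

Because the bit is just metadata, a marked file can give both the total and the unique counts, whereas a file with duplicates removed can only give the latter.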

    I use the unique (deduplicated) coverage when calling variants, as it's more relevant.
