  • fongchun
    Member
    • May 2011
    • 55

    Ignore PCR Duplicates in CollectTargetedPcrMetrics

    Hi all,

    I have a set of amplicon target sequencing libraries and I am trying to generate some basic metrics (e.g. number of reads, coverage of amplicons, etc.) using the Picard Tools CollectTargetedPcrMetrics.jar. What I've noticed is that the tool ignores reads which have been marked as duplicates when it calculates statistics such as the coverage of an amplicon. Is there a way to get CollectTargetedPcrMetrics to ignore the duplicate marking, so that reads flagged as PCR duplicates are still counted? I wasn't able to find any parameters for this ...

    The libraries have been aligned with bwa and then had duplicates marked. I suppose one option could be to not mark duplicates, but that seems like a roundabout way of solving this.

    Has anyone else encountered this question before? Also, while we are on the topic, what is the difference between the amplicon interval list and the target interval list? The reason I ask is that I passed my interval list file to the PER_TARGET_COVERAGE parameter and ended up getting negative mean coverage, which doesn't make sense to me (I didn't mark duplicates for this one). Has anyone encountered this?

    Thanks,
    Last edited by fongchun; 02-16-2013, 11:31 AM. Reason: Added an additional question.
  • Bukowski
    Senior Member
    • Jan 2010
    • 388

    #2
    Why are you marking duplicates in an amplicon based assay? I'm just curious..


    • fongchun
      Member
      • May 2011
      • 55

      #3
      Originally posted by Bukowski
      Why are you marking duplicates in an amplicon based assay? I'm just curious..
      It's just part of the standard pipeline that we use to analyse all sequencing libraries. Like I mentioned, we could build a separate analysis pipeline just for this, but it seems odd that there isn't simply a parameter to ignore PCR duplicates....


      • Bukowski
        Senior Member
        • Jan 2010
        • 388

        #4
        Originally posted by fongchun
        It's just part of the standard pipeline that we use to analyse all sequencing libraries. Like I mentioned, we could build a separate analysis pipeline just for this, but it seems odd that there isn't simply a parameter to ignore PCR duplicates....
        Marking duplicates when there is PCR involved seems counter-intuitive. Won't that lead to lots of reads with the same start and stop positions, which will appear to be duplicated? If that's part of the design, why remove them?

        Edit:

        I should clarify this. I do a lot of in-solution capture analysis, and I de-duplicate the data if I'm using (for instance) SureSelect. But if the experiment is HaloPlex I don't - because de-duplicating the data removes data that is there because of the design - it's unavoidable to have data that matches the characteristics of 'duplicates'.
        Last edited by Bukowski; 02-16-2013, 02:58 PM.


        • fongchun
          Member
          • May 2011
          • 55

          #5
          I probably wasn't clear about what I actually want to do. I agree with you that we expect a lot of PCR duplicates, and yes, it is counter-intuitive to remove them. I am not suggesting that we remove them. I am just saying that, as part of an already established pipeline we use, duplicates are automatically marked in every sequencing library we align. I was just wondering whether there is a parameter in CollectTargetedPcrMetrics to calculate statistics on a library while ignoring the fact that some reads are marked as duplicates. This would serve as an alternative to developing a branch in the pipeline that doesn't run mark duplicates. Either solution is fine. We can easily develop a branch. I would just like to know whether there were other options available.

          I intend to use all the reads whether they are duplicates or not in our future analyses.

          Hope that clarifies the confusion.
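
For readers looking for a third option besides a tool parameter or a pipeline branch: the duplicate mark is just one bit (0x400, "PCR or optical duplicate") in the SAM FLAG field, so it could in principle be cleared before the metrics step. The sketch below is not something proposed in this thread and is not a Picard feature; it assumes the alignments are available as SAM text (one record per line), and the function name `clear_duplicate_flag` is hypothetical.

```python
import sys

# SAM FLAG bit meaning "PCR or optical duplicate" (from the SAM specification).
DUPLICATE_FLAG = 0x400


def clear_duplicate_flag(sam_line: str) -> str:
    """Return the SAM line with the duplicate bit cleared.

    Header lines (starting with '@') and blank lines are passed through
    unchanged. Hypothetical helper, written for illustration only.
    """
    if not sam_line.strip() or sam_line.startswith("@"):
        return sam_line
    newline = "\n" if sam_line.endswith("\n") else ""
    fields = sam_line.rstrip("\n").split("\t")
    # Field 2 (index 1) of an alignment record is the integer FLAG.
    fields[1] = str(int(fields[1]) & ~DUPLICATE_FLAG)
    return "\t".join(fields) + newline


if __name__ == "__main__":
    # Stream SAM text from stdin to stdout, clearing the duplicate bit.
    for line in sys.stdin:
        sys.stdout.write(clear_duplicate_flag(line))
```

Only the FLAG field is touched, so reads keep their positions and qualities. Whether this is preferable to simply skipping the mark-duplicates step depends on whether the marked file is needed elsewhere.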


          Originally posted by Bukowski
          Marking duplicates when there is PCR involved seems counter-intuitive. Won't that lead to lots of reads with the same start and stop positions, which will appear to be duplicated? If that's part of the design, why remove them?

          Edit:

          I should clarify this. I do a lot of in-solution capture analysis, and I de-duplicate the data if I'm using (for instance) SureSelect. But if the experiment is HaloPlex I don't - because de-duplicating the data removes data that is there because of the design - it's unavoidable to have data that matches the characteristics of 'duplicates'.


          • Bukowski
            Senior Member
            • Jan 2010
            • 388

            #6
            I guess it's no surprise that in my pipelines I have a switch that says 'don't de-dup the data' for when I need it. Pipelines are not immobile, immovable things, and they're never suitable for every situation.

            It seems to me that the problem isn't with Picard, which is behaving exactly as it should; the fact is that the pipeline shouldn't be marking duplicates. I guess that answers your question though!
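
The "don't de-dup the data" switch described above could look roughly like the sketch below. It is only an illustration: the step labels, the `plan_steps` function, and the set of assay names are all hypothetical placeholders rather than real command lines or anyone's actual pipeline.

```python
# Hypothetical step labels; a real pipeline would map these to actual commands.
ALIGN = "align"
MARK_DUPLICATES = "mark_duplicates"
COLLECT_METRICS = "collect_metrics"

# Assumption: assay types where reads sharing start/stop positions are
# expected by design, so duplicate marking is skipped by default.
AMPLICON_ASSAYS = {"amplicon", "haloplex"}


def plan_steps(assay_type, dedup=None):
    """Return the ordered list of pipeline steps for one library.

    dedup=None picks a default from the assay type; True or False
    overrides that default explicitly.
    """
    if dedup is None:
        dedup = assay_type.lower() not in AMPLICON_ASSAYS
    steps = [ALIGN]
    if dedup:
        steps.append(MARK_DUPLICATES)
    steps.append(COLLECT_METRICS)
    return steps


if __name__ == "__main__":
    for assay in ("sureselect", "haloplex"):
        print(assay, plan_steps(assay))
```

Keeping the switch as an explicit argument with an assay-based default means the established pipeline stays unchanged for capture libraries, while amplicon libraries skip duplicate marking without a separate branch.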

