  • High duplicate rates in exomes

    Hello all,

    We are currently carrying out analysis of a series of exomes prepared using the Illumina TruSeq exome selection kit, which targets 62 Mb of the genome. The exomes range between 35 and 75 million 100 bp read pairs, giving us coverages in the order of 50-150x.

    Following alignment with BWA (mapping rate typically ~95-97%), local indel realignment with GATK, and duplicate marking with Picard, we are seeing extremely high rates of PCR duplicates (between 25% and 90%, but most commonly in the order of 40-60%). Obviously this means we lose a huge amount of data if we persist with removing these duplicates.

    The arguments about whether to mark duplicates or not have obviously been done to death here, but since these are the first exomes we've handled, I'm trying to understand if these results are 'normal' in exome sequencing. Clearly there is a much greater chance of identical reads occurring given the comparatively small target region, but is it normal to see such high rates of duplicate reads, or does this look more likely to be a wet-lab issue (overamplification?).

    Thanks in advance for any advice anyone can give,
    James
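
For readers who want to check the arithmetic in the post above, here is a rough back-of-the-envelope sketch. It assumes every sequenced base lands on the 62 Mb target unless `on_target` says otherwise; `on_target` and `dup_rate` are illustrative parameters, and the post does not give an on-target rate, so the output is an upper bound and not a reproduction of the quoted 50-150x.

```python
def mean_coverage(read_pairs, read_len, target_bp, dup_rate=0.0, on_target=1.0):
    """Approximate mean depth over the target region.

    dup_rate and on_target are fractions in [0, 1]; both are assumptions
    supplied by the caller, not values taken from any tool's output.
    """
    total_bases = read_pairs * 2 * read_len          # two reads per pair
    usable_bases = total_bases * on_target * (1.0 - dup_rate)
    return usable_bases / target_bp


TARGET_BP = 62_000_000  # 62 Mb target quoted in the post

for pairs in (35_000_000, 75_000_000):
    for dup in (0.0, 0.4, 0.6):
        depth = mean_coverage(pairs, 100, TARGET_BP, dup_rate=dup)
        print(f"{pairs / 1e6:.0f}M pairs, {dup:.0%} duplicates: ~{depth:.0f}x")
```

Passing a realistic `on_target` fraction for your own capture will bring the estimate closer to the depth you actually observe.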

  • #2
    While a bit high, that is not necessarily crippling, even at 60%. I would remove them as it intuitively does not make sense to keep them. That is higher than expected, and it is probably caused by some/all of the following factors (although, to be fair, I have not done TruSeq exomes so perhaps this is more the norm with them):

    1. Having a small amount of input DNA into the entire prep.
    2. Having library prep reactions not perform optimally, thus causing a smaller percentage of your fragments to be amplified during the PCR.
    3. Too many PCR cycles for one reason or another (likely due to 1 or 2).
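
One way to put a number on the first two factors is to estimate how many distinct molecules were in the library. The sketch below solves the Lander-Waterman-style relation C/X = 1 - exp(-N/X) for the library size X, given N read pairs examined and C unique pairs. As far as I know this is the same relation Picard uses for its estimated library size, but treat this as an independent illustration and not Picard's code; the function name is made up.

```python
import math


def estimate_library_size(read_pairs, unique_pairs):
    """Solve C/X = 1 - exp(-N/X) for X by bisection.

    read_pairs (N): read pairs examined.
    unique_pairs (C): read pairs left after duplicate removal.
    """
    if not 0 < unique_pairs < read_pairs:
        raise ValueError("need 0 < unique_pairs < read_pairs")

    def excess(x):
        # Expected unique pairs from a library of size x, minus the observed
        # count. -expm1(-t) equals 1 - exp(-t) but stays accurate for tiny t.
        return x * -math.expm1(-read_pairs / x) - unique_pairs

    lo = float(unique_pairs)   # excess(lo) is negative here
    hi = lo * 2.0
    while excess(hi) < 0:      # grow the bracket until it contains the root
        hi *= 2.0

    for _ in range(200):       # bisection; excess() increases with x
        mid = (lo + hi) / 2.0
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical sample: 50M pairs sequenced, 50% flagged as duplicates.
size = estimate_library_size(50_000_000, 25_000_000)
print(f"estimated library size: ~{size / 1e6:.1f}M molecules")
```

A library estimated at only a few tens of millions of molecules will saturate quickly at these sequencing depths, which is consistent with low input or an inefficient prep.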


    • #3
      Anecdote of questionable value below...

      Just got my first data back from TruSeq Custom Enrichment using their cancer trial kit. It's only one capture containing 6 samples, but I'm seeing ~30% dups across the board for all 6. There was NO precapture PCR performed, only 10 cycles of post capture amp...this seems pretty high.

      We typically see <10% PCR dups from SureSelect panels of the same size, even though we do a small number of cycles pre-cap, and 10 after.

      I'm going to be on the hunt for more data points for performance/dups for TSCE. The fact that they perform two _identical_ sequential hybridizations (same probes, same temps, same washing conditions) says to me that they didn't develop the kit very well...so I'm skeptical of the performance of the protocol from this standpoint.


      • #4
        James,

        What % of the target regions are covered by reads? How even is that coverage (with and without duplicate removal)?
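
A minimal sketch of how those two questions could be answered, assuming you already have a list of per-base depths over the target intervals (for example, exported from a depth tool restricted to the capture BED). The function name, the thresholds, and the toy depth lists are all hypothetical.

```python
from statistics import mean, pstdev


def coverage_summary(depths, thresholds=(1, 10, 20)):
    """Summarise breadth and evenness of coverage from per-base depths.

    breadth[t] is the fraction of target bases covered at depth >= t.
    cv is the coefficient of variation (std dev / mean); lower means
    more even coverage.
    """
    avg = mean(depths)
    breadth = {t: sum(d >= t for d in depths) / len(depths) for t in thresholds}
    cv = pstdev(depths) / avg if avg else float("nan")
    return {"mean": avg, "breadth": breadth, "cv": cv}


# Toy per-base depths for the same region, with and without duplicates.
with_dups = [0, 40, 80, 120]
without_dups = [0, 20, 35, 45]

for label, depths in (("with duplicates", with_dups),
                      ("duplicates removed", without_dups)):
    print(label, coverage_summary(depths))
```

Comparing the two summaries shows whether duplicate removal mainly lowers depth uniformly or exposes regions that were only covered by duplicated fragments.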


        • #5
          All,

          Thanks for all the comments. I've been looking further into this data, and have now obtained statistics for all our samples and can see there is a clear per-lane bias in the proportion of duplicates being found. While the per-lane average duplicate rate varies between 20% and 87%, the standard deviation within each lane is <5%, so there appears to be a distinct batch effect present. Without knowing how the samples were grouped for preparation it is hard to jump to any conclusions, but this looks extremely suspicious to me...

          Originally posted by kmcarr
          What % of the target regions are covered by reads? How even is that coverage (with and without duplicate removal)?
          I'm in the process of generating those stats, and will update when I have them all available.

          Many thanks,
          James
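
The per-lane comparison described in this post can be sketched as follows. The lane labels and duplicate rates are invented for illustration (the thread does not list per-sample values), and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def per_lane_summary(samples):
    """Group (lane, duplicate_rate) pairs and return {lane: (mean, sd)}.

    sd is the sample standard deviation, or 0.0 for a lane with one sample.
    """
    by_lane = defaultdict(list)
    for lane, rate in samples:
        by_lane[lane].append(rate)
    return {
        lane: (mean(rates), stdev(rates) if len(rates) > 1 else 0.0)
        for lane, rates in by_lane.items()
    }


# Made-up duplicate rates (fractions), two lanes with three samples each.
samples = [
    ("lane1", 0.20), ("lane1", 0.22), ("lane1", 0.24),
    ("lane2", 0.85), ("lane2", 0.87), ("lane2", 0.89),
]

for lane, (avg, sd) in sorted(per_lane_summary(samples).items()):
    print(f"{lane}: mean {avg:.1%}, sd {sd:.1%}")
```

A small spread within each lane alongside a large gap between lane means is the pattern that points to a batch effect.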
