Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • 26% duplicates

    Hi is 26% duplicates an extraordinarily high number for single end sureselect targetted SOLiD reads?
    Also I presume the duplicates are in part of the mapped reads as well?

    I used Picard's markduplicates to arrive at the rmdup bam.

    ## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
    Unknown Library 61303170 0 40652492 26844757 0 0 0.437902

    101955662 in total
    0 QC failure
    26844757 duplicates
    61303170 mapped (60.13%)
    0 paired in sequencing
    0 read1
    0 read2
    0 properly paired (nan%)
    0 with itself and mate mapped
    0 singletons (nan%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)
    Last edited by KevinLam; 08-16-2010, 09:36 PM.
    http://kevin-gattaca.blogspot.com/

  • #2
    I wouldn't say it's extraordinary, although it is quite high. I've certainly seen higher. Depends a bit on the sample too - if you have coverage that is very high compared to the DNA represented in the sample, you will get many duplicates (you will start to sequence the same things over and over again).

    Comment


    • #3
      Hi Kopi-o,
      From my understanding, the PCR duplicates are marked by exact seq and physical proximity of the beads based on the read names pertaining to the platform.

      I can understand if it is additional coverage due to randomness. But I am concerned if perhaps I need to optimise the emulsion PCR step?
      or should I forget about removing duplicates at all? (since it is actually not marking the PCR duplicates but duplicates?)
      http://kevin-gattaca.blogspot.com/

      Comment


      • #4
        I really wouldn't dare to suggest a specific course of action ... it depends on the application you have (standard answer!). You might want to check how many of the duplicates have the exact same sequence (by using Unix sort, for example) and how many just map to the same locations (with sequence differences). That would at least tell you something.

        Comment


        • #5
          Probably not ridiculous if this is only a 50bp SE frag run, which after you remove duplicates means you can only get 50x coverage max. If you apply the birthday problem to this type of probability situation to infer what the chance is that a mapped read, which encompases a given base, is unique you will find it gets extremely discouraging after you achive 20x unique coverage. Unfortunately, this is a situation were PE runs make a huge difference to the number/percentage of duplicates.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Essential Discoveries and Tools in Epitranscriptomics
            by seqadmin


            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
            Yesterday, 07:01 AM
          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          39 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          41 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          35 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          55 views
          0 likes
          Last Post seqadmin  
          Working...
          X