
  • #16
    Originally posted by lh3 View Post
    The problem is the enlarged variance, but perhaps the variance caused by duplicates is still small in comparison to the variance in gene expression.
    Hopefully I will be able to evaluate this (when I have some time)... there's probably a paper about this that I haven't read :-D
    It's time to make a call for collaborative work on the effect of sequence dups on this kind of experiment.

    Originally posted by lh3 View Post
    I just want to know what is the true answer...
    I want that too :-)

    d

    Comment


    • #17
      Originally posted by kmcarr View Post
      Even that 8% may be an overestimate of the true PCR duplicates. I recently attended the Illumina Midwest Users Group meeting and an Illumina scientist presented some data on duplicate identification (it may have been the data you referred to since the percentages sound about the same). However they went a step further in distinguishing fragment duplicates from PCR duplicates. They prepared their paired end libraries using Illumina MID tagged style adapters, but instead of a finite set of known MID sequences, the adapters were constructed with random bases where the barcode would be. Now for each cluster they had three data points to compare, reads 1 and 2 and their respective alignment positions on the reference genome, plus the random, 6bp sequence in the MID position. A read would need to match all three of these to be called a PCR duplicate. When they added these random tags they found that the number of identified duplicates dropped from 8% to ~1%.
      Then is there still value in doing duplicate removal?
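To make the quoted idea concrete, here is a minimal sketch of how adding a random tag to the duplicate key changes the count. The tuple layout, the function name `count_duplicates`, and the toy reads are all hypothetical; this is not Illumina's pipeline, only an illustration of "match on both alignment positions plus the random 6 bp tag".

```python
def count_duplicates(pairs, use_tag):
    """Count read pairs flagged as duplicates of an earlier pair.

    Each pair is (chrom, pos_read1, pos_read2, tag). With use_tag=False the
    key is the two alignment positions only; with use_tag=True the random
    tag must match as well.
    """
    seen = set()
    duplicates = 0
    for chrom, pos1, pos2, tag in pairs:
        key = (chrom, pos1, pos2, tag) if use_tag else (chrom, pos1, pos2)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Toy data (made up for illustration).
pairs = [
    ("chr1", 100, 350, "ACGTAC"),
    ("chr1", 100, 350, "ACGTAC"),  # same positions, same tag: likely PCR duplicate
    ("chr1", 100, 350, "TTGCAA"),  # same positions, different tag: independent fragment
    ("chr1", 220, 480, "GGATCC"),
]

by_position = count_duplicates(pairs, use_tag=False)
by_position_and_tag = count_duplicates(pairs, use_tag=True)
```

In this toy set, position-only matching flags two pairs while position-plus-tag matching flags one, which is the direction of the 8% to ~1% drop described in the quote.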
      --
      bioinfosm

      Comment


      • #18
        I don't do the bench work, but from discussing the steps involved I have come to believe that there are multiple sensitive enzymatic steps that are known to introduce bias by their nature, and we are assessing duplicates only after all of those steps have been completed (amplification, adapter ligation, re-amplification, etc.). I believe it depends on the preparation and approach whether such known bias-generating issues are accounted for. For SNP calling, in my experience, removing duplicates is unnecessary. I suggest inspecting the alignment as a secondary step to see whether there are duplicates that may prompt further investigation.

        Comment


        • #19
          Originally posted by husamia View Post
          For SNP calling, in my experience, removing duplicates is unnecessary. I suggest inspecting the alignment as a secondary step to see whether there are duplicates that may prompt further investigation.
          The first time I and a few others learned about PCR duplicates was around the middle of 2007, when Illumina (at that time Solexa) was collaborating with Sanger on SNP discovery in data from a flow-sorted X chromosome. Because Illumina/Solexa was short of DNA, they used the very little DNA that remained to construct the library. When we looked at structural variation and SNP calls from that data set, we found many recurrent "sequencing" errors. Richard pointed out that this was likely to be caused by PCR. I then implemented the "rmdup" component in maq. When we applied that, we got much cleaner SNP/SV calls. In the end, the paper on this data set was published in Nature in 2008 as one of the first three whole-genome resequencing papers using Illumina.

          PCR duplicates really do affect SNP calling. It is true that when the duplicate rate is low, they probably have little impact on the final results, but for most resequencing projects, removing duplicates does almost no harm. Why not do that?

          Comment


          • #20
            Originally posted by husamia View Post
            I don't do the bench work, but from discussing the steps involved I have come to believe that there are multiple sensitive enzymatic steps that are known to introduce bias by their nature, and we are assessing duplicates only after all of those steps have been completed (amplification, adapter ligation, re-amplification, etc.). I believe it depends on the preparation and approach whether such known bias-generating issues are accounted for. For SNP calling, in my experience, removing duplicates is unnecessary. I suggest inspecting the alignment as a secondary step to see whether there are duplicates that may prompt further investigation.
            Removing duplicates has been demonstrated a number of times to improve SNP detection, particularly on samples sequenced at low read depths and in cancer samples. If at this point you aren't seeing a benefit to de-dupping, I think the community would be very interested in your evidence of that.

            The polymerases we use for PCR are fairly biased and do not amplify linearly across all regions, so PCR duplicates likely have a dramatic effect on copy number detection, SV detection, ChIP-seq, and RNA-seq in particular. Because duplicate removal since those first papers has been such a common part of sequence analysis, I do not recall a study actually assessing how they impact CN/ChIP-seq/RNA-seq, but I do recall in our early whole genome sequencing of highly aberrant cancers at UCLA that implementing duplicate removal significantly improved SV detection. It certainly wouldn't be that difficult of an experiment to demonstrate the exact effect of de-dupping in the modern era if someone had a few spare cycles to find out (then again, it's probably been done and I just don't know the paper off the top of my head).

            That said, think about what a waste PCR duplicates are. There is little that causes as much attrition as PCR in a sequencing experiment. We're talking 5-20% of sequence data generated worldwide being thrown out. We can get it pretty low with a great deal of care calibrating our protocols (and a little bit of luck), but it still accounts for a significant percent of any dataset.

            Some potential solutions are revealing themselves: http://www.nature.com/nprot/journal/....2011.345.html
            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
            Projects: U87MG whole genome sequence [Website] [Paper]

            Comment


            • #21
              Originally posted by foxyg View Post
              I know samtools and picard can remove duplicates. But is it really necessary? A duplicate could be a PCR effect or the result of reading the same fragment twice; there is no way to tell.

              Also, how do you define a duplicate? Why do both samtools and picard take BAM files as input? In theory, you could already remove duplicates from the raw data. Is it because they only check the aligned location, not the actual read?
              My case is quite similar to yours. How did you deal with your data in the end? Has a paper been published already?

              Comment


              • #22
                Why are the "recurrent sequencing errors" likely to be caused by PCR?

                Comment


                • #23
                  Originally posted by lh3 View Post
                  When we looked at structural variation and SNP calls from that data set, we found many recurrent "sequencing" errors. Richard pointed out that this was likely to be caused by PCR. I then implemented the "rmdup" component in maq. When we applied that, we got much cleaner SNP/SV calls.

                  Why do the "recurrent sequencing errors" seem to be caused by PCR?

                  Comment


                  • #24
                    Originally posted by dongshenglulv View Post
                    Why do the "recurrent sequencing errors" seem to be caused by PCR?
                    Obviously I'm not Heng, but most likely it goes like this: say you see evidence for a SNP on a few reads, but on each of those reads the SNP occurs at the 37th base of the read. That implies a PCR duplicate. In reality, you should see a SNP occur on a bunch of different strands and at different base-calling cycles within those strands.
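A rough sketch of that check, assuming each variant-supporting read has already been reduced to a hypothetical (strand, offset-in-read) pair. The function name and the toy data are made up, and real callers use more elaborate position and strand bias tests.

```python
def distinct_support(supporting_reads):
    """Number of distinct (strand, offset_in_read) combinations among the
    reads that carry the candidate SNP."""
    return len(set(supporting_reads))


# Every supporting read shows the SNP at cycle 37 on the same strand.
suspicious = [("+", 37), ("+", 37), ("+", 37)]

# The SNP appears on both strands and at different cycles.
plausible = [("+", 37), ("-", 12), ("+", 5), ("-", 41)]

suspicious_support = distinct_support(suspicious)
plausible_support = distinct_support(plausible)
```

When several reads collapse to a single (strand, offset) combination, as in `suspicious`, the evidence effectively comes from one original molecule, which is the pattern described above.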

                    Comment


                    • #25
                      Note the double quotation marks around "sequencing". They are not sequencing errors. They are errors that are introduced by PCR and then amplified in the following PCR cycles. When the duplicate rate is very high, you can get multiple reads containing the same PCR error.

                      Comment


                      • #26
                        Hi
                        We have done pooled-sample targeted sequencing, and according to rmdup and MarkDuplicates 90-95% of my reads are PCR duplicates. However, I think this is quite normal in my case? We have a relatively small capture region (~400 kb) and we are sequencing it to very high coverage (2000x or more). Since we are trying to detect variants in the pooled samples, we need high coverage. However, if I now remove the potential PCR duplicates, I do not have sufficient depth. Please advise!

                        lh3, could you please clarify your formula for the theoretical false dedup rate, 0.28*m/s/L? I have 10-20M pairs for each pool and a targeted region of ~400 kb.

                        Thanks!
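For orientation only, here is how the quoted expression 0.28*m/s/L could be evaluated. This excerpt does not define the symbols, so the reading used here is an assumption: m is the number of read pairs, s the standard deviation of the insert size distribution, and L the length of the sequenced region. The value s = 20 is a hypothetical placeholder, not a number from the thread.

```python
def false_dedup_rate(m, s, L):
    """Evaluate the quoted approximation 0.28 * m / s / L.

    Assumed meanings (not defined in this excerpt):
      m: number of read pairs
      s: standard deviation of the insert size distribution, in bp
      L: length of the sequenced region, in bp
    """
    return 0.28 * m / s / L


assumed_insert_sd = 20.0   # hypothetical placeholder
target_length = 400e3      # ~400 kb capture region, from the post

rate_10m_pairs = false_dedup_rate(10e6, assumed_insert_sd, target_length)
rate_20m_pairs = false_dedup_rate(20e6, assumed_insert_sd, target_length)
```

Under these assumptions the expression is far from small for 10-20M pairs on a ~400 kb target, which would be consistent with many position-identical pairs arising by chance rather than from PCR.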

                        Comment


                        • #27
                          Originally posted by ahven View Post
                          Hi
                          We have done pooled-sample targeted sequencing, and according to rmdup and MarkDuplicates 90-95% of my reads are PCR duplicates. However, I think this is quite normal in my case? We have a relatively small capture region (~400 kb) and we are sequencing it to very high coverage (2000x or more). Since we are trying to detect variants in the pooled samples, we need high coverage. However, if I now remove the potential PCR duplicates, I do not have sufficient depth. Please advise!
                          This is similar to my experiment, where I did pooled sequencing of the mitochondrial genome (16.5kb). It isn't appropriate to remove PCR duplicates in this situation because you can't distinguish PCR duplicates from independent reads that map to exactly the same location.

                          Comment


                          • #28
                            Removing duplicates imposes a cap on sequencing coverage. For single-end data, that cap is 2x the length of the read. (For instance, with 50-mers, a base at position 100 can have at most 100 reads covering it if single-end duplicates have been removed: one forward read spanning bp 51-100, another forward read spanning bp 52-101, ... and then 50 more reads in the reverse direction.) For paired-end data, the cap is far higher, maybe several hundred x, depending on the tightness of the insert sizes. If your coverage is well below that ceiling, then any duplicates are likely PCR artifacts, and getting rid of PCR artifacts is good. If your coverage is well above that ceiling, then some of those duplicates are "real" (not from PCR, but really originating from different pieces of DNA that sheared in exactly the same way), and removing duplicates is going to get rid of some "real" data.
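The single-end ceiling can be checked by brute force. This is a small sketch with hypothetical function names: it enumerates every distinct (start, strand) placement of a fixed-length read that covers a given base, which is all that can remain once duplicates are collapsed. The paired-end case is not modeled here.

```python
def single_end_cap(read_len):
    """Closed-form ceiling from the post: 2 x read length."""
    return 2 * read_len


def enumerate_single_end_cap(read_len, position):
    """Count distinct (leftmost start, strand) placements of a read of
    length read_len that cover the given 1-based reference position."""
    covering = set()
    for strand in ("+", "-"):
        # A read starting at `start` occupies start .. start + read_len - 1.
        for start in range(position - read_len + 1, position + 1):
            if start >= 1:
                covering.add((start, strand))
    return len(covering)


cap_formula = single_end_cap(50)
cap_enumerated = enumerate_single_end_cap(50, 100)
```

For 50-mers at position 100 the enumeration covers starts 51 through 100 on each strand, matching the 2 x 50 figure in the post.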

                            Comment


                            • #29
                              Hi,

                              As far as I know, due to inherent limitations in the sequencing technology, some reads will be exact copies of each other. They share the same sequence and the same alignment position, and they can cause trouble during SNP calling, as an allele may be overrepresented due to amplification bias.

                              My question is whether removing or marking duplicates is necessary for transcriptome data before calling SNPs.
                              What I am doing now with my transcriptome data set is: align, remove duplicates, realign around indels, call SNPs.

                              Thanks for any advice.

                              Comment


                              • #30
                                Not usually, and never if you are using unamplified reads.

                                Comment
