  • duplicate PCR Single and Paired End

    Hi NGS users,
    I'm developing a script to remove PCR duplicates, because I have observed that the samtools rmdup function doesn't always work well.
    I'm using BioPerl objects and the Bio::DB::Sam utilities.
    I need to distinguish whether a BAM file is paired-end or single-read. Is there any way to do this with Bio::DB::Sam?
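One way to tell the two cases apart is the SAM FLAG field: bit 0x1 is set on every read that comes from a paired run. The sketch below is a minimal illustration in Python rather than Perl, working on plain SAM text lines; the function name `is_paired_end` and the `sample_size` parameter are made up for this example, and it assumes the file is not a mix of paired and unpaired libraries.

```python
PAIRED_FLAG = 0x1  # SAM FLAG bit: read belongs to a template with multiple segments


def is_paired_end(sam_lines, sample_size=1000):
    """Guess whether alignments come from a paired-end run.

    Inspects the FLAG column (second field) of up to `sample_size` records
    and returns True as soon as one record has the paired bit set.
    """
    checked = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.split("\t")[1])
        if flag & PAIRED_FLAG:
            return True
        checked += 1
        if checked >= sample_size:
            break
    return False
```

The same bit test applies to the numeric flag value that any BAM library hands back for an alignment, so the idea carries over to a Perl script unchanged.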

    To remove duplicates:
    If the NGS experiment is single-read, I consider reads with the same start/end and the same sequence to be duplicates.
    If the NGS experiment is paired-end, a pair is a PCR duplicate if it has the same start of the first mate and the same end of the second mate as another pair. Is that right?

    Thanx a lot!
    ME

  • #2
    I would say that PCR duplicates in SE sequencing can appear to have different end coordinates and different sequences, because the PCR process itself can introduce errors and so can the sequencing. That is why it is really difficult to remove true PCR duplicates from single-end reads: you don't know whether a duplicate is caused by PCR duplication, real enrichment (e.g. in RNA-seq or ChIP-seq) or simply saturation of the sequencing depth. Optical duplicates in an Illumina run can be identified by cluster proximity, and they will have identical sequence.

    If you want to be really thorough and remove all duplicates (PCR or not), and you trust the quality of the alignment, you could define PCR duplicates in a single end experiment as reads with identical start point (5' of the read, so in the SAM/BAM format you would have to use the CIGAR string to determine the start position of the reverse strand reads) and then ignore any sequence information or indels in the read. That would remove all PCR duplicates, but also duplicate reads due to enrichment etc. It really depends on what type of analysis you want to make and the quality of the data.
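The single-end definition above can be sketched roughly as follows. This is only an illustration under stated assumptions: reads are given as simple tuples rather than BAM records, the function names are invented for the example, soft-clipped bases are ignored when locating the 5' end, and the first read seen at each position is the one that is kept.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def five_prime_position(pos, cigar, is_reverse):
    """Return the 1-based reference coordinate of the read's 5' end.

    Forward reads start at POS. Reverse reads start at the rightmost
    aligned base, which is POS plus the reference length of the CIGAR minus 1.
    """
    if not is_reverse:
        return pos
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + ref_len - 1


def collapse_single_end(reads):
    """Keep the first read seen for each (chrom, strand, 5' position) key.

    `reads` yields (name, chrom, pos, cigar, is_reverse) tuples.
    Sequence content and indels are deliberately ignored.
    """
    seen = set()
    kept = []
    for name, chrom, pos, cigar, is_reverse in reads:
        key = (chrom, is_reverse, five_prime_position(pos, cigar, is_reverse))
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept
```

Two reverse-strand reads with different POS values can still share a 5' end, which is why the CIGAR has to be consulted instead of comparing POS directly.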

    PCR duplicates in PE experiments are much easier to identify since they would have equal total fragment length and fragment location, like you mention.
    Last edited by Thomas Doktor; 02-21-2011, 11:11 AM.
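For the paired-end case, "equal total fragment length and fragment location" can be turned into a lookup key. The sketch below assumes each pair is represented once, by its leftmost mate (the one with a positive TLEN in the SAM record), and that POS and TLEN are trusted as reported by the aligner. The names `fragment_key` and `collapse_pairs` are hypothetical, and mate orientation and soft clipping are not considered.

```python
def fragment_key(chrom, pos, tlen):
    """Return (chrom, fragment_start, fragment_end) for one read pair.

    `pos` is the 1-based POS and `tlen` the positive TLEN of the leftmost mate.
    """
    return (chrom, pos, pos + tlen - 1)


def collapse_pairs(leftmost_mates):
    """Keep the first pair seen for each fragment location.

    `leftmost_mates` yields (name, chrom, pos, tlen) tuples, one per pair.
    """
    seen = set()
    kept = []
    for name, chrom, pos, tlen in leftmost_mates:
        key = fragment_key(chrom, pos, tlen)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept
```

Pairs that start at the same position but span a different fragment length get different keys, so they are not collapsed.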

    • #3
      Thanx a lot Thomas for this clear answer!

      • #4
        You can use Picard tools to remove PCR duplicates...

        • #5
          What if you wanted to collapse all PCR duplicates into a single read? Doesn't Picard tools simply flag them as PCR duplicates, so that if you remove them you lose that read position completely?
          Or does Picard tools only flag subsequent duplicates and not the first one it encountered?

          • #6
            Originally posted by Thomas Doktor:
            What if you wanted to collapse all PCR duplicates into a single read? Doesn't Picard tools simply flag them as PCR duplicates, so that if you remove them you lose that read position completely?
            Or does Picard tools only flag subsequent duplicates and not the first one it encountered?

            I am not sure what "collapse all PCR duplicates" means... Can you explain in more detail?

            Picard marks reads as duplicates and removes them only when REMOVE_DUPLICATES is set to 'YES'... If you keep this option at 'NO', it will just mark the duplicates and keep the reads in the file... Most downstream analysis tools consider only uniquely mapped reads, so one read per start position still gives all the information....
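When duplicates are only marked, they are identified by SAM FLAG bit 0x400 (decimal 1024), and dropping them later is a simple filter. The sketch below shows that filter on plain SAM text lines; the function name is invented for this example. On the command line, `samtools view -F 1024` applies the same exclusion.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def drop_flagged_duplicates(sam_lines):
    """Return header lines plus every record whose duplicate bit is not set."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass through unchanged
            kept.append(line)
            continue
        flag = int(line.split("\t")[1])
        if not flag & DUPLICATE_FLAG:
            kept.append(line)
    return kept
```

Because the read chosen as the representative of each duplicate set is left unflagged, filtering this way still leaves one read at that position.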

            • #7
              By collapsing I mean removing all but one of the duplicate reads. My question is whether Picard tools marks all the duplicate reads as duplicates or keeps one read as the "original".

              • #8
                I believe it keeps one read per start position and discards the other reads. I have seen this kind of alignment in IGV...

                • #9
                  Thanks for clearing that up.

                  • #10
                    Picard keeps the duplicate with the best quality values and removes/marks the rest.
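The "keep the best, mark the rest" idea can be illustrated as below. This is a sketch of the general approach, not Picard's actual code: it assumes reads have already been grouped by some duplicate key, scores each read by the plain sum of its Phred+33 base qualities, and breaks ties in favour of the read seen first. All names are hypothetical.

```python
from collections import defaultdict


def base_quality_sum(qual, offset=33):
    """Sum the Phred scores encoded in a SAM quality string."""
    return sum(ord(c) - offset for c in qual)


def mark_duplicates(reads):
    """Return the names of reads that would be marked as duplicates.

    `reads` yields (name, key, qual) tuples with unique names, where `key`
    identifies the duplicate set (for example a 5' position or a fragment
    location). The highest-scoring read in each set stays unmarked.
    """
    groups = defaultdict(list)
    for name, key, qual in reads:
        groups[key].append((name, qual))
    duplicates = set()
    for members in groups.values():
        best_name, _ = max(members, key=lambda m: base_quality_sum(m[1]))
        duplicates.update(name for name, _ in members if name != best_name)
    return duplicates
```

So every duplicate set keeps exactly one unmarked read, and the read position is never lost completely.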

                    • #11
                      I used MarkDuplicates from Picard tools and got some very strange output: the counts for UNPAIRED_READS_EXAMINED and UNPAIRED_READ_DUPLICATES are both 0.
                      Here is the detailed output:
                      LIBRARY Unknown Library
                      UNPAIRED_READS_EXAMINED 0
                      READ_PAIRS_EXAMINED 25574519
                      UNMAPPED_READS 35323240
                      UNPAIRED_READ_DUPLICATES 0
                      READ_PAIR_DUPLICATES 5614734
                      READ_PAIR_OPTICAL_DUPLICATES 0
                      PERCENT_DUPLICATION 0.219544
                      ESTIMATED_LIBRARY_SIZE 49364927

                      Can someone explain this to me? Why were no unpaired mapped reads found in the SAM file?
                      Thanks
                      R

                      • #12
                        My guess is that it's not unpaired mapped reads, but simply unpaired reads. Both reads are present in the SAM file, even if only one of them is actually mapped. Picard is just telling you that it found all read pairs in the SAM file, not that all read pairs were mapped (since you have a lot of unmapped reads). If you remove all the unmapped reads, my guess is that Picard would report finding some unpaired reads.

                        • #13
                          Please check the MarkDuplicates column definitions.

                          LIBRARY: The library on which the duplicate marking was performed.
                          UNPAIRED_READS_EXAMINED: The number of mapped reads examined which did not have a mapped mate pair.
                          READ_PAIRS_EXAMINED: The number of mapped read pairs examined.
                          UNMAPPED_READS: The total number of unmapped reads examined.
                          UNPAIRED_READ_DUPLICATES: The number of fragments that were marked as duplicates.
                          READ_PAIR_DUPLICATES: The number of read pairs that were marked as duplicates.
                          READ_PAIR_OPTICAL_DUPLICATES: The number of read pairs duplicates that were caused by optical duplication. Value is always < READ_PAIR_DUPLICATES, which counts all duplicates regardless of source.
                          PERCENT_DUPLICATION: The percentage of mapped sequence that is marked as duplicate.
                          ESTIMATED_LIBRARY_SIZE: The estimated number of unique molecules in the library based on PE duplication.
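Read literally, the first three definitions sort each record by its SAM FLAG bits. The sketch below shows one way to reproduce that tally from flag values alone. It is an interpretation of the definitions quoted above, not Picard's implementation, and it assumes there are no secondary or supplementary alignments in the input; the function names are invented.

```python
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8  # SAM FLAG bits


def classify(flag):
    """Assign one record to 'unmapped', 'unpaired' or 'paired'."""
    if flag & UNMAPPED:
        return "unmapped"
    # A mapped read counts as unpaired if it is single-end
    # or if its mate did not map.
    if not flag & PAIRED or flag & MATE_UNMAPPED:
        return "unpaired"
    return "paired"


def tally(flags):
    """Count records per category and report them under the metric names."""
    counts = {"unmapped": 0, "unpaired": 0, "paired": 0}
    for flag in flags:
        counts[classify(flag)] += 1
    return {
        "UNPAIRED_READS_EXAMINED": counts["unpaired"],
        "READ_PAIRS_EXAMINED": counts["paired"] // 2,  # two records per pair
        "UNMAPPED_READS": counts["unmapped"],
    }
```

Under this reading, UNPAIRED_READS_EXAMINED is 0 whenever no mapped read has an unmapped mate, that is, when both mates of every pair are either mapped or unmapped.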


                          I don't believe that UNPAIRED_READS_EXAMINED is really 0 for my SAM file. According to the Picard report, I only have unmapped reads, mapped read pairs, and duplicated mapped read pairs.
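The two derived columns in the report can be checked against the counts. The sketch below rests on two assumptions that are not stated in the thread: that PERCENT_DUPLICATION counts each read pair as two reads, and that ESTIMATED_LIBRARY_SIZE solves the relation C/X = 1 - exp(-N/X) for X, where N is the number of read pairs examined and C the number of non-duplicate pairs. The function names are invented, and optical duplicates are ignored because the report shows 0.

```python
import math


def percent_duplication(unpaired_examined, pairs_examined, unpaired_dups, pair_dups):
    """Fraction of mapped reads marked as duplicate, counting each pair as two reads."""
    return (unpaired_dups + 2 * pair_dups) / (unpaired_examined + 2 * pairs_examined)


def estimate_library_size(pairs_examined, pair_dups):
    """Solve C/X = 1 - exp(-N/X) for X by bisection.

    Requires 0 < pair_dups < pairs_examined so that a root exists
    between C and 100 * C.
    """
    n = pairs_examined
    c = pairs_examined - pair_dups

    def f(x):
        return c / x - 1 + math.exp(-n / x)

    lo, hi = float(c), float(c) * 100  # f(lo) > 0 and f(hi) < 0
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

With the counts from the report, `percent_duplication(0, 25574519, 0, 5614734)` rounds to the reported 0.219544, and `estimate_library_size(25574519, 5614734)` comes out close to the reported ESTIMATED_LIBRARY_SIZE, so the report is at least internally consistent.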
