Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • fah
    Junior Member
    • Jun 2010
    • 1

    Samtools's rmdup vs. Picard's MarkDuplicates

    I am trying to remove duplicate using samtools's rmdup and rmdupse, they don't seem to just remove everything that share the same start/end sites. I couldn't seem to find a good description of what exactly rmdup/rmdupse does, nor could i understand the C source code.

    I found in another post that Picard might does a better job at this, but it seems controversial. So I am wondering what people use and what are the pros/cons of the two options.

    thanks,
  • nilshomer
    Nils Homer
    • Nov 2008
    • 1283

    #2
    Not too controversial, use Picard.

    Comment

    • xguo
      Member
      • Jul 2008
      • 48

      #3
      Originally posted by nilshomer View Post
      Not too controversial, use Picard.
      Do you know the difference between Picard and samtools in terms of duplicate removal? Thanks.

      Comment

      • nilshomer
        Nils Homer
        • Nov 2008
        • 1283

        #4
        Originally posted by xguo View Post
        Do you know the difference between Picard and samtools in terms of duplicate removal? Thanks.
        Something about not considering paired-end/mate-pair reads correctly. Did you try searching the site yet?

        Comment

        • xguo
          Member
          • Jul 2008
          • 48

          #5
          Originally posted by nilshomer View Post
          Something about not considering paired-end/mate-pair reads correctly. Did you try searching the site yet?
          I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.

          Comment

          • nilshomer
            Nils Homer
            • Nov 2008
            • 1283

            #6
            Originally posted by xguo View Post
            I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.
            I don't think the sequences are considered, since two observations of the same PCR fragment could be different. You could read the source code or email the samtools help list ([email protected]). Please post back when you get the answer, or maybe add it to the wiki under duplicate removal for the future benefit of others.

            Also, the first search on google using the string "site: seqanswers.com duplicate removal" was useful.
            Last edited by nilshomer; 06-13-2010, 08:22 PM. Reason: more info

            Comment

            • xguo
              Member
              • Jul 2008
              • 48

              #7
              Originally posted by nilshomer View Post
              I don't think the sequences are considered, since two observations of the same PCR fragment could be different. You could read the source code or email the samtools help list ([email protected]). Please post back when you get the answer, or maybe add it to the wiki under duplicate removal for the future benefit of others.

              Also, the first search on google using the string "site: seqanswers.com duplicate removal" was useful.
              The following is what I found from samtools-help list:

              "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
              have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

              Comment

              • nilshomer
                Nils Homer
                • Nov 2008
                • 1283

                #8
                Originally posted by xguo View Post
                The following is what I found from samtools-help list:

                "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
                have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."
                This would be useful to add to http://wiki.seqanswers.com/ so we can build up a base of answers and not have to search the forums. Do you want to add it?

                Comment

                • mingkunli
                  Member
                  • Jan 2009
                  • 42

                  #9
                  now I come to anther question: once we have reads with various lengths, should 3' coordinates should also be considerred(or length of the aligned part) ? for reads from the reverse strand, 3' is the beginning of the fragment, am I right?



                  Originally posted by xguo View Post
                  The following is what I found from samtools-help list:

                  "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
                  have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

                  Comment

                  • JohnK
                    Senior Member
                    • Feb 2010
                    • 106

                    #10
                    Originally posted by xguo View Post
                    I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.
                    Are you sure of this?

                    Comment

                    • lh3
                      Senior Member
                      • Feb 2008
                      • 686

                      #11
                      I believe Picard is doing the same for SE and this is the right thing to do. PCR duplicates can have different sequences. Optical duplicates have identical sequences.

                      Comment

                      • JohnK
                        Senior Member
                        • Feb 2010
                        • 106

                        #12
                        Hi lh3, thanks for the response! I understand PCR duplicates, but can you clarify what an optical duplicate is? I like what you are saying. Thanks again!

                        Comment

                        • Michael.James.Clark
                          Senior Member
                          • Apr 2009
                          • 207

                          #13
                          Originally posted by JohnK View Post
                          Hi lh3, thanks for the response! I understand PCR duplicates, but can you clarify what an optical duplicate is? I like what you are saying. Thanks again!
                          There is a nice discussion of such things (including a post where Heng explains the answer to your question) as well as links to numerous other threads on the topic here: http://seqanswers.com/forums/showthread.php?t=6854

                          Hope that helps.

                          Originally posted by mingkunli
                          now I come to anther question: once we have reads with various lengths, should 3' coordinates should also be considerred(or length of the aligned part) ? for reads from the reverse strand, 3' is the beginning of the fragment, am I right?
                          PCR duplicates do not necessarily have the same sequence OR the same length. The 5' end of the read is the most reliable position because if it's a PCR duplicate, it will be guaranteed to be the same at that end but not at the 3' end. That is, I think, the reasoning behind not considering the 3' end.

                          Also, reads from the reverse strand are still in 5'->3' context. The enzymes always go in the same direction, right?
                          Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                          Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                          Projects: U87MG whole genome sequence [Website] [Paper]

                          Comment

                          • maricu
                            Junior Member
                            • Apr 2010
                            • 8

                            #14
                            Originally posted by lh3 View Post
                            I believe Picard is doing the same for SE and this is the right thing to do. PCR duplicates can have different sequences. Optical duplicates have identical sequences.
                            I might arrive quite late to this thread, but I am going through the same dilemma... In one section of the 1000 genomes paper (suppl) they mention that they use rmdup and then to make sure the run MarkDuplicates on top of that... Would this be the ultimate best approach??

                            In my case I am using SE data, and if I'm guessing right for SE picard and samtools work the same for dupl removal??

                            Thanks!

                            Comment

                            • swbarnes2
                              Senior Member
                              • May 2008
                              • 910

                              #15
                              I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

                              java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

                              java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

                              The resultant file is only a little smaller than the original. So I checked some known duplicates;

                              grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
                              100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
                              100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

                              grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
                              100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
                              100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

                              I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.
                              Last edited by swbarnes2; 11-19-2010, 03:18 PM. Reason: typos

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Pathogen Surveillance with Advanced Genomic Tools
                                by seqadmin




                                The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                                03-24-2025, 11:48 AM
                              • seqadmin
                                New Genomics Tools and Methods Shared at AGBT 2025
                                by seqadmin


                                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                                The Headliner
                                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                                03-03-2025, 01:39 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 10:17 AM
                              0 responses
                              7 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-20-2025, 05:03 AM
                              0 responses
                              49 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-19-2025, 07:27 AM
                              0 responses
                              59 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-18-2025, 12:50 PM
                              0 responses
                              50 views
                              0 reactions
                              Last Post seqadmin  
                              Working...