  • duplicate reads removal

    Is there any software that removes duplicate single-end reads? (CASAVA does it for paired-end reads only.)

  • #2
    According to the man page, SAMTools has a mode to do this



    rmdup samtools rmdup <input.srt.bam> <out.bam>
    Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. This command ONLY works with FR orientation and requires ISIZE is correctly set.

    rmdupse samtools rmdupse <input.srt.bam> <out.bam>
    Remove potential duplicates for single-ended reads. This command will treat all reads as single-ended even if they are paired in fact.



    • #3
      For duplicate removal, Picard is recommended. It does a better job than samtools-C.



      • #4
        Duplicate removal

        Hi,

        I am educating myself on duplicate removal. Why/How is Picard better than Samtools?

        Thanks.



        • #5
          Picard removes duplicates across chromosomes, but samtools cannot.



          • #6
            Originally posted by lh3
            Picard removes duplicates across chromosomes, but samtools cannot.
            Is that the only notable difference?



            • #7
              Removing duplicates before mapping.

              Hi,

              Is there any software that removes duplicate PE or MP reads
              before mapping? I would like to remove duplicates before doing
              de novo assembly.
              Thanks.

              Corthay



              • #8
                There is no way to determine what is a PCR duplicate at that level. That is why it has to be done at mapping level. Even then, not all of them are true PCR duplicates (read lh3's statistical calculation of the expected number of PCR dups to find in a sample).
                -drd



                • #9
                  It is possible to dedup before mapping. You may hash the first 14bp of each end and discard a pair if its 14+14bp coincides with that of another pair. This method is not as good as deduping after mapping, but should be good enough. On the other hand, I do not think deduping is really necessary for assembly.
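
The prefix-hashing idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from any existing tool: the function name `dedup_pairs`, the `(name, seq1, seq2)` tuple layout, and the example reads are all hypothetical, and the sketch keeps whichever pair it sees first rather than choosing by quality.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (read name, sequence of end 1, sequence of end 2)
ReadPair = Tuple[str, str, str]


def dedup_pairs(pairs: Iterable[ReadPair], prefix_len: int = 14) -> Iterator[ReadPair]:
    """Yield only the first pair seen for each (prefix of end 1, prefix of end 2) key."""
    seen = set()  # Python hashes the tuple keys for us
    for name, seq1, seq2 in pairs:
        key = (seq1[:prefix_len], seq2[:prefix_len])
        if key in seen:
            continue  # same 14+14 bp as an earlier pair: treat as a duplicate
        seen.add(key)
        yield name, seq1, seq2


# Made-up example reads
pairs = [
    ("p1", "ACGTACGTACGTACGGTTT", "TTGCATTGCATTGCAAACC"),
    ("p2", "ACGTACGTACGTACGGAAA", "TTGCATTGCATTGCAAGGG"),  # same first 14 bp on both ends as p1
    ("p3", "ACGTACGTACGTACGGTTT", "GGGCATTGCATTGCAAACC"),  # end 2 starts differently
]
kept = [name for name, _, _ in dedup_pairs(pairs)]
print(kept)  # ['p1', 'p3']
```

Because only the first 14 bp of each end are compared, a sequencing error inside that prefix makes a true duplicate look unique, which is one reason this is weaker than deduping after mapping.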



                  • #10
                    Originally posted by lh3
                    It is possible to dedup before mapping. You may hash the first 14bp of each end and discard a pair if its 14+14bp coincides with that of another pair. This method is not as good as deduping after mapping, but should be good enough. On the other hand, I do not think deduping is really necessary for assembly.
                    Thanks for the idea. I would just like to check whether deduping is necessary for assembly, as the Panda Genome paper did it for long-insert-size libraries.

                    Corthay



                    • #11
                      Originally posted by corthay
                      Thanks for the idea. I would just like to check whether deduping is necessary for assembly, as the Panda Genome paper did it for long-insert-size libraries.

                      Corthay
                      Hi, Corthay.

                      I always remove duplicates for both SE and PE data. For PE you should expect to remove between 5 and 15 percent of reads; for SE the fraction is significantly larger, anywhere from 30 to possibly 60 percent. It depends on the quality of the PCR step, of course, which I personally know little about. Whether to remove duplicates also depends on what you're doing. If you're working with NGS/MPS/HTS data and want to accurately determine all the SNPs, you probably don't have time to go through each called variant, so you want the most accurate calls and should remove the duplicates. However, if you have a single gene of interest, you can just as easily inspect the region or SNP visually, regardless of whether you removed the duplicates, and judge whether that call is valid.



                      • #12
                        I have tried to use the rmdup command and have found something quite strange.

                        I have a sam file from my alignment. I view it as a bam, and then filter on quality with :
                        /data/common/programs/samtools/samtools view -h $f.srt.bam | awk '{if($5 >= 10 || $1 == "@SQ" || $1 == "@PG") print $0}' | /data/common/programs/samtools/samtools view -bS - > $f.srt.unique-qual-ge10.bam

                        This gives me the file I want to work with. I need an output for QuEST with duplicates removed, so I tried two approaches:
                        1. First extract the fields in the format needed for QuEST, then use the UNIX sort command to keep the alignments with a unique chromosome, position and strand.
                        2. First use rmdup to get a new BAM file, then extract the fields in the format needed for QuEST.

                        The two results are different. I would have assumed that rmdup removes the alignments with the same chromosome, strand and position, so that extracting these fields with sort -u would give the same number of alignments in the end.

                        Can anyone explain this?



                        • #13
                          We looked into it in the end, and it simply turns out that reads with an insertion/deletion in the alignment get their start position shifted in the output, but samtools rmdup takes it into account when removing the PCR duplicates.

                          I have definitely learned something today.
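
A rough way to see how such a discrepancy can arise: in SAM, POS is always the leftmost reference coordinate, so for reverse-strand reads the 5' end sits at the right-hand end of the alignment, and an insertion or deletion changes how far that is from POS. The sketch below compares a `sort -u`-style key on (chromosome, strand, POS) with a key on the 5' end. It is an illustration only: it assumes a duplicate caller that keys on the 5' end, does not reproduce the actual samtools rmdup logic, ignores soft clipping, and uses made-up alignments and hypothetical helper names.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_end(pos: int, cigar: str) -> int:
    """1-based rightmost reference coordinate covered by the alignment."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + span - 1


def five_prime(pos: int, cigar: str, reverse: bool) -> int:
    """Reference coordinate of the read's 5' end (soft clipping ignored)."""
    return reference_end(pos, cigar) if reverse else pos


# Hypothetical reverse-strand alignments: (chrom, POS, CIGAR, is_reverse)
alignments = [
    ("chr1", 100, "50M", True),       # covers 100..149
    ("chr1", 98, "20M2D30M", True),   # deletion: covers 98..149
    ("chr1", 101, "20M1I29M", True),  # insertion: covers 101..149
]

by_pos = {(c, rev, pos) for c, pos, cig, rev in alignments}
by_5p = {(c, rev, five_prime(pos, cig, rev)) for c, pos, cig, rev in alignments}
print(len(by_pos), len(by_5p))  # 3 1
```

Here the three reads have different POS values, so a unique sort on POS keeps all of them, while a key on the 5' end treats them as one group.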



                          • #14
                            What is an acceptable PCR duplicate percentage in a ChIP-seq dataset and in an RNA-seq dataset after mapping?

                            In my ChIP-seq dataset, Picard reported 66% duplicates after mapping. I think this is too high, so I would like to know what an acceptable duplicate level is.



                            • #15
                              Originally posted by ttnguyen
                              What is an acceptable PCR duplicate percentage in a ChIP-seq dataset and in an RNA-seq dataset after mapping?

                              In my ChIP-seq dataset, Picard reported 66% duplicates after mapping. I think this is too high, so I would like to know what an acceptable duplicate level is.

                              It's not too high necessarily. It really depends on starting DNA quantity and how much you PCR it up. I've seen between 30% and even up to over 80% depending on the context of the protein we're after.

