Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing duplicate PacBio reads

    Hi all,

    I think the thread title says it all but below are a bit more details about my problem:

    I have used PacBio sequence capture on plant samples and, when looking at my bam files (with PacBio reads onto my reference genome), I can see that there are quite a lot of duplicate reads (starting and ending at the same positions).

    I know how to remove duplicate reads when they are Illumina (using samtools rmdup) but not when they are PacBio. This is because samtools rmdup considers reads as duplicates when they have the same start and end positions and when their sequences are identical.
    However, because the frequency of sequencing errors is higher in PacBio compared to Illumina, the second criterion (identical sequences) is usually false even for real duplicates.

    What I would like to have is a program or script that would look for reads that have the same mapping position and probably also less than X% mismatches to then keep one of them (maybe keep the one that is most similar to the reference?). However, I was unlucky in my research.

    Do you know of anything like this?

    Many thanks!

    Agathe

    P.S. the PacBio reads are CCS reads (the consensus of at least 3 subreads)
    Last edited by AgatheJ; 02-13-2017, 05:00 AM.

  • #2
    You can identify and remove duplicate or contained reads with up to some number of edits or substitutions with BBMap's Dedupe program, though at higher error rates, the settings need to be tweaked a bit (there's a guide here).

    But before you do that, why are these duplicates present, and why do you want to remove them?

    Comment


    • #3
      Hi Brian,

      Thanks for your answer!
      I think I get duplicates because I had to use multiple PCR steps during library preparation as part of the sequence capture protocol (one round of PCR cycles to barcode the DNA fragments and another after sequence capture).
      I would like to remove these duplicates because I am afraid it will affect variant calling.
      Say I have 20x at a particular position, 10 reads being identical to the reference, 10 others with an alternative base but all having the same positions (but perhaps having a low number of mismatches here and there). I would not want this position to be called as variable.
      I can already filter using read depth and QUAL (maybe MQ too) but is this going to enough I am not sure. I feel that removing duplicates would remove a layer of complexity but maybe I am wrong and it is not necessary. Would like to have advice on this.

      Thanks again!

      Agathe

      Comment


      • #4
        I would be worried about removing PCR-duplicates in this context. Doing so will enrich for lower-quality reads (since it is less obvious they are duplicates).

        I'm not really sure what to do with a highly PCR-amplified PacBio library of a polyploid. Or were there only 2 rounds of amplification total, meaning at most 4 clones of each molecule?

        What kind of coverage depth do you have?
        What is the ploidy of the organism?
        Is this whole-genome shotgun or something else?

        Comment


        • #5
          I guess it is true that I would select for lower-quality reads in theory, but since I have filtered out low quality reads in the first place (using trimmomatic), maybe that is less of a problem?

          The good thing here is that I am working with a diploid. I have done two rounds of PCR steps, which is equivalent to ~30 cycles or so (15 and 15). So more than 4 copies of a molecule are definitely possible.

          It seems that duplicates are not everywhere. Some regions look okay, others appear to have lots of duplicates. I am not sure why that discrepancy. Are some regions more prone to duplication?

          Read depth is around 15x and the data is not whole genome but rather a specific gene family being enriched for and sequenced (using this technique: http://www.mycroarray.com/mybaits/MY...hment+kit.html)
          Last edited by AgatheJ; 02-13-2017, 07:29 AM.

          Comment


          • #6
            I have filtered out low quality reads in the first place (using trimmomatic)
            What was the setting used for that? What fraction of the data/reads were removed out of the total?

            Comment


            • #7
              I used the following parameters in Trimmomatic-0.36:

              ILLUMINACLIP:TruSeq2_3-SE.fa:2:30:10 \
              SLIDINGWINDOW:100:20 \
              HEADCROP:100 \
              MINLEN:500 \

              I recovered 60,186 reads out of 65,751 and the read length distribution changed from 65-12503 to 500-6886bp.
              Fastqc output looks fine.

              Comment


              • #8
                Quality-trimming really makes more sense for Illumina/Ion Torrent data than PacBio CCS data. And you certainly should not be adapter-trimming using Illumina adapter sequences!

                I'd recommend quality-filtering instead. And especially if you want to look for duplicates, trimming is a bad idea. You can quality-filter with BBDuk like this:

                Code:
                bbduk.sh in=reads.fq out=filtered.fq minlen=100 maq=10
                That will filter reads with average quality below 10; the exact number you should use depends on the average number passes each read got. For 1600 bp amplicons with many passes, I was using maq=17.

                Since you did amplify quite a bit, removing duplicates is probably a good idea. You can do that with Dedupe:

                Code:
                dedupe.sh in=filtered.fq out=deduped.fq minidentity=0.97 minlengthpercent=0.97 maxedits=200
                Last edited by Brian Bushnell; 02-14-2017, 01:40 PM.

                Comment


                • #9
                  Thanks again for your answer!

                  I have removed Illumina adaptors because we used them for multiplexing. What we did is to prepare DNA libraries using a NebNext Library prep kit and in-house multiplexing oligos. Samples were pooled and we used sequence capture on the mix which was later sequenced with PacBio.

                  I guess I could remove the Illumina adaptors using trimmomatic and then do quality filtering as you suggested, and finally the duplicate removal.

                  Agathe

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 06:37 PM
                  0 responses
                  8 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, Yesterday, 06:07 PM
                  0 responses
                  8 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  49 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  67 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X