**Brian Bushnell** · 02-13-2017, 05:56 AM

You can identify and remove duplicate or contained reads with up to some number of edits or substitutions with BBMap's Dedupe program, though at higher error rates, the settings need to be tweaked a bit (there's a guide here).

But before you do that, why are these duplicates present, and why do you want to remove them?

**AgatheJ** · 02-13-2017, 06:32 AM

Hi Brian,

Thanks for your answer!
I think I get duplicates because I had to use multiple PCR steps during library preparation as part of the sequence capture protocol (one round of PCR cycles to barcode the DNA fragments and another after sequence capture).
I would like to remove these duplicates because I am afraid it will affect variant calling.
Say I have 20x at a particular position: 10 reads identical to the reference, and 10 others carrying an alternative base but all sharing the same start and end positions (perhaps with a few mismatches here and there). I would not want this position to be called as variable.
I can already filter on read depth and QUAL (maybe MQ too), but I am not sure this will be enough. I feel that removing duplicates would remove a layer of complexity, but maybe I am wrong and it is not necessary. I would like some advice on this.
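
To make the concern concrete, here is a minimal Python sketch of the 20x scenario. The read coordinates and bases are made-up toy values, and collapsing reads by identical start and end is a simplification chosen for illustration; it is not how any particular deduplication tool works, and real PacBio duplicates would rarely match exactly.

```python
from collections import Counter

# Hypothetical toy pileup at one site: (read_start, read_end, base_at_site).
# Ten reads support the reference base "A" with varied alignment coordinates;
# ten support "G" but all share one start and end, as PCR copies might.
reads = [(100 + 7 * i, 1600 + 11 * i, "A") for i in range(10)]
reads += [(250, 1900, "G")] * 10


def allele_counts(records):
    """Count the reads supporting each base at the site."""
    return Counter(base for _, _, base in records)


def collapse_by_coordinates(records):
    """Keep a single read per (start, end, base) combination."""
    return sorted(set(records))


raw = allele_counts(reads)
collapsed = allele_counts(collapse_by_coordinates(reads))
print("raw:", dict(raw))
print("collapsed:", dict(collapsed))
```

Before collapsing, the site looks like a balanced 10/10 heterozygote; afterwards the alternative base is supported by a single read against ten reference reads.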

Thanks again!

Agathe

**Brian Bushnell** · 02-13-2017, 07:00 AM

I would be worried about removing PCR-duplicates in this context. Doing so will enrich for lower-quality reads (since it is less obvious they are duplicates).

I'm not really sure what to do with a highly PCR-amplified PacBio library of a polyploid. Or were there only 2 rounds of amplification total, meaning at most 4 clones of each molecule?

What kind of coverage depth do you have?
What is the ploidy of the organism?
Is this whole-genome shotgun or something else?

**AgatheJ** · 02-13-2017, 07:26 AM

I guess it is true that I would select for lower-quality reads in theory, but since I have filtered out low quality reads in the first place (using trimmomatic), maybe that is less of a problem?

The good thing here is that I am working with a diploid. I have done two rounds of PCR steps, which is equivalent to ~30 cycles or so (15 and 15). So more than 4 copies of a molecule are definitely possible.
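
As a rough upper bound, and assuming every cycle doubles every molecule (real PCR efficiency is lower, and the capture step between the two rounds discards molecules), the number of copies per original molecule can be sketched as follows. The function name is just for illustration.

```python
def max_copies(cycles: int) -> int:
    """Upper bound on copies of one template after `cycles` perfect doublings."""
    return 2 ** cycles


print(max_copies(2))   # 4
print(max_copies(15))  # 32768
print(max_copies(30))  # 1073741824
```

So "at most 4 clones" holds for 2 cycles, while 15 + 15 cycles leaves room for far more copies of any one molecule than the ~15x depth could ever sample.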

It seems that duplicates are not everywhere. Some regions look okay, others appear to have lots of duplicates. I am not sure why there is such a discrepancy. Are some regions more prone to duplication?

Read depth is around 15x, and the data is not whole-genome: a specific gene family was enriched and sequenced (using this technique: http://www.mycroarray.com/mybaits/MY...hment+kit.html)

**GenoMax** · 02-13-2017, 07:32 AM

> I have filtered out low quality reads in the first place (using trimmomatic)

What was the setting used for that? What fraction of the data/reads were removed out of the total?

**AgatheJ** · 02-13-2017, 08:35 AM

I used the following parameters in Trimmomatic-0.36:

ILLUMINACLIP:TruSeq2_3-SE.fa:2:30:10 \
SLIDINGWINDOW:100:20 \
HEADCROP:100 \
MINLEN:500 \

I recovered 60,186 reads out of 65,751, and the read length range changed from 65-12,503 bp to 500-6,886 bp.
Fastqc output looks fine.

**Brian Bushnell** · 02-14-2017, 01:38 PM

Quality-trimming really makes more sense for Illumina/Ion Torrent data than PacBio CCS data. And you certainly should not be adapter-trimming using Illumina adapter sequences!

I'd recommend quality-filtering instead. And especially if you want to look for duplicates, trimming is a bad idea. You can quality-filter with BBDuk like this:

Code:

bbduk.sh in=reads.fq out=filtered.fq minlen=100 maq=10

That will discard reads with average quality below 10; the exact number you should use depends on the average number of passes each read got. For 1600 bp amplicons with many passes, I was using maq=17.
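
For intuition about what an average-quality cutoff measures, here is a small Python sketch. It shows two ways of averaging a read's qualities: the plain mean of Phred scores, and the mean error probability converted back to Phred. Which of the two BBDuk uses for `maq` is not stated in this thread, so the choice made in `passes` is an assumption, and all function names are illustrative rather than part of BBTools.

```python
import math


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(char) - offset for char in quality_string]


def mean_phred(scores):
    """Arithmetic mean of the Phred scores."""
    return sum(scores) / len(scores)


def mean_error_as_phred(scores):
    """Average the per-base error probabilities, then convert back to Phred."""
    probabilities = [10 ** (-q / 10) for q in scores]
    return -10 * math.log10(sum(probabilities) / len(probabilities))


def passes(scores, min_avg_quality):
    """Keep a read if its probability-based average meets the cutoff (assumed rule)."""
    return mean_error_as_phred(scores) >= min_avg_quality


# Toy read: four Q40 bases followed by four Q10 bases.
scores = phred_scores("IIII++++")
print(mean_phred(scores))
print(round(mean_error_as_phred(scores), 1))
print(passes(scores, 10), passes(scores, 17))
```

For this toy read the plain mean is Q25, but the probability-based average is only about Q13, so it would pass a cutoff of 10 and fail a cutoff of 17. A few low-quality bases dominate the second measure.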

Since you did amplify quite a bit, removing duplicates is probably a good idea. You can do that with Dedupe:

Code:

dedupe.sh in=filtered.fq out=deduped.fq minidentity=0.97 minlengthpercent=0.97 maxedits=200
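
The Python sketch below illustrates what thresholds of this kind express: two reads are treated as near-duplicates only if they are similar in length, differ by a bounded number of edits, and exceed a minimum identity. It is a toy comparison of two sequences using plain Levenshtein distance, not Dedupe's actual algorithm; the parameter names only mirror the flags above, and reading the 0.97 values as fractions is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # match or substitution
            ))
        previous = current
    return previous[-1]


def is_near_duplicate(a, b, min_identity=0.97,
                      min_length_fraction=0.97, max_edits=200):
    """Hypothetical rule combining length, edit-count and identity thresholds."""
    shorter, longer = sorted((a, b), key=len)
    if len(shorter) < min_length_fraction * len(longer):
        return False
    edits = edit_distance(shorter, longer)
    identity = 1 - edits / len(longer)
    return edits <= max_edits and identity >= min_identity


def substitute(sequence, positions, base="T"):
    """Return a copy of `sequence` with `base` written at each position."""
    chars = list(sequence)
    for position in positions:
        chars[position] = base
    return "".join(chars)


template = "ACGT" * 25                                   # 100 bp toy read
two_errors = substitute(template, [10, 50])              # 98% identity
five_errors = substitute(template, [2, 22, 42, 62, 82])  # 95% identity

print(is_near_duplicate(template, two_errors))    # True
print(is_near_duplicate(template, five_errors))   # False
print(is_near_duplicate(template, template[:90])) # False: too short
```

On multi-kilobase reads the quadratic distance calculation would be slow, which is one reason to rely on a dedicated tool rather than a script like this.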

**AgatheJ** · 02-19-2017, 09:38 AM

Thanks again for your answer!

I removed Illumina adaptors because we used them for multiplexing. We prepared the DNA libraries using a NEBNext library prep kit and in-house multiplexing oligos. The samples were pooled, and we performed sequence capture on the pool, which was then sequenced with PacBio.

I guess I could remove the Illumina adaptors using trimmomatic and then do quality filtering as you suggested, and finally the duplicate removal.

Agathe

Removing duplicate PacBio reads

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News