SEQanswers

Old 02-13-2017, 03:55 AM   #1
AgatheJ
Junior Member
 
Location: England

Join Date: Mar 2012
Posts: 5
Removing duplicate PacBio reads

Hi all,

I think the thread title says it all, but here are a few more details about my problem:

I have used PacBio sequence capture on plant samples and, when looking at my BAM files (PacBio reads mapped to my reference genome), I can see quite a lot of duplicate reads (starting and ending at the same positions).

I know how to remove duplicate reads from Illumina data (using samtools rmdup) but not from PacBio data. As I understand it, samtools rmdup considers reads to be duplicates when they have the same start and end positions and identical sequences.
However, because the frequency of sequencing errors is higher in PacBio than in Illumina, the second criterion (identical sequences) usually fails even for real duplicates.
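
For Illumina data, I would run something like this (file names are placeholders; -s is for single-end reads):

Code:
samtools rmdup -s illumina_sorted.bam illumina_rmdup.bam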

What I would like is a program or script that looks for reads with the same mapping position and, ideally, fewer than X% mismatches, and keeps only one of them (maybe the one most similar to the reference?). However, my searches so far have turned up nothing.
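
Just to illustrate the position part, even something as crude as the sketch below would be a start. It is a hypothetical example: it keeps the first primary alignment seen per chromosome/start coordinate and ignores the mismatch criterion entirely.

Code:
# hypothetical sketch: keep the first primary alignment per (chrom, start);
# a real tool would also compare end positions and apply a mismatch threshold
samtools view -h -F 0x900 mapped_sorted.bam \
  | awk '/^@/ {print; next} {key=$3":"$4; if (!(key in seen)) {seen[key]=1; print}}' \
  | samtools view -b - > dedup_by_position.bam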

Do you know of anything like this?

Many thanks!

Agathe

P.S. The PacBio reads are CCS reads (each the consensus of at least 3 subreads).

Old 02-13-2017, 04:56 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,454

You can identify and remove duplicate or contained reads, allowing up to some number of edits or substitutions, with BBMap's Dedupe program, though at higher error rates the settings need to be tweaked a bit (there's a guide here).
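
For example, at higher error rates you would relax the identity threshold, along these lines (illustrative values only, not a recommendation for your data):

Code:
dedupe.sh in=reads.fq out=unique.fq minidentity=0.9 maxedits=100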

But before you do that, why are these duplicates present, and why do you want to remove them?
Old 02-13-2017, 05:32 AM   #3
AgatheJ
Junior Member
 
Location: England

Join Date: Mar 2012
Posts: 5

Hi Brian,

Thanks for your answer!
I think I get duplicates because I had to use multiple PCR steps during library preparation as part of the sequence capture protocol (one round of PCR cycles to barcode the DNA fragments and another after sequence capture).
I would like to remove these duplicates because I am afraid it will affect variant calling.
Say I have 20x coverage at a particular position: 10 reads identical to the reference and 10 others carrying an alternative base, all mapping to the same positions (perhaps with a low number of mismatches here and there). If those 10 alternative reads are PCR duplicates of a single molecule, I would not want this position to be called as variable.
I can already filter using read depth and QUAL (maybe MQ too), but I am not sure whether that will be enough. I feel that removing duplicates would remove a layer of complexity, but maybe I am wrong and it is not necessary. I would like some advice on this.
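
For example, I imagine something along these lines (a sketch with made-up thresholds, assuming the calls are in a VCF):

Code:
bcftools filter -e 'QUAL<20 || DP<10 || MQ<40' -o filtered.vcf calls.vcf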

Thanks again!

Agathe
Old 02-13-2017, 06:00 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,454

I would be worried about removing PCR duplicates in this context. Doing so will enrich for lower-quality reads, since sequencing errors make it less obvious that those reads are duplicates, so they are more likely to survive deduplication.

I'm not really sure what to do with a highly PCR-amplified PacBio library of a polyploid. Or were there only 2 rounds of amplification total, meaning at most 4 clones of each molecule?

What kind of coverage depth do you have?
What is the ploidy of the organism?
Is this whole-genome shotgun or something else?
Old 02-13-2017, 06:26 AM   #5
AgatheJ
Junior Member
 
Location: England

Join Date: Mar 2012
Posts: 5

I guess it is true that I would select for lower-quality reads in theory, but since I have already filtered out low-quality reads (using Trimmomatic), maybe that is less of a problem?

The good thing here is that I am working with a diploid. I did two rounds of PCR, roughly 30 cycles in total (15 + 15), so more than 4 copies of a molecule are definitely possible.

It seems that duplicates are not everywhere: some regions look okay, while others appear to have lots of duplicates. I am not sure what causes that discrepancy. Are some regions more prone to duplication?

Read depth is around 15x, and the data are not whole-genome but rather a specific gene family that was enriched for and sequenced (using this technique: http://www.mycroarray.com/mybaits/MY...hment+kit.html)

Old 02-13-2017, 06:32 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,204

Quote:
I have already filtered out low-quality reads (using Trimmomatic)
What settings did you use for that? What fraction of the total reads/data was removed?
Old 02-13-2017, 07:35 AM   #7
AgatheJ
Junior Member
 
Location: England

Join Date: Mar 2012
Posts: 5

I used the following parameters with Trimmomatic-0.36:

Code:
ILLUMINACLIP:TruSeq2_3-SE.fa:2:30:10 \
SLIDINGWINDOW:100:20 \
HEADCROP:100 \
MINLEN:500
I recovered 60,186 reads out of 65,751, and the read length distribution changed from 65-12,503 bp to 500-6,886 bp.
The FastQC output looks fine.
Old 02-14-2017, 12:38 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,454

Quality-trimming really makes more sense for Illumina/Ion Torrent data than for PacBio CCS data. And you certainly should not be adapter-trimming with Illumina adapter sequences!

I'd recommend quality-filtering instead. Trimming is an especially bad idea if you want to look for duplicates, since it changes read ends and lengths, which is exactly what duplicate detection relies on. You can quality-filter with BBDuk like this:

Code:
bbduk.sh in=reads.fq out=filtered.fq minlen=100 maq=10
That will remove reads with average quality below 10; the exact threshold to use depends on the average number of passes each read got. For 1600 bp amplicons with many passes, I was using maq=17.

Since you did amplify quite a bit, removing duplicates is probably a good idea. You can do that with Dedupe:

Code:
dedupe.sh in=filtered.fq out=deduped.fq minidentity=0.97 minlengthpercent=0.97 maxedits=200

Old 02-19-2017, 08:38 AM   #9
AgatheJ
Junior Member
 
Location: England

Join Date: Mar 2012
Posts: 5

Thanks again for your answer!

I removed the Illumina adaptors because we used them for multiplexing: we prepared DNA libraries with a NEBNext library prep kit and in-house multiplexing oligos, pooled the samples, performed sequence capture on the pool, and then sequenced the captured material with PacBio.

I guess I could remove the Illumina adaptors with Trimmomatic (adapter clipping only, no quality trimming), then do the quality filtering as you suggested, and finally the duplicate removal, along the lines of the sketch below.
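
For example (a rough sketch; the file names and the in-house adaptor file are placeholders, and the thresholds are the ones you suggested):

Code:
# 1) adaptor removal only, no quality trimming
java -jar trimmomatic-0.36.jar SE reads.fq clipped.fq \
    ILLUMINACLIP:multiplex_adaptors.fa:2:30:10

# 2) quality filtering with BBDuk
bbduk.sh in=clipped.fq out=filtered.fq minlen=100 maq=10

# 3) duplicate removal with Dedupe
dedupe.sh in=filtered.fq out=deduped.fq minidentity=0.97 minlengthpercent=0.97 maxedits=200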

Agathe