01-05-2017, 11:50 AM   #29
Brian Bushnell

Clumpify can now do duplicate removal with the "dedupe" flag. Paired reads are only considered duplicates if both reads match. By default, all copies of a duplicate are removed except one; the highest-quality copy is retained. By default subs=2, so 2 substitutions (mismatches) are allowed between "duplicates" to compensate for sequencing error, but this can be overridden. I recommend allowing substitutions during duplicate removal; otherwise, it will enrich the dataset with reads containing errors.
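To make the effect of "subs" concrete, here is a toy Python sketch of substitution-tolerant deduplication on single-end reads. It is not Clumpify's implementation: the greedy grouping, the use of mean Phred quality to pick the "highest-quality" copy, and the function names are all assumptions made for illustration.

```python
def mismatches(a, b):
    """Count substitutions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def dedupe(reads, subs=2):
    """Greedy toy dedupe. reads is a list of (name, seq, qual) tuples.

    Reads are visited from highest to lowest mean quality, so the copy
    retained for each group of duplicates is its best-quality member.
    Assumption: "highest-quality" means highest mean Phred score.
    """
    ordered = sorted(reads, key=lambda r: mean_quality(r[2]), reverse=True)
    kept = []
    for name, seq, qual in ordered:
        is_duplicate = any(
            len(seq) == len(k_seq) and mismatches(seq, k_seq) <= subs
            for _, k_seq, _ in kept
        )
        if not is_duplicate:
            kept.append((name, seq, qual))
    return kept

reads = [
    ("r1", "ACGTACGTAC", "IIIIIIIIII"),  # high-quality original
    ("r2", "ACGTACGTAT", "5555555555"),  # same molecule, 1 substitution, lower quality
    ("r3", "TTTTACGTAC", "IIIIIIIIII"),  # 3 substitutions away from r1
]

print([r[0] for r in dedupe(reads, subs=2)])  # ['r1', 'r3']
print([r[0] for r in dedupe(reads, subs=0)])  # ['r1', 'r3', 'r2']
```

With subs=0 the error-containing copy r2 survives as if it were a unique read, which is the enrichment for reads with errors mentioned above.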

Example commands:

Clumpify only; don't remove duplicates:
Code: clumpify.sh in=reads.fq.gz out=clumped.fq.gz
Remove exact duplicates:
Code: clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=0
Mark exact duplicates, but don't remove them (they get " duplicate" appended to the name):
Code: clumpify.sh in=reads.fq.gz out=clumped.fq.gz markduplicates subs=0
Remove duplicates, allowing up to 5 substitutions between copies:
Code: clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=5
Remove ALL copies of reads with duplicates rather than retaining the best copy:
Code: clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe allduplicates
Remove optical duplicates only (duplicates within 40 pixels of each other):
Code: clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40 spantiles=f
Note that the optimal setting for dist is platform-specific; 40 is fine for NextSeq and HiSeq2500/1T.

Remove optical duplicates and tile-edge duplicates:
Code: clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40
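For readers wondering what "within 40 pixels" means in terms of the read headers, here is a rough Python sketch of the positional test only. It assumes Illumina-style read names of the form instrument:run:flowcell:lane:tile:x:y, Euclidean distance on the x/y fields, and a same-tile requirement (roughly the spantiles=f case). The function names are hypothetical and Clumpify's exact rule may differ. A real optical duplicate must also match in sequence, which this check does not cover.

```python
import math

def parse_position(name):
    """Return (lane, tile, x, y) from an Illumina-style read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y, optionally
    followed by a space and a comment field such as "1:N:0:1".
    """
    fields = name.split()[0].lstrip("@").split(":")
    lane, tile, x, y = (int(v) for v in fields[3:7])
    return lane, tile, x, y

def within_optical_distance(name_a, name_b, dist=40):
    """True if both reads sit on the same lane and tile, at most dist apart."""
    lane_a, tile_a, xa, ya = parse_position(name_a)
    lane_b, tile_b, xb, yb = parse_position(name_b)
    if (lane_a, tile_a) != (lane_b, tile_b):
        return False  # this sketch never pairs reads across tiles
    return math.hypot(xa - xb, ya - yb) <= dist

a = "@M1:7:FC1:1:1101:1000:2000 1:N:0:1"
b = "@M1:7:FC1:1:1101:1030:2020 1:N:0:1"  # about 36 units from a
c = "@M1:7:FC1:1:1101:1030:2040 1:N:0:1"  # exactly 50 units from a
d = "@M1:7:FC1:1:1102:1000:2000 1:N:0:1"  # same x/y as a, different tile

print(within_optical_distance(a, b))  # True
print(within_optical_distance(a, c))  # False
print(within_optical_distance(a, d))  # False
```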

Clumpify only detects duplicates within the same clump. It will therefore always detect 100% of identical duplicates, but it is not guaranteed to find all duplicates with mismatches. This is similar to deduplication by mapping: with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected. However, Clumpify is more affected by errors than mapping-based duplicate detection, so it is more likely to miss duplicates that contain mismatches. To increase sensitivity, you can reduce the kmer length from the default of 31 to a smaller number like 19 with the flag "k=19", and increase the number of passes from the default of 1 to, say, 3:
Code: clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe k=19 passes=3 subs=5
Each pass has a chance of identifying more duplicates, because a different kmer is selected for seeding clumps. Given enough passes, any pair of duplicates that shares at least one kmer will eventually land in the same clump, regardless of how many errors they have. In practice, the majority are identified in the first pass, and you don't get much more after about the 3rd pass. Decreasing k and (especially) using additional passes will take longer, and there is no point in doing either if you are running with subs=0 (identical duplicates only), because in that case all duplicates are guaranteed to be found in the first pass. If all the data fits in memory, additional passes are extremely fast; the speed penalty is only noticeable when the data does not all fit in memory. Even so, passes=3 will generally still be much faster than using mapping to remove duplicates.
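The reason extra passes help can be shown with a toy model in Python. Each read is assigned to a clump by one "pivot" kmer, and each pass ranks kmers differently, so a different pivot gets chosen. The ranking used here (a different ordering of the bases per pass) is purely an assumption chosen so the example can be checked by hand. It is not how Clumpify selects kmers, and it ignores reverse complements. Reads are assumed to contain only A/C/G/T and to be at least k long.

```python
# One arbitrary base ordering per pass; illustrative only, not Clumpify's rule.
ALPHABET_ORDERS = ["ACGT", "TGCA", "GATC"]

def pivot_kmer(seq, k, pass_index):
    """Pick the kmer that ranks lowest under this pass's base ordering."""
    order = ALPHABET_ORDERS[pass_index % len(ALPHABET_ORDERS)]
    rank = {base: i for i, base in enumerate(order)}
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return min(kmers, key=lambda kmer: [rank[b] for b in kmer])

def first_shared_pass(read_a, read_b, k, passes):
    """Return the 1-based pass in which both reads pick the same pivot
    (and so would land in the same clump), or None if they never do."""
    for p in range(passes):
        if pivot_kmer(read_a, k, p) == pivot_kmer(read_b, k, p):
            return p + 1
    return None

original = "AAAACGTGCT"
with_error = "AATACGTGCT"  # one substitution, inside the pass-1 pivot of the original

print(pivot_kmer(original, 4, 0), pivot_kmer(with_error, 4, 0))  # AAAA AATA
print(first_shared_pass(original, with_error, k=4, passes=1))    # None
print(first_shared_pass(original, with_error, k=4, passes=3))    # 2
print(first_shared_pass(original, original, k=4, passes=1))      # 1
```

The substitution changes the first-pass pivot, so the two copies are only compared once a later pass picks a kmer they still share. Identical reads always pick the same pivot, which is why extra passes add nothing when subs=0.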

I am still working on adding twin-file support to Clumpify, by the way.