01-23-2017, 06:08 AM   #40
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,815

Quote:
Originally Posted by dpryan
Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.
That request has been in for some time. I also wanted to see counts (with the associated sequences) to gauge how acute a problem the duplicates may be.

For now, use the following workaround provided by @Brian.

Code:
clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
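If you would rather split the marked file in one pass, here is a rough awk sketch of the same idea. It is not part of BBTools: the function name split_dupes is made up for illustration, and it assumes 4-line FASTQ records and that marked reads carry the word "duplicate" in their header (which is what the names=duplicate substring filter above relies on).

```shell
# Split a marked FASTQ into flagged and unflagged reads by header text.
# Assumptions: 4-line records; flagged headers contain "duplicate".
# split_dupes is a hypothetical helper, not a BBTools command.
split_dupes() {
  in_fq=$1; dupes_fq=$2; unique_fq=$3
  # Create/empty both outputs so each exists even if it receives no reads.
  : > "$dupes_fq"; : > "$unique_fq"
  awk -v dupes="$dupes_fq" -v rest="$unique_fq" '
    # Header line of each record decides where the whole record goes.
    NR % 4 == 1 { out = (index($0, "duplicate") > 0) ? dupes : rest }
    { print >> out }
  ' "$in_fq"
}
```

Usage would be something like split_dupes y.fq dupes.fq unique.fq; dividing the line count of dupes.fq by 4 then gives the number of flagged reads.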
Quote:
BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
This is a bit murky. I have done the sweeps with the HiSeq 4000 data I have access to. If I keep spantiles=f, I don't see any optical duplicates until dupedist=20. Note: the edge-duplicate problem seen with NextSeq (which led @Brian to set spantiles=t by default) is not present in HiSeq 4000/MiSeq data (again, based on the data I have seen).

I have not yet pulled out the reads using the method above to look at their coordinates/sequences.

It may be good to see what you get.
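If you do run your own sweep, a small helper along these lines may save some typing. It only prints the clumpify.sh command lines so you can check them before running anything. The function name sweep_cmds, the output file names, and the distance values in the example are all placeholders; the flags are just the ones already mentioned in this thread.

```shell
# Print (do not run) one clumpify.sh command per dupedist value.
# sweep_cmds and the marked_dist*.fq names are hypothetical.
sweep_cmds() {
  in_fq=$1; shift
  for d in "$@"; do
    echo "clumpify.sh in=$in_fq out=marked_dist${d}.fq markduplicates optical dupedist=$d spantiles=f"
  done
}

# Example: sweep_cmds x.fq 10 20 40
```

Once the printed commands look right, you can pipe the output to sh, then pull the flagged reads out of each marked_dist*.fq file and compare the counts across distances.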

Last edited by GenoMax; 01-23-2017 at 06:13 AM.