Old 01-23-2017, 06:11 AM   #41
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476

Additionally, is there any way to make Clumpify itself respect the "threads=" setting? pigz seems to respect it, but Clumpify uses as many threads as it can get regardless of what I specify. This is in version 36.86.
Old 01-23-2017, 06:26 AM   #42
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476

Quote:
Originally Posted by GenoMax View Post
For now use the following workaround provided by @Brian.

Code:
clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
Thanks, in the interim I wrote something in C that I can call once to do this (it also strips "duplicate" from the read names).

Quote:
Originally Posted by GenoMax View Post
This is a bit murky. I have done the sweeps with HiSeq 4000 data I have access to. If I keep spantiles=f then I don't see any optical dups until dupedist=20. Note: the edge-duplicate problem seen with the NextSeq (for which @Brian sets spantiles=t by default) is not present on the HiSeq 4000/MiSeq (again, based on data I have seen).
Thanks, I'm running this now on a single sample; I'll post an image when I have a worthwhile sweep range.
Old 01-23-2017, 06:33 AM   #43
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

Quote:
Originally Posted by dpryan View Post
Thanks, I'm running this now on a single sample; I'll post an image when I have a worthwhile sweep range.
Do the sweep with spantiles=f and t. I was only interested in optical duplicates when I did mine.
Old 01-23-2017, 11:33 AM   #44
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by dpryan View Post
Additionally, is there any way to make Clumpify itself respect the "threads=" setting? pigz seems to respect it, but Clumpify uses as many threads as it can get regardless of what I specify. This is in version 36.86.
Oh, hmmm... that will be very tricky. When running with one group, Clumpify should respect threads correctly. But when writing temp files (which happens whenever the reads won't all fit in memory), it uses at least one thread per temp file, and the default is a minimum of 11 temp files. Your best bet, unfortunately, would be to bind the process to a certain number of cores. You can also manually set the number of groups, which indirectly affects the number of threads used.
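For example, on Linux you could bind the process to a fixed set of cores with taskset (a sketch; the core list and file names are placeholders):

Code:
taskset -c 0-3 clumpify.sh in=reads.fq.gz out=clumped.fq.gz threads=4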

Clumpify also uses multithreaded sorting, which uses all available cores, but normally that only happens for a small fraction of the runtime. However, I will add a flag to disable it.
Old 01-23-2017, 11:44 AM   #45
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by dpryan View Post
Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.

The impetus behind this is removing optical duplicates before delivery to our labs, while still writing them to a separate file or files in case they're needed for some reason.
I plan to add a separate output stream for duplicate reads as well, though I might not get to it this week.
Old 01-24-2017, 01:35 AM   #46
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476

I'm attaching two plots from a sample with ~152 million reads, one truncated at a distance of 10000 and another going all the way to 50000. For what it's worth, I noticed that the Broad is using a distance of 2500 for patterned flow cells, which seems pretty reasonable. If one enables tile spanning, you don't see saturation until ~20000, which seems a bit over the top.

Interestingly, though, with tile spanning enabled a distance of 1 is already sufficient to find duplicates. I'll post the density of duplicates by X/Y coordinate to make sure there's no NextSeq-like tile-edge effect.

Update: Yup, no tile-edge effect, so not spanning tiles makes sense.
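For anyone who wants to reproduce this kind of sweep, a minimal sketch (file names are placeholders; the flags are the ones discussed in this thread, and the duplicate counts come from each run's log):

Code:
for d in 10 100 1000 2500 5000 10000 20000 50000; do
    clumpify.sh in=reads.fq.gz out=dist_${d}.fq.gz dedupe optical dupedist=${d} spantiles=f 2> dist_${d}.log
done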


Last edited by dpryan; 01-24-2017 at 02:40 AM.
Old 01-24-2017, 04:26 AM   #47
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

Interesting. I was only looking at distances of up to 100. (Does a plot of 10-100 show a consistent increase? It should.) It does look like 2500 should be a reasonable setting.
Old 01-24-2017, 04:31 AM   #48
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476

Keep in mind that 10-100 on the HiSeq 2500 is equivalent to 100-1000 on the HiSeq 3000/4000/X; apparently Illumina multiplied everything by 10 (because hey, why not?).

Anyway, yes, it's essentially a linear increase starting at 20 (or 0 if spantiles is enabled).

Edit: Originally wrote 8 instead of 20. I had something else in mind apparently...
Old 01-24-2017, 04:40 AM   #49
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

Will try dupedist=2500. Thanks for the heads-up.

Not many people have been looking systematically at duplicates (optical or otherwise), since Picard requires alignments for marking dups. We may find more duplicates than we expect once we start looking routinely.

Last edited by GenoMax; 01-26-2017 at 11:03 AM.
Old 01-24-2017, 09:35 AM   #50
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by dpryan View Post
Keep in mind that 10-100 on the HiSeq 2500 is equivalent to 100-1000 on the HiSeq 3000/4000/X; apparently Illumina multiplied everything by 10 (because hey, why not?).
Oh, thanks for the heads-up! Now Genomax's earlier observations make a lot more sense; on earlier platforms you see the asymptote well before 100.

Also, it looks like maybe spantiles should default to false, since it is really only applicable to the NextSeq. The reason it shows an immediate increase at 1 is that cross-tile coordinate-based duplicate detection does not use Euclidean distance like intra-tile detection does. Instead, since only one coordinate is expected to be comparable (either X or Y), it uses the minimum of the X difference and the Y difference. So at dist=1 you are basically comparing a cluster to every other cluster in the same row or column. This does not have much effect on the HiSeq 2500 or NextSeq (other than for tile-edge duplicates), but it looks like it has a huge impact on patterned flowcells because everything is already aligned in rows and columns.
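In practice, then, a patterned-flowcell run would look something like this (a sketch with placeholder file names, using the dupedist=2500 value discussed above):

Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dupedist=2500 spantiles=f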
Old 01-31-2017, 12:42 PM   #51
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

I looked at one sample from the NovaSeq data that Illumina made available recently. Here is a scan of dupedist= values (spantiles=f dupesubs=0), similar to the one @Devon posted above.

Temporarily removing the image until @Brian reconfirms things.

Last edited by GenoMax; 01-31-2017 at 06:35 PM.
Old 01-31-2017, 12:52 PM   #52
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476

Interesting, I guess a good cut-off there is 20000. Gotta love needing different settings per machine.

Edit: Out of curiosity, what sort of percentage would 100 million reads represent? I also wonder whether the setting will depend on which of the S1/S2/S3/S4 flow cells is being used.

Last edited by dpryan; 01-31-2017 at 12:54 PM.
Old 01-31-2017, 12:59 PM   #53
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

This sample had
Code:
Reads In:         1606392082
Clumps Formed:     356710875
Based on the name of the file, it must have come from an S1 flowcell.
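(For scale, that puts 100 million reads at 100/1606 ≈ 6% of this sample's 1,606,392,082 input reads.)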

Last edited by GenoMax; 01-31-2017 at 01:02 PM.
Old 01-31-2017, 06:11 PM   #54
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by dpryan View Post
Interesting, I guess a good cut-off there is 20000.
I'm not really sure. The graph is strange. Genomax estimated that the tiles appear to be roughly 33k by 37k, so dist=20000 would span more than half of the tile. I'm guessing that maybe this was a highly amplified library with lots of PCR duplicates, and the reason the slope decreases and eventually levels off is a high error rate combined with "subs=0". I will investigate further. But thanks, @Genomax, for posting it!

Quote:
Gotta love needing different settings per machine.
Yes, it makes deciding on default parameters very difficult, especially when the company never really publicizes any of this, so you have to discover it empirically...
Old 02-05-2017, 07:38 AM   #55
Stewart Russell
Junior Member
 
Location: Toronto, ON, CA

Join Date: Jan 2017
Posts: 2
Clumpify - dedup fasta/fastq

Hey Brian, first off, thanks for designing these great tools and providing thorough explanations.

I'm trying to de-duplicate small RNA-seq reads from a 2x150 bp MiSeq run. The ultra-low-input library prep has made the adapter sequences less than straightforward, so I've removed them prior to attempting to merge and remove duplicates.

My question is whether Clumpify respects the read names and will only remove paired duplicates, or whether it can remove all duplicates in the data. My reads have 3' and 5' 4N sequences (to reduce bias in adapter ligation) that can be used to identify PCR duplicates, if I can somehow collapse to unique reads before trimming them.

Thanks in advance for your suggestions,

Stewart
Old 02-05-2017, 09:01 AM   #56
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

If you have extra bases at the 3'- and 5'-ends, they may complicate duplicate identification; you can add dupesubs=4 when running Clumpify.

Here are combinations of flags for clumpify that may be useful.

Code:
dedupe=f optical=f   (default)
  Nothing happens with regard to duplicates.

dedupe=t optical=f
  All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
  Nothing happens.

dedupe=t optical=t
  Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.

The allduplicates flag removes all copies of duplicates rather than leaving a single copy. But like optical, it has no effect unless dedupe=t.
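For instance, to drop every copy of each duplicate rather than keeping one (a sketch; file names are placeholders):

Code:
clumpify.sh in=reads.fq.gz out=nodupes.fq.gz dedupe=t allduplicates=t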
Old 02-05-2017, 09:42 AM   #57
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by Stewart Russell View Post
My question is whether clumpify respects the read names and will only remove paired duplicates, or if it can remove all duplicates in the data.
Clumpify does respect pairs, and by default it will only consider a pair a duplicate if both reads match. In this situation it makes the most sense to me to merge the reads first, then remove duplicates on the merged reads:

Code:
bbmerge.sh in=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq mininsert=20
clumpify.sh in=merged.fq out=merged_clumped.fq dedupe k=20
Merging will generally remove any adapter bases on the ends, because those are not part of the overlapping sequence. Also, since merging removes bases off the ends of the adapters too, there will be fewer mismatches to confuse the issue. Note that in this case I set "mininsert=20" and "k=20" because I don't know your minimum expected length for small RNAs; by default BBMerge won't look for insert sizes shorter than 35, and by default Clumpify will not work on reads shorter than 31bp, so set those as appropriate. If you are not expecting insert sizes shorter than 35bp then you can remove those flags.

But if you want to deduplicate the raw reads (though I'd really recommend adapter-trimming first) and ignore pairing, you can add the "unpair" flag to Clumpify:

Code:
clumpify.sh in1=r1.fq in2=r2.fq out=clumped.fq dedupe k=20 unpair
As Genomax mentioned, you will probably need to play with the "dupesubs" flag if there are expected mismatches due to non-genomic bases. Note that after doing this, pairing order will be lost, so you'd need to run repair.sh to recover it.
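That re-pairing step would look something like this (a sketch; file names are placeholders):

Code:
repair.sh in=clumped.fq out1=fixed_r1.fq out2=fixed_r2.fq outs=singletons.fq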

Last edited by Brian Bushnell; 02-05-2017 at 09:45 AM.
Old 02-05-2017, 10:43 AM   #58
Stewart Russell
Junior Member
 
Location: Toronto, ON, CA

Join Date: Jan 2017
Posts: 2

Quote:
Originally Posted by Brian Bushnell View Post
by default BBMerge won't look for insert sizes shorter than 35, and by default Clumpify will not work on reads shorter than 31bp, so set those as appropriate. If you are not expecting insert sizes shorter than 35bp then you can remove those flags.
This explains why I was losing so many reads from BBMerge. I should have read the docs more carefully!

Quote:
Originally Posted by Brian Bushnell View Post
But, if you want to deduplicate the raw reads (though I'd really recommend adapter-trimming first), and ignore pairing, you can add the "unpair" flag to Clumpify
Given that I'm looking to merge/remove adapters and then collapse based on sequence plus the 5' and 3' 4N, allowing for 3 non-templated bases, I think I need to do:

Code:
bbmerge.sh in=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq mininsert=20
clumpify.sh in=merged.fq out=merged_clumped.fq dedupe dupesubs=3 k=20 unpair
Quote:
Originally Posted by Brian Bushnell View Post
Note that after doing this pairing order will be lost so you'd need to run repair.sh to recover it.
At this point I don't think I need to re-pair because I have my data in a clean, single file for mapping.

Thanks for the help
Old 03-03-2017, 09:48 AM   #59
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

A blog post about duplicates on the HiSeq 4000 has been posted to the QCFail site (with pretty graphics): https://sequencing.qcfail.com/articl...ated-sequences
Old 03-09-2017, 12:13 AM   #60
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476

Since inevitably others will run into this and wonder...

Please note that if you run FastQC on files that have been through Clumpify, you will see a bias in the reported duplication rates between read #1 and read #2 of a pair: the duplication rate FastQC returns for read #2 will be much higher (often 2-3x) than that of read #1. This is a technical artifact of FastQC's duplication module only looking at the first 100,000 reads in each file. Since Clumpify reorders the files for increased compressibility, the first 100,000 reads are expected to be similar to each other, which skews the results.

In other words, don't worry about such observations.
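If you nonetheless want a less-biased FastQC duplication estimate on clumped files, one option (an untested sketch, using BBTools' shuffle.sh to re-randomize read order; file names are placeholders) would be:

Code:
shuffle.sh in=clumped_r1.fq.gz out=shuffled_r1.fq.gz
fastqc shuffled_r1.fq.gz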