#21 | |
Member
Location: UK Join Date: Dec 2016
Posts: 17
I was wondering whether you plan to implement the above-mentioned change (twin-file support) soon, and if so, when? I'm eager to use Clumpify on our raw data, but I don't like the idea of having to convert to interleaved and then back. We deal with terabytes of data daily, and all our pipelines are set up for twin paired-end files. Also, where can I find the changes introduced in each version of BBTools? Thank you very much for your effort!
#22 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Hi Santiago,
I will add support for twin files, possibly this week, time permitting; otherwise probably next week. I find interleaved files much more convenient, but I suppose twin files are more popular overall. BBTools changes are in /bbmap/docs/changelog.txt; just search for the current version number.
-Brian
#23 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
@chiayi -
I've uploaded a new version of BBMap (36.73) that fixes the problem of incorrect memory estimation for .bz2 files. I'd still recommend setting -Xmx to slightly under half of the memory you request from your cluster, though. Please let me know if this resolves the problem.
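The -Xmx advice can be turned into a small helper. This is a minimal sketch, not part of BBTools: the function name xmx_flag is hypothetical, and reading "slightly under half" as 45% of the request is an assumption you may want to tune for your cluster.

```shell
# Hypothetical helper: turn a scheduler memory request (in GB) into a -Xmx flag.
# Assumption: "slightly under half" is taken to mean 45% of the request.
xmx_flag() (
    requested_gb=$1
    # Work in whole megabytes so that plain integer arithmetic is enough.
    xmx_mb=$(( requested_gb * 1024 * 45 / 100 ))
    echo "-Xmx${xmx_mb}m"
)

# Example: print (not run) the command for a job that requests 64 GB.
echo "clumpify.sh in=reads.fq.bz2 out=clumped.fq.bz2 $(xmx_flag 64)"
```

The example only prints the command line, so it can be checked before it is pasted into a job script.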
#24 | |
Member
Location: UK Join Date: Dec 2016
Posts: 17
Thank you very much for both (quick) replies. I have another question: could you tell me the number of cores needed to run Clumpify? I work in a cluster environment (using SGE), and the amount of memory available to a task is tied to the number of cores you assign to it.
#25 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Clumpify can use any number of cores, and it will use all of them, particularly if you have pigz installed (which I highly recommend if you will be running it on multiple cores). You can restrict it to a single core with "threads=1" if you want. Since you get optimal compression and speed using as much memory as possible, I generally recommend running it on an exclusively-scheduled node and letting it use all memory and all cores; on a 16-core 128GB machine it will generally run at least 16 times faster if you let it use the whole machine compared to restricting it to 1 core and 8 GB RAM.
Ultimately, though, it will still complete successfully with 1 core and 8 GB RAM. The only difference in compression is that you get roughly 5% better compression when the whole dataset fits into memory compared to when it doesn't.
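For SGE users, one way to follow this is to pass the granted slot count through to the "threads" flag. A minimal sketch, assuming the job runs in a parallel environment so that SGE sets NSLOTS; the function name clumpify_cmd is hypothetical, and it only prints the command rather than running it.

```shell
# Hypothetical helper: build a Clumpify command line from the SGE slot count.
# Assumption: NSLOTS is set by SGE for the job; fall back to 1 thread otherwise.
clumpify_cmd() (
    threads=${NSLOTS:-1}
    echo "clumpify.sh in=$1 out=$2 threads=${threads}"
)

# Inside a job script this would print the command for the granted slots.
clumpify_cmd reads.fq.gz clumped.fq.gz
```

On an exclusively-scheduled node the "threads" flag can simply be left off, since the post above says Clumpify will then use all cores.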
#26 |
Member
Location: UK Join Date: Dec 2016
Posts: 17
Oh, thanks! The "threads" option is missing from the command documentation. Also, how should I tell the program to use pigz/pbzip2 (instead of gzip/bzip2)? Does it detect them automatically, or do I have to specify them? I saw in a previous comment that you mentioned the option pigz=f for something, so I imagine there are pigz and pbzip2 options that should be set to true? I haven't found these options documented.
Thanks again!
#27 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
By default, pigz=t and pbzip2=t. If the file names end in .gz or .bz2, pigz and pbzip2 will be used automatically as long as they are in the path, and will be preferred over gzip and bzip2.
As for "threads", there are some flags (like "threads", "pigz", "fastawrap", etc.) that are shared by all BBTools. There are actually quite a lot of them, so I don't normally mention them, to avoid bloating the usage information. But there's a (hopefully) complete description of them in /bbmap/docs/UsageGuide.txt, in the "Standard Flags" section.
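To check in advance which compressor a given file name would end up with on your machine, the preference described above can be imitated in a few lines. This is only an illustration of the stated rule (pigz over gzip, pbzip2 over bzip2, when found in the path), not BBTools' own detection code; the function names have_tool and compressor_for are hypothetical.

```shell
# Hypothetical helpers that mimic the described preference order.
have_tool() { command -v "$1" >/dev/null 2>&1; }

compressor_for() {
    case "$1" in
        *.gz)  if have_tool pigz;   then echo pigz;   else echo gzip;  fi ;;
        *.bz2) if have_tool pbzip2; then echo pbzip2; else echo bzip2; fi ;;
        *)     echo none ;;   # uncompressed file name
    esac
}

# Example: report what would be used for typical input names on this host.
for f in reads.fq.gz reads.fq.bz2 reads.fq; do
    echo "$f -> $(compressor_for "$f")"
done
```

If this reports gzip or bzip2 on a multi-core node, installing pigz or pbzip2 is the fix suggested in the thread.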
#28 |
Member
Location: UK Join Date: Dec 2016
Posts: 17
Hi Brian, thanks for the clarification! I'll be waiting for the new Clumpify update then. Happy holidays!
#29 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Clumpify can now do duplicate removal with the "dedupe" flag. Paired reads are only considered duplicates if both reads match. By default, all copies of a duplicate are removed except one; the highest-quality copy is retained. By default subs=2, so 2 substitutions (mismatches) are allowed between "duplicates" to compensate for sequencing error, but this can be overridden. I recommend allowing substitutions during duplicate removal; otherwise, it will enrich the dataset with reads containing errors.
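The "subs" threshold can be pictured as a cap on the number of mismatching positions between two reads of equal length. The sketch below illustrates that idea only; it is not how Clumpify is implemented, and the function names count_subs and is_duplicate are hypothetical.

```shell
# Count mismatching positions between two equal-length sequences.
# Written in POSIX sh: the first character is peeled off each string per step.
count_subs() (
    a=$1; b=$2; n=0
    [ "${#a}" -eq "${#b}" ] || { echo "length mismatch" >&2; exit 1; }
    while [ -n "$a" ]; do
        ca=${a%"${a#?}"}    # first character of a
        cb=${b%"${b#?}"}    # first character of b
        [ "$ca" = "$cb" ] || n=$(( n + 1 ))
        a=${a#?}; b=${b#?}  # drop the first character of each
    done
    echo "$n"
)

# Succeeds if the two reads differ by at most MAX_SUBS substitutions.
# usage: is_duplicate READ_A READ_B MAX_SUBS
is_duplicate() (
    n=$(count_subs "$1" "$2") || exit 1
    [ "$n" -le "$3" ]
)
```

For instance, is_duplicate ACGTACGT ACGTTCGA 2 succeeds (two mismatches), while the same pair fails with a threshold of 1, which mirrors the difference between the default subs=2 and a stricter setting.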
Example commands:

Clumpify only; don't remove duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz

Remove exact duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=0

Mark exact duplicates, but don't remove them:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz markduplicates subs=0

Remove duplicates, allowing up to 5 substitutions between copies:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=5

Remove all copies of reads with duplicates, rather than retaining the best copy:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe allduplicates

Remove optical duplicates only:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40 spantiles=f

Remove optical duplicates and tile-edge duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40

Clumpify only detects duplicates within the same clump. Therefore, it will always detect 100% of identical duplicates, but is not guaranteed to find all duplicates with mismatches. This is similar to deduplication by mapping: with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected. However, Clumpify is more sensitive to errors than mapping-based duplicate detection. To increase sensitivity, you can reduce the kmer length from the default of 31 to a smaller number like 19 with the flag "k=19", and increase the number of passes from the default of 1 to, say, 3:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe k=19 passes=3 subs=5

I am still working on adding twin-file support to Clumpify, by the way.
#30 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
I ran some parameter sweeps on some NextSeq E. coli 2x150bp reads to illustrate the effects of the parameters on duplicate removal.

This shows how the "dist" flag affects optical and edge duplicates. The command used was:
Code:
clumpify.sh in=first2m.fq.gz dedupe optical dist=D passes=3 subs=10 spantiles=t

This shows the effect of increasing the number of passes for duplicate removal. The command was:
Code:
clumpify.sh in=first2m.fq.gz dedupe passes=P subs=10 k=19

This shows how additional "duplicates" are detected when more mismatches are allowed. The NextSeq platform has a high error rate, and it's probably particularly bad at the tile edges (where most of these duplicates are located), which is why so many of the duplicates have a large number of mismatches. HiSeq 2500 data looks much better than this, with nearly all of the duplicates discovered at subs=1. The command used:
Code:
clumpify.sh in=first2m.fq.gz dedupe passes=3 subs=S k=19

Last edited by Brian Bushnell; 05-04-2017 at 07:42 PM.
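A sweep like the "dist" one can be scripted by substituting each value for D in the command. A minimal sketch: the function name sweep_dist and the example values are hypothetical (the post does not list the values that were tried), and the loop only prints the commands so they can be reviewed or submitted as separate jobs.

```shell
# Hypothetical helper: print one Clumpify command per "dist" value given.
sweep_dist() {
    for d in "$@"; do
        echo "clumpify.sh in=first2m.fq.gz dedupe optical dist=${d} passes=3 subs=10 spantiles=t"
    done
}

# Example values only; choose a range that suits your platform.
sweep_dist 10 20 40 80
```

The same pattern applies to the "passes" and "subs" sweeps by swapping the substituted flag.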
#31 |
Senior Member
Location: US Join Date: Dec 2010
Posts: 452
Hi Brian,
That dedupe function looks great! We have been waiting for such a tool.
#32 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Thanks, luc, I appreciate it.
#33 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
Hi Brian, any update on allowing non-interleaved input/output? I'd love to remove the reformat.sh steps before and after clumpify.sh
#34 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Hi Devon,
Yes, this is all done; I just haven't released it yet. I'll do so tomorrow (it's difficult for me to do where I am now; today's a vacation day here).
#35 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
Ah, right, MLK day. Enjoy the day off and stop checking SEQanswers!
#36 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:
Code:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
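For pipelines built around twin files, the practical change is that the reformat.sh steps before and after Clumpify are no longer needed. The sketch below contrasts the two workflows by printing the commands; the function names and the intermediate file names (interleaved.fq.gz, clumped.fq.gz) are hypothetical.

```shell
# Older workflow discussed in the thread: interleave, clump, then split again.
old_workflow() {
    echo "reformat.sh in1=$1 in2=$2 out=interleaved.fq.gz"
    echo "clumpify.sh in=interleaved.fq.gz out=clumped.fq.gz"
    echo "reformat.sh in=clumped.fq.gz out1=$3 out2=$4"
}

# With twin-file support (BBTools 36.85 onward, per the post above): one step.
new_workflow() {
    echo "clumpify.sh in1=$1 in2=$2 out1=$3 out2=$4"
}

old_workflow r1.fq.gz r2.fq.gz c1.fq.gz c2.fq.gz
new_workflow r1.fq.gz r2.fq.gz c1.fq.gz c2.fq.gz
```

Besides being shorter, the single-step form avoids writing and re-reading two intermediate interleaved files.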
#37 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,081
|
#38 | |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
Thanks for the great update!
#39 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.
The impetus behind this is removing optical duplicates before delivery to our labs, while still writing them to a separate file or files in case they are needed for some reason.
BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
#40 | ||
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,081
For now, use the following workaround provided by @Brian.
Code:
clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f

I have not yet pulled out the reads using the method above to look at the coordinates/sequence. It may be good to see what you get.
Last edited by GenoMax; 01-23-2017 at 06:13 AM.
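The two filterbyname.sh passes can also be pictured as a single split on the read header. The following is an awk stand-in for illustration, not filterbyname.sh itself. It assumes an uncompressed FASTQ with strict 4-line records in which marked reads carry the substring "duplicate" in the header line (as the names=duplicate substring filter above implies); the function name split_marked is hypothetical.

```shell
# Hypothetical helper: split a marked FASTQ into duplicates and unique reads.
# usage: split_marked MARKED.fq DUPES.fq UNIQUE.fq
split_marked() (
    awk -v dupes="$2" -v uniq="$3" '
        # Line 1 of every 4-line record is the header: pick the output file there.
        NR % 4 == 1 { target = (index($0, "duplicate") > 0) ? dupes : uniq }
        # Every line of the record follows its header into the chosen file.
        { print > target }
    ' "$1"
)

# Example (not run here): split_marked y.fq dupes.fq unique.fq
```

Note that an output file is only created if at least one read is routed to it, and that paired or interleaved input would need the decision made per pair rather than per read.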
Tags: bbduk, bbmap, bbmerge, clumpify, compression, pigz, reformat, tadpole