12-20-2016, 06:50 AM   #21
santiagorevale (Junior Member, UK)

Quote:
Originally Posted by Brian Bushnell
OK, I'll make a note of that... there's nothing preventing paired file support, it's just simpler to write for interleaved files when there are stages involving splitting into lots of temp files. But I can probably add it without too much difficulty.
Hi Brian,

I was wondering whether you plan to implement the above-mentioned change in the near future (and if so, when?), because I'm eager to run Clumpify on our raw data, but I don't like the idea of having to convert to interleaved files and back. We deal with terabytes of data daily, and all our pipelines are set up for twin paired-end files.

Also, where can I find the changes introduced in each release of BBTools?

Thank you very much for your effort!
12-20-2016, 08:57 AM   #22
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Hi Santiago,

I will add support for twin files. Possibly this week, time permitting, otherwise probably next week. I find interleaved files much more convenient, but I suppose twin files are more popular overall.

BBTools changes are in /bbmap/docs/changelog.txt; just search for the current version number.
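For example, you could pull out the notes for a particular version with something like this (just an illustration; the version number and the exact entry formatting will vary):
Code:
grep -A 20 "36.73" /bbmap/docs/changelog.txt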

-Brian
12-20-2016, 09:01 AM   #23
Brian Bushnell (Super Moderator, Walnut Creek, CA)

@chiayi -

I've uploaded a new version of BBMap (36.73) that fixes the problem of incorrect memory estimation for .bz2 files. I'd still recommend setting -Xmx to slightly under half your requested memory for your cluster, though. Please let me know if this resolves the problem.
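For example, with a 48 GB scheduler request, something like this would be a reasonable starting point (just an illustration; adjust -Xmx to your own allocation):
Code:
clumpify.sh in=reads.fq.bz2 out=clumped.fq.bz2 -Xmx20g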
12-20-2016, 09:17 AM   #24
santiagorevale (Junior Member, UK)

Quote:
Originally Posted by Brian Bushnell
Hi Santiago,

I will add support for twin files. Possibly this week, time permitting, otherwise probably next week. I find interleaved files much more convenient, but I suppose twin files are more popular overall.

BBTools changes are in /bbmap/docs/changelog.txt; just search for the current version number.

-Brian
Hi Brian,

Thank you very much for both (quick) replies. I have another question: how many cores does Clumpify need? I work in a cluster environment (using SGE), and the amount of memory available to a job depends on the number of cores you assign to it.
12-20-2016, 09:29 AM   #25
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Clumpify can use any number of cores, and particularly if you have pigz installed (which I highly recommend when running with multiple cores), it will use all of them. You can restrict it to a single core with "threads=1" if you want. Since you get optimal compression and speed by using as much memory as possible, I generally recommend running it on an exclusively scheduled node and letting it use all of the memory and cores; on a 16-core, 128 GB machine, it will generally run at least 16 times faster with the whole machine than when restricted to 1 core and 8 GB of RAM.

But ultimately, it will still complete successfully with 1 core and 8 GB of RAM. The only difference in compression is that you get roughly 5% better compression when the whole dataset fits into memory than when it doesn't.
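For example, on a single-core SGE slot with 8 GB, something along these lines would work (just an illustration; set -Xmx comfortably under the slot's allocation):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz threads=1 -Xmx6g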
12-20-2016, 09:45 AM   #26
santiagorevale (Junior Member, UK)

Oh, thanks! The "threads" option is missing from the command documentation. Also, how should I tell the program to use pigz/pbzip2 (instead of gzip/bzip2)? Does it detect them automatically or do I have to specify them? I saw in a previous comment that you mentioned the pigz=f option for something, so I imagine there are pigz and pbzip2 options that should be set to true? I haven't found these options documented.

Thanks again!
12-20-2016, 09:53 AM   #27
Brian Bushnell (Super Moderator, Walnut Creek, CA)

By default, pigz=t and pbzip2=t. If the files are named .gz or .bz2, pigz and pbzip2 will be used automatically as long as they are in the path, and they will be preferred over gzip and bzip2.

As for "threads", there are some flags (like "threads", "pigz", "fastawrap", etc) that are shared by all BBTools. There are actually quite a lot of them so I don't normally mention them, to avoid bloating the usage information. But, there's a (hopefully) complete description of them in /bbmap/docs/UsageGuide.txt, in the "Standard Flags" section.
12-21-2016, 04:55 AM   #28
santiagorevale (Junior Member, UK)

Hi Brian, thanks for the clarification! I'll be waiting for the new Clumpify update then. Happy holidays!
01-05-2017, 11:50 AM   #29
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Clumpify can now do duplicate removal with the "dedupe" flag. Paired reads are only considered duplicates if both reads match. By default, all copies of a duplicate are removed except the highest-quality copy, which is retained. By default subs=2, so 2 substitutions (mismatches) are allowed between "duplicates" to compensate for sequencing error, but this can be overridden. I recommend allowing substitutions during duplicate removal; otherwise, it will enrich the dataset with reads containing errors.

Example commands:

Clumpify only; don't remove duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz
Remove exact duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=0
Mark exact duplicates, but don't remove them (they get " duplicate" appended to the name):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz markduplicates subs=0
Remove duplicates, allowing up to 5 substitutions between copies:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=5
Remove ALL copies of reads with duplicates rather than retaining the best copy:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe allduplicates
Remove optical duplicates only (duplicates within 40 pixels of each other):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40 spantiles=f
Note that the optimal setting for dist is platform-specific; 40 is fine for NextSeq and HiSeq2500/1T.

Remove optical duplicates and tile-edge duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40

Clumpify only detects duplicates within the same clump. Therefore, it will always detect 100% of identical duplicates, but is not guaranteed to find all duplicates with mismatches. This is similar to deduplication by mapping - with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected. However, Clumpify is more sensitive to errors than mapping-based duplicate detection. To increase sensitivity, you can reduce the kmer length from the default of 31 to a smaller number like 19 with the flag "k=19", and increase the number of passes from the default of 1 to, say, 3:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe k=19 passes=3 subs=5
Each pass has a chance of identifying more duplicates, because a different kmer will be selected for seeding clumps; thus, given enough passes, any pair of duplicates that shares a single kmer will eventually land in the same clump, regardless of how many errors they have. But in practice the majority are identified in the first pass, and you don't really get much more after about the 3rd pass.

Decreasing k and (especially) using additional passes will take longer, and there is no point in doing either if you are running with subs=0 (identical duplicates only), because in that case all duplicates are guaranteed to be found in the first pass. If all the data fits in memory, additional passes are extremely fast; the speed penalty is only noticeable when the data does not all fit in memory. Even so, passes=3 will generally still be much faster than using mapping to remove duplicates.

I am still working on adding twin-file support to Clumpify, by the way.
01-05-2017, 03:02 PM   #30
Brian Bushnell (Super Moderator, Walnut Creek, CA)

I ran some parameter sweeps on NextSeq E. coli 2x150 bp reads to illustrate the effects of the parameters on duplicate removal.


This shows how the "dist" flag effects optical and edge duplicates. The command used was:
Code:
clumpify.sh in=first2m.fq.gz dedupe optical dist=D passes=3 subs=10 spantiles=t
...where D ran from 1 to 80, and spantiles was set to both true and false. The point at 100 is not for dist=100, but actually dist=infinity, so it represents all duplicates regardless of location. "spantiles=t", the default, includes tile-edge duplicates, while "spantiles=f" includes only optical or well duplicates (on the same tile and within D pixels of each other). This data indicates that for optical duplicates dist=30 is adequate for NextSeq, while for tile-edge duplicates, a larger value like 50 is better.


The second plot (Dupes_vs_Passes.png, attached below) shows the effect of increasing the number of passes for duplicate removal. The command was:
Code:
clumpify.sh in=first2m.fq.gz dedupe passes=P subs=10 k=19
...where P ran from 1 to 12. As you can see, even when allowing a large number of substitutions, the value of additional passes rapidly diminishes. If subs was set to 0 there would be no advantage to additional passes.


The third plot (Dupes_vs_Subs.png, attached below) shows how additional "duplicates" are detected when more mismatches are allowed. The NextSeq platform has a high error rate, and it's probably particularly bad at the tile edges (where most of these duplicates are located), which is why so many of the duplicates have a large number of mismatches. HiSeq 2500 data looks much better than this, with nearly all of the duplicates discovered at subs=1. The command used:
Code:
clumpify.sh in=first2m.fq.gz dedupe passes=3 subs=S k=19
...where S ran from 0 to 12.
Attached images: Dupes_vs_Distance.png, Dupes_vs_Passes.png, Dupes_vs_Subs.png

01-05-2017, 06:06 PM   #31
luc (Senior Member, US)

Hi Brian,

that dedupe function looks great! We have been waiting for such a tool.
01-05-2017, 06:28 PM   #32
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Thanks, luc, I appreciate it.
01-16-2017, 06:12 AM   #33
dpryan (Devon Ryan, Freiburg, Germany)

Hi Brian, any update on allowing non-interleaved input/output? I'd love to remove the reformat.sh steps before and after clumpify.sh.
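For context, the round trip I'm doing at the moment looks roughly like this (a sketch using reformat.sh's standard in1/in2/out and in/out1/out2 flags; file names are just placeholders):
Code:
reformat.sh in1=r1.fq.gz in2=r2.fq.gz out=interleaved.fq.gz
clumpify.sh in=interleaved.fq.gz out=clumped.fq.gz
reformat.sh in=clumped.fq.gz out1=c1.fq.gz out2=c2.fq.gz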
01-16-2017, 10:28 AM   #34
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Hi Devon,

Yes, this is all done; I just haven't released it yet. I'll do so tomorrow (difficult for me to do where I am now; today's a vacation day here).
01-16-2017, 11:59 AM   #35
dpryan (Devon Ryan, Freiburg, Germany)

Ah, right, MLK day. Enjoy the day off and stop checking SEQanswers!
01-18-2017, 09:48 AM   #36
Brian Bushnell (Super Moderator, Walnut Creek, CA)

It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

Code:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
01-18-2017, 10:10 AM   #37
GenoMax (Senior Member, East Coast USA)

Quote:
Originally Posted by Brian Bushnell
It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

Code:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
Yay! Two fewer operations ...
01-18-2017, 10:57 AM   #38
dpryan (Devon Ryan, Freiburg, Germany)

Quote:
Originally Posted by Brian Bushnell
It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

Code:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
A day late is still a quick turnaround.

Thanks for the great update!
01-23-2017, 01:18 AM   #39
dpryan (Devon Ryan, Freiburg, Germany)

Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.

The impetus behind this is removing optical duplicates before delivery to our labs, but still writing them to a separate file or files in case they need them for some reason.

BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
01-23-2017, 05:08 AM   #40
GenoMax (Senior Member, East Coast USA)

Quote:
Originally Posted by dpryan
Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.
That request has been in for some time. I also wanted to see counts (with the associated sequences) to gauge how acute a problem the duplicates may be.

For now use the following workaround provided by @Brian.

Code:
clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
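To get a quick count of how many reads were marked (a sketch; it just counts FASTQ header lines containing the appended " duplicate" tag):
Code:
awk 'NR%4==1 && /duplicate/' y.fq | wc -l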
Quote:
BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
This is a bit murky. I have done the sweeps with the HiSeq 4000 data I have access to. If I keep spantiles=f, then I don't see any optical dups until dupedist=20. Note: the edge-duplicate problem seen with NextSeq (which led @Brian to set spantiles=t by default) is not present in HiSeq 4000/MiSeq (again, based on data I have seen).

I have not yet pulled out the reads using the method above to look at the coordinates/sequences.

It may be good to see what you get.
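A sweep along the lines of @Brian's earlier commands would look roughly like this (a sketch; the dist values and file names are just examples, and you would compare the reads-in/reads-out counts reported by each run):
Code:
for D in 10 20 30 40 50 60; do
  clumpify.sh in=reads.fq.gz out=deduped_d$D.fq.gz dedupe optical dist=$D spantiles=f
done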

Tags
bbduk, bbmap, bbmerge, clumpify, compression, pigz, reformat, tadpole
