#1
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
I'd like to introduce a new member of the BBMap package, Clumpify. This is a bit different from other tools in that it does not actually change your data at all, simply reorders it to maximize gzip compression. Therefore, the output files are still fully-compatible gzipped fastq files, and Clumpify has no effect on downstream analysis aside from making it faster. It’s quite simple to use:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder

How does this work?

Clumpify operates on a principle similar to the one that makes sorted bam files smaller than unsorted bam files: reads with similar sequence are reordered so that they sit near each other, which makes gzip compression more efficient. But unlike a sorted bam, pairs are kept together during this process, so an interleaved file remains interleaved with pairing intact. Also unlike a sorted bam, it does not require mapping or a reference, and except in very unusual cases it can be done with an arbitrarily small amount of memory. So it is very fast and memory-efficient compared to mapping, and needs no knowledge of what organism(s) the reads came from. Internally, Clumpify forms clumps of reads sharing special 'pivot' kmers, implying that those reads overlap. These clumps are then further sorted by the position of the kmer in the read, so that within a clump the reads are position-sorted. The net result is a list of sorted clumps of reads, yielding compression within a percent or so of sorted bam.

How long does Clumpify take?

It's very fast. If all the data fits in memory, Clumpify needs about the time it takes to read and write the file once. If the data does not fit in memory, it takes around twice that long.

Why does this increase speed?

Many processes are I/O-limited. For example, on a multicore processor, running BBDuk, BBMerge, Reformat, etc. on a gzipped fastq will generally be rate-limited by gzip decompression (even if you use pigz, which is much faster at decompression than gzip). Gzip decompression seems to be rate-limited by the number of input bytes per second rather than output bytes, meaning that a file will decompress roughly X% faster if it is compressed Y% smaller; X and Y are proportional, though not quite 1-to-1. In my tests, assemblies with Spades and Megahit show time reductions from Clumpified input that more than pay for the time needed to run Clumpify, largely because both are multi-kmer assemblers that read the input file multiple times. Something purely CPU-limited like mapping would normally not benefit much in terms of speed (though still a bit, due to improved cache locality).

When and how should Clumpify be used?

If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps in a way that maintains read order (e.g. use the "ordered" flag if you use BBDuk for adapter-trimming) so that the clump order is preserved; that way, all intermediate files benefit from the increased compression and increased speed. I recommend running Clumpify on ALL data that will ever go into long-term storage, or whenever there is a long pipeline with multiple steps and intermediate gzipped files. Also, even when data will not go into long-term storage, if a shared filesystem is being used or files need to be sent over the internet, running Clumpify as early as possible will conserve bandwidth.
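For example, a downstream pipeline that preserves clump order might look like the sketch below (the file names, adapter reference, and trimming parameters are placeholder examples rather than recommendations; the key detail is the "ordered" flag):

Code:
# Clumpify the raw reads as early as possible
clumpify.sh in=raw.fq.gz out=clumped.fq.gz reorder

# Downstream steps keep the clump order by using the "ordered" flag
bbduk.sh in=clumped.fq.gz out=trimmed.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe ordered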
The only times I would not clumpify data are enumerated below.

When should Clumpify not be used?

There are a few cases where it probably won't help:

1) Reads with very low kmer depth, due to either very low coverage (like 1x WGS) or a very high error rate (like raw PacBio data). It won't hurt anything, but it won't accomplish anything either.

2) Large volumes of amplicon data. This may or may not help; if all of your reads are expected to share the same kmers, they may all form one giant clump and again nothing will be accomplished. It won't hurt anything, and if pivots are randomly selected from variable regions, it might still increase compression.

3) When your process depends on the order of reads. If you always grab the first million reads from a file, assuming they are a good representation of the rest of the file, Clumpify will make that assumption invalid – just as grabbing the first million reads from a sorted bam file would not be representative. Fortunately, that was never a good practice, so if you are currently doing it, now would be a good opportunity to change your pipeline; random subsampling is a much better approach.

4) If you are only going to read the data fewer than ~3 times, it will never go into long-term storage, and it sits on local disk so bandwidth is not an issue, there is no point in using Clumpify (or gzip, for that matter).

As always, please let me know if you have any questions, and please make sure you are using the latest version of BBTools when trying new functionality.

P.S. For maximal compression, you can output bzipped files by using the .bz2 extension instead of .gz, if bzip2 or pbzip2 is installed. This is actually pretty fast if you have enough cores and pbzip2 installed, and with enough cores it decompresses even faster than gzip. This increases compression by around 9%.

Last edited by Brian Bushnell; 12-04-2016 at 12:23 AM.
#2
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
Can this be extended to identify PCR-duplicates and optionally flag or eliminate them?
Would piping the output of clumpify into dedupe achieve fast de-duplication?
#3
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
Going to put in a plug for the tens of other things that tools in the BBMap suite can do. A compilation is available in this thread.
#4
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Quote:

Quote:
#5
Junior Member
Location: Hong Kong Join Date: Jun 2015
Posts: 4
Quote:
Great work, Brian.
#6
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
Quote:
Edit: On second thought that may not be practical/useful, but I will leave the question in for now to see if @Brian has any pointers. For a 12G input gzipped fastq file, Clumpify made 28 temp files (each between 400-600M in size).

Edit 2: Final file size was 6.8G, so a significant reduction in size.

Last edited by GenoMax; 12-06-2016 at 12:38 PM.
#7
Senior Member
Location: Cambridge Join Date: Sep 2010
Posts: 116
Dear Brian,
Thank you very much for the tool; it can be very helpful for I/O-bound cloud folks. Are there any plans to include fasta support for amino acid sequences (grouping similar proteins together)? It would need to support very long fasta ID lines - up to 10 Kb.
#8
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Quote:
Incidentally, Clumpify has an error-correction mode, but I was unable to get that to improve Megahit assemblies (even though it does improve Spades assemblies). Megahit has thus far been recalcitrant to my efforts to improve its assemblies with any form of error-correction, which I find somewhat upsetting.

Quote:
Quote:
Last edited by Brian Bushnell; 12-06-2016 at 10:48 AM.

#9
Senior Member
Location: Cambridge Join Date: Sep 2010
Posts: 116
Quote:
1. Parse the fasta, reverse-translate to DNA using a single codon for each amino acid;
2. save as nt fastq;
3. clumpify;
4. parse the fastq, translate back;
5. save as aa fasta.
#10
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
Quote:
#11
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Whether you use Clumpify or CD-Hit, I'd be very interested if you could post the file size results before and after.
Incidentally, you can use BBTools to do AA <-> NT translation like this:

Code:
translate6frames.sh in=proteins.faa.gz aain=t aaout=f out=nt.fna
clumpify.sh in=nt.fna out=clumped.fna
translate6frames.sh in=clumped.fna out=protein2.faa.gz frames=1 tag=f zl=6
#12
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
I ran some benchmarks on 100x NextSeq E.coli data, to compare file sizes under various conditions:
Clumpified data is almost as small as mapped, sorted data, but takes much less time. The exact sizes, in bytes, were:

Code:
100x.fq.gz              360829483
clumped.fq.gz           251014934

With binned quality scores:

Code:
100x_binned.fq.gz       267955329
clumped_binned.fq.gz    161766626

This is the script I used to generate these sizes and times:

Code:
time clumpify.sh in=100x.fq.gz out=clumped_noreorder.fq.gz
time clumpify.sh in=100x.fq.gz out=clumped.fq.gz reorder
time clumpify.sh in=100x.fq.gz out=clumped_lowram.fq.gz -Xmx1g
time clumpify.sh in=100x.fq.gz out=clumped.fq.bz2 reorder
time reformat.sh in=100x.fq.gz out=100x.fq.bz2
time bbmap.sh in=100x.fq.gz ref=ecoli_K12.fa.gz out=mapped.bam bs=bs.sh; time sh bs.sh
reformat.sh in=mapped_sorted.bam out=sorted.fq.gz zl=6
reformat.sh in=mapped_sorted.bam out=sorted.sam.gz zl=6
reformat.sh in=mapped_sorted.bam out=sorted.fq.bz2 zl=6
#13
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
Interesting tool. Though I wish it could deal with "twin files" (separate R1/R2 files), as these are the initial "raw files" of Illumina's bcl2fastq output. Additionally, many tools require the pairs to be separated ... converting back and forth :-)
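For illustration, the back-and-forth with the current interleaved-only mode would look something like the sketch below (hypothetical file names, using BBTools' reformat.sh to interleave and de-interleave):

Code:
# Interleave the twin files
reformat.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=interleaved.fq.gz
# Clumpify the interleaved file
clumpify.sh in=interleaved.fq.gz out=clumped.fq.gz reorder
# Split back into twin files for tools that need separate pairs
reformat.sh in=clumped.fq.gz out=clumped_R1.fq.gz out2=clumped_R2.fq.gz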
#14
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
OK, I'll make a note of that... there's nothing preventing paired file support, it's just simpler to write for interleaved files when there are stages involving splitting into lots of temp files. But I can probably add it without too much difficulty.
#15
Member
Location: New York Join Date: Dec 2016
Posts: 22
Hello Brian,
I started to use Clumpify and the size was reduced ~25% on average for NextSeq Arabidopsis data. Thanks for the development! In a recent run on HiSeq maize data, I got an error for some (but not all) of the files. At first the run would get stuck at fetching and eventually fail due to not enough memory (set to 16G), even though the memory estimate was ~2G.

Code:
Clumpify version 36.71
Memory Estimate:  2685 MB
Memory Available: 12836 MB
Set groups to 1
Executing clump.KmerSort [in=input.fastq.bz2, out=clumped.fastq.gz, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx16g, reorder=t]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
=>> job killed: mem job total 17312912 kb exceeded limit 16777216 kb

Code:
Starting threads.
Waiting for threads.
Fetch time:     321.985 seconds.
Closing input stream.
Combining thread output.
Combine time:   0.108 seconds.
Sorting.
Sort time:      33.708 seconds.
Making clumps.
/home/cc5544/bin/clumpify.sh: line 180: 45220 Killed
#16
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
It looks like in both cases, Clumpify did not run out of memory, but was killed by your job scheduling system or OS. This can happen sometimes when the job scheduler is designed to instantly kill processes when virtual memory exceeds a quota; it used to happen on JGI's cluster until we made some adjustments. The basic problem is this:
When Clumpify (or any other program) spawns a subprocess, that uses a fork operation, and the OS temporarily allocates twice the original virtual memory. It seems very strange to me, but here's what happens in practice:

1) You run Clumpify on a .bz2 file and tell it to use 16 GB with the flag -Xmx16g, or similar. Even if it only needs 2 GB of RAM to store the input, it will still use (slightly more than) 16 GB of virtual memory.

2) Clumpify sees that the input file is .bz2. Java cannot natively process bzipped files, so it starts a subprocess running bzip2 or pbzip2. That means a fork occurs, and for a tiny fraction of a second the processes are using 32 GB of virtual memory (even though at that point nothing has been loaded, so the physical memory being used is only 40 MB or so). After that fraction of a second, Clumpify will still be using 16 GB of virtual memory and 40 MB of physical memory, and the bzip2 process will be using a few MB of virtual and physical memory.

3) The job scheduler looks at the processes every once in a while to see how much memory they are using. If you are unlucky, it might look right at the exact moment of the fork. Then, if you only scheduled 16 GB and are using 32 GB of virtual memory, it will kill your process, even though you are only using 40 MB of physical memory at that time.

Personally, I consider this to be a major bug in the job schedulers that have this behavior. Also, not allowing programs to over-commit virtual memory (meaning, use more virtual memory than is physically present) is generally a very bad idea; virtual memory is free, after all. What job scheduler are you using? And do you know what your cluster's policy is for over-committing virtual memory?

I think that in this case the system will allow the program to execute and not kill it if you request 48 GB but add the flag "-Xmx12g" to Clumpify. That way, even when it uses a fork operation to read the bzipped input, and potentially another fork operation to write the gzipped output with pigz, it will still stay under the 48 GB kill limit. Alternately, you could decompress the input before running Clumpify and tell it not to use pigz with the pigz=f flag, but I think changing the memory settings is a better solution because that won't affect speed.

As for the 25% file size reduction - that's fairly low for NextSeq data with binned quality scores; for gzip in and gzip out, I normally see ~39%. Clumpify can output bzipped data if you name the output file .bz2; if your pipeline is compatible with bzipped data, that should increase the compression ratio a lot, since .bz2 files are smaller than .gz files. Of course, unless you are using pbzip2 the speed will be much lower; but with pbzip2 and enough cores, .bz2 files compress fast and decompress even faster than .gz.

Anyway, please try requesting 48 GB and using the -Xmx12g flag (or alternately requesting 16 GB and using -Xmx4g) and let me know if that resolves the problem.

Oh, I should also mention that if you request 16 GB, then even if the program is not doing any forks, you should NOT use the flag -Xmx16g; you should use something like -Xmx13g (roughly 85% of what you requested). Why? -Xmx16g sets the heap size, but Java needs some memory for other things too (like per-thread stack memory, memory for loading classes, memory for the virtual machine, etc.).
So if you need to set -Xmx manually because the memory autodetection does not work (in which case, I'd like to hear the details about what the program does when you don't define -Xmx, because I want to make it as easy to use as possible), then please allow some overhead. Requesting 16 GB and using the flag -Xmx16g is something I would expect to always fail on systems that do not allow virtual memory overcommit. In other words, your first command would possibly work fine if you just changed the -Xmx16g to -Xmx13g.
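To make that concrete, a submission along these lines should stay under the kill limit (a minimal sketch assuming a SLURM-style scheduler and hypothetical file names; adapt the request syntax to your own cluster):

Code:
#!/bin/bash
#SBATCH --mem=48G
# Keep the Java heap well below the scheduler request, leaving headroom
# for the extra virtual memory briefly used by the bzip2/pigz forks.
clumpify.sh in=input.fastq.bz2 out=clumped.fastq.gz reorder -Xmx12g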
#17
Member
Location: New York Join Date: Dec 2016
Posts: 22
Thank you so much for the thorough explanation. I tried a couple of things; please find the reports below:
Quote:
Quote:
Code:
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at stream.FASTQ.makeId(FASTQ.java:567)
	at stream.FASTQ.quadToRead(FASTQ.java:785)
	at stream.FASTQ.toReadList(FASTQ.java:710)
	at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
	at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)

This program ran out of memory. Try increasing the -Xmx flag and using tool-specific memory-related parameters.

Quote:
java -ea -Xmx17720m -Xms17720m -cp

#18
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Dear chiayi,
Is this data confidential, or is it possible for you to send it to me? I would really like to eliminate this kind of bug, but I'm not sure I can do it without the data that triggers the problem.

#19
Member
Location: New York Join Date: Dec 2016
Posts: 22
I would love that. I will send the access to you in a message momentarily. Thank you so much!
#20
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
OK, I figured out the problem. Clumpify initially decides whether or not to split the file into multiple groups depending on whether it will fit into memory. The memory estimation was taking into account gzip compression, but NOT bz2 compression. That's a very easy fix.
That only solves the last error (java.lang.OutOfMemoryError: GC overhead limit exceeded), not the first two, but once I fix it the last command will run fine. As for the first two errors, it looks like those are due to your cluster configuration being too aggressive about killing jobs; for any type of job where you see that message (which might just be limited to running BBTools on bz2 files), I recommend using the -Xmx parameter with slightly under half of the RAM you requested (e.g. -Xmx7g when requesting 16 GB).
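As a concrete illustration of that recommendation (the file names are placeholders):

Code:
# Job granted 16 GB by the scheduler; use well under half of it for the Java heap
clumpify.sh in=input.fastq.bz2 out=clumped.fastq.gz reorder -Xmx7g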
Last edited by Brian Bushnell; 12-18-2016 at 10:03 AM.
Tags |
bbduk, bbmap, bbmerge, clumpify, compression, pigz, reformat, tadpole |