I decided to push my proof-of-concept code out of the door for people to experiment with, as I do not have time to take this further myself. However, I feel format-specific compression tools are well worth considering, given how staggeringly bioinformatics data has grown over the last few years.
I have two fastq compression tools (neither is "production quality" or supported, so beware); i.e. these are experimental only.
Code for both is on the Sanger Institute ftp site at:
ftp://ftp.sanger.ac.uk/pub/jkb/fqzcomp-1.0.tar.gz
ftp://ftp.sanger.ac.uk/pub/jkb/rngcod13+fqzcomp.tar.gz
I benchmarked them on a couple of data sets and compared them with other general-purpose tools. The first test is a 1Gb file to/from /dev/shm, with 54bp sequences.

Code:

Prog             Size        Encode time  Decode time
------------------------------------------------------
raw              1073741745  (2.0)        (2.0)
lzopack           499049563  11.7         5.3
quicklz           497987198  7.7          7.7
quicklz -3        424803464  65.1         5.5
gzip -1           375071650  30.1         12.8
lzopack -9        368383765  469.8        5.2
xz -1             318229712  134.3        33.5
gzip -6 (def.)    316890291  108.2        10.9
szip -o3          277408698  131.6        171.3
bsc -m0pTcpf      256937105  120.6        141.1
xz                253249104  1438.5       29.3
bzip2             249508099  414.9        118.6
fastq2fqz (-3)    244921604  22.5         12.9
fastq2fqz (-5)    238350173  27.8         13.0
bsc -m1pTcpf      233012984  111.3        152.4
szip -o6          232242005  295.8        233.0
fqzcomp           229624382  22.3         55.6
bsc -m2pTcpf      220238875  132.9        166.2

In the above, "raw" is a UNIX cat command, for comparison. Some of these tools you may well never have heard of, but see http://mattmahoney.net/dc/text.html for a comprehensive list.

On a smaller set of 250000 108bp sequences, which allowed me to go to town testing slower tools like paq8, we get this:

Code:

Prog            Size      Encode(s)  Decode(s)
---------------------------------------------------
simple_c        34445161  2.637      9.216
comp(0)         34165620  2.388      4.036
gzip -3         27822202  3.140      0.822
gzip            26441159  9.356      0.751
xz -3           22971956  67.62      2.400
comp1           22465448  2.335      3.364
xz              22450796  103.5      2.509
fastq2fqz       21595974  1.536      0.967
bzip2           21340457  10.99      5.813
szip            20540942  14.98      16.55
comp2           20287020  2.935      4.737
bsc -m2pTcpf    19365157  8.826      10.95
fqzcomp(1Mb)    19136330  1.589      2.500
bsc -m3pTcp     19063073  23.50      17.28
lpaq -9         18534618  178.6      (~encode)
paq8 -8         17730550  6043.8     (~encode)
It's impressive to see just how well the state-of-the-art general-purpose text compressors (paq) can do, albeit at *extreme* cost in CPU time. I tend to think of these as a baseline to try to approach. Although they can be beaten by code dedicated to specific formats, it's typically going to be very hard to do so while still being faster than, say, bzip2.
So the tools:
fastq2fqz/fqz2fastq:
These use LZ77 and Huffman encoding (via zlib, plus the interlaced Huffman encoder taken from the Staden Package's io_lib).
Hence it's particularly fast at decompression, as is usual with LZ+Huffman programs. It can be tweaked to be marginally faster than gzip at decompression if we ditch the interlaced Huffman encoding for the quality values and just call zlib again, but zlib's entropy encoder is far slower, so that slows down encoding and also gives poorer compression ratios.
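To make the "just call zlib again" option concrete, here is a minimal sketch (not the actual fastq2fqz code) of deflating a block of quality values through zlib's one-shot compress2() call. The quality string and compression level are made up for illustration; only the zlib calls themselves are real.

Code:

/* Sketch: compress a block of quality values with zlib alone.
 * Build with: cc qual_zlib_sketch.c -lz
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    /* Made-up phred qualities; in a real tool this would be the
     * concatenated quality lines pulled out of the fastq file. */
    const char *qual = "IIIIIIIIIIIIIHHHHHGGGGFFFEEDDCCBBBAAA@@@???>>>===<<<;;";
    uLong  in_len  = (uLong)strlen(qual);
    uLongf out_len = compressBound(in_len);
    Bytef *out     = malloc(out_len);

    /* Z_BEST_SPEED keeps the entropy-coding cost down, at the price of
     * a worse ratio -- the trade-off discussed above. */
    if (compress2(out, &out_len, (const Bytef *)qual, in_len, Z_BEST_SPEED) != Z_OK) {
        fprintf(stderr, "compress2 failed\n");
        return 1;
    }
    printf("%lu quality bytes -> %lu compressed bytes\n",
           (unsigned long)in_len, (unsigned long)out_len);

    free(out);
    return 0;
}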
Either way, it's an order of magnitude faster than bzip2 (both encoding and decoding) while giving comparable compression ratios.
Note that this tool MUST have fixed-size lines, and it only supports ACGTN.
fqzcomp/defqzcomp:
For this I experimented with probabilistic modelling and a range coder for entropy encoding. I chose to use Michael Schindler's example source from http://www.compressconsult.com/rangecoder/.
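As a rough illustration of the modelling half of this (the range coder itself is Michael Schindler's and is not reproduced here), the sketch below keeps adaptive counts for an order-1 model over ACGTN of the kind that would be fed to a range coder. It is NOT the fqzcomp model: the sequence data, the adaptation increment and the flat prior are all made up, and instead of driving a real coder it just sums the ideal -log2(p) cost that an exact entropy coder would approach.

Code:

/* Sketch: adaptive order-1 model over ACGTN, estimating coded size.
 * Build with: cc model_sketch.c -lm
 */
#include <stdio.h>
#include <string.h>
#include <math.h>

#define NSYM 5                      /* A, C, G, T, N */

static int sym(char c) {
    switch (c) {
    case 'A': return 0; case 'C': return 1;
    case 'G': return 2; case 'T': return 3;
    default:  return 4;             /* N or anything else */
    }
}

int main(void) {
    /* Made-up sequence data for illustration. */
    const char *seq = "ACGTACGTACGTTTTTGGGGACACACACGTGTGTGTACGTACGTAAAACCCC";
    size_t len = strlen(seq);

    /* freq[prev][cur]: counts conditioned on the previous base. */
    unsigned freq[NSYM][NSYM];
    for (int i = 0; i < NSYM; i++)
        for (int j = 0; j < NSYM; j++)
            freq[i][j] = 1;          /* flat prior */

    double bits = 0;
    int prev = 0;
    for (size_t i = 0; i < len; i++) {
        int cur = sym(seq[i]);

        unsigned tot = 0;
        for (int j = 0; j < NSYM; j++)
            tot += freq[prev][j];

        /* A range coder would be handed (cumulative freq, freq[prev][cur],
         * tot) at this point; here we only accumulate the ideal cost. */
        bits += -log2((double)freq[prev][cur] / tot);

        freq[prev][cur] += 8;        /* adapt: recent symbols become cheaper */
        prev = cur;
    }

    printf("%zu bases -> ~%.1f bits (%.2f bits/base)\n", len, bits, bits / len);
    return 0;
}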
The compression performance is very good. Encoding speed is particularly good, even beating gzip -1, but decoding speed is unfortunately only about half that of encoding, so it's quite slow compared to many tools. I know there are faster entropy encoders out there, so I'm sure there is room for improvement on speed. Even so, it runs fast compared to tools with comparable compression ratios.
The fqzcomp program should support variable-length sequences, unlike fastq2fqz. I'm not sure which DNA letters it accepts, but probably anything.
James
edit: fixed link to the new version of Matt Mahoney's chart.