SEQanswers

Old 08-10-2010, 05:41 AM   #1
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default Fastq compression - proof of concept

I decided to shove my proof of concept code out of the door for people to experiment with, as I do not have time to take this further myself. However, I feel format-specific compression tools are well worth considering, given how staggeringly bioinformatics data has grown in the last few years.

I have two fastq compression tools (neither is "production quality" nor supported, so beware); i.e. these are experimental only.

Code for both is on the Sanger Institute ftp site at:

ftp://ftp.sanger.ac.uk/pub/jkb/fqzcomp-1.0.tar.gz
ftp://ftp.sanger.ac.uk/pub/jkb/rngcod13+fqzcomp.tar.gz

I benchmarked them on a couple of data sets and compared them with other general-purpose tools: a 1Gb file to/from /dev/shm, 54bp sequences.

Code:
Prog           Size             Encode time     Decode time
------------------------------------------------------------------
raw            1073741745        (2.0)           (2.0)
lzopack         499049563        11.7             5.3
quicklz         497987198         7.7             7.7
quicklz -3      424803464        65.1             5.5
gzip -1         375071650        30.1            12.8
lzopack -9      368383765       469.8             5.2
xz -1           318229712       134.3            33.5
gzip -6 (def.)  316890291       108.2            10.9
szip -o3        277408698       131.6           171.3
bsc -m0pTcpf    256937105       120.6           141.1
xz              253249104      1438.5            29.3
bzip2           249508099       414.9           118.6
fastq2fqz (-3)  244921604        22.5            12.9
fastq2fqz (-5)  238350173        27.8            13.0
bsc -m1pTcpf    233012984       111.3           152.4
szip -o6        232242005       295.8           233.0
fqzcomp         229624382        22.3            55.6
bsc -m2pTcpf    220238875       132.9           166.2
In the above, "raw" is a plain UNIX cat command, for comparison. You may well not have heard of some of these tools, but see http://mattmahoney.net/dc/text.html for a comprehensive list.

On a smaller set of 250000 108bp sequences, allowing me to go to town testing slower tools like paq8, we get this:

Code:
Prog            Size            Encode(s) Decode(s)
---------------------------------------------------
simple_c        34445161        2.637     9.216
comp(0)         34165620        2.388     4.036
gzip -3         27822202        3.140     0.822
gzip            26441159        9.356     0.751
xz -3           22971956        67.62     2.400
comp1           22465448        2.335     3.364
xz              22450796        103.5     2.509
fastq2fqz       21595974        1.536     0.967
bzip2           21340457        10.99     5.813
szip            20540942        14.98     16.55
comp2           20287020        2.935     4.737
bsc -m2pTcpf    19365157        8.826     10.95
fqzcomp(1Mb)    19136330        1.589     2.500
bsc -m3pTcp     19063073        23.50     17.28
lpaq -9         18534618        178.6     (~encode)
paq8 -8         17730550        6043.8    (~encode)
It's impressive to see just how well the state-of-the-art general-purpose text compressors (paq) can do, albeit at *extreme* cost in CPU time. I tend to think of these as a baseline to try to approach. Although they can be beaten with code dedicated to specific formats, it's typically going to be very hard to do so while still being faster than, say, bzip2.

So the tools:

fastq2fqz/fqz2fastq:

These use LZ77 and Huffman encoding (both via zlib and the interlaced Huffman encoder taken from the Staden Package's io_lib).

Hence it's particularly fast at decompression, as is usual with LZ+Huffman programs. It can be tweaked to be marginally faster than gzip at decompression by ditching the interlaced Huffman encoding for quality values and just calling zlib again, but zlib's entropy encoder is far slower, so that slows encoding down and also gives poorer compression ratios.

Either way, it's an order of magnitude faster than bzip2 (both encoding and decoding) while giving comparable compression ratios.

Note that this tool MUST have fixed-size lines, and it only supports ACGTN.
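None of the actual fastq2fqz code is shown here, but the core idea it relies on can be sketched in a few lines of Python: deinterleave the FASTQ into separate name, sequence and quality streams and compress each on its own, since the per-stream statistics are exactly what a format-specific tool exploits and what a general-purpose compressor working on the interleaved file cannot see. The `split_streams` and `compressed_size` helpers are illustrative only.

```python
import zlib

def split_streams(fastq_text):
    """Deinterleave FASTQ text into name, sequence and quality streams."""
    names, seqs, quals = [], [], []
    lines = fastq_text.splitlines()
    for i in range(0, len(lines), 4):
        names.append(lines[i])      # @name line
        seqs.append(lines[i + 1])   # sequence line
        quals.append(lines[i + 3])  # quality line (the '+' separator is dropped)
    return "\n".join(names), "\n".join(seqs), "\n".join(quals)

def compressed_size(data):
    """Size in bytes after zlib compression at maximum level."""
    return len(zlib.compress(data.encode(), 9))

# Toy data; on real reads the three streams have very different statistics.
fastq = "".join(
    "@read%d\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIHHHHHHGGGFF\n" % i
    for i in range(1000)
)

whole = compressed_size(fastq)
names, seqs, quals = split_streams(fastq)
split = sum(compressed_size(s) for s in (names, seqs, quals))
print(whole, split)  # on real data the split streams typically win
```

Each stream can then be fed to an encoder suited to it (e.g. the interlaced Huffman coder for qualities), which is where the ratio gains over plain gzip come from.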

fqzcomp/defqzcomp

For this I experimented with probabilistic modelling and a range coder for entropy encoding. I chose Michael Schindler's example source for this, from http://www.compressconsult.com/rangecoder/.

The compression performance is very good. Encoding speed is particularly good, even beating gzip -1, but decoding speed is unfortunately about half that of encoding, so it's quite slow compared to many tools. I know there are faster entropy encoders out there, so I'm sure there is room for improvement in speed. Even so, it runs fast compared to tools with comparable compression ratios.

The fqzcomp program should support variable-length sequences, unlike fastq2fqz. I'm not sure what DNA letters it accepts, but probably anything.
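fqzcomp's actual models are not reproduced here, but the principle behind "probabilistic modelling plus a range coder" can be sketched: an adaptive order-k context model assigns each base a probability given the preceding k bases, and an ideal entropy coder (a range coder approaches this) spends -log2(p) bits on it. The `model_bits` helper below is a toy estimator of that cost, not fqzcomp code; it shows why context modelling beats a flat order-0 model on sequence data.

```python
import math

ALPHABET = "ACGT"

def model_bits(seq, order):
    """Estimate the bits an ideal entropy coder would spend on `seq`
    using an adaptive order-`order` context model.  Every context
    starts with a count of 1 per symbol (Laplace smoothing), and the
    model is updated after each symbol is coded, exactly as a
    decoder could mirror."""
    counts = {}
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]          # preceding k bases
        freq = counts.setdefault(ctx, {s: 1 for s in ALPHABET})
        total = sum(freq.values())
        bits += -math.log2(freq[sym] / total)   # ideal code length
        freq[sym] += 1                          # adapt after coding
    return bits

# A repetitive sequence: order >= 1 captures the repeat; order 0 cannot.
seq = "ACGTACGTACGT" * 200
for k in (0, 1, 2, 3):
    print(k, round(model_bits(seq, k) / len(seq), 3), "bits/base")
```

In a real codec the same adaptive counts drive the range coder's cumulative frequency intervals, so encoder and decoder stay in lockstep without transmitting the model.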

James

edit: fixed link to the new version of Matt Mahoney's chart.

Last edited by jkbonfield; 08-10-2010 at 05:43 AM.
Old 08-10-2010, 06:53 AM   #2
adamdeluca
Member
 
Location: Iowa City, IA

Join Date: Jul 2010
Posts: 95
Default

Very interesting, I will give them a try.

Something to look at, there is a parallel implementation of bzip2
http://compression.ca/pbzip2/
Old 08-10-2010, 07:17 AM   #3
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

What would be really nice would be for some of these options to be available in the downstream tools themselves -- e.g. bwa & bowtie (as far as I know) need the input FASTQs decompressed. It would certainly be convenient if they could read the compressed formats (though bwa with short reads ends up reading them twice, so the overhead of decompressing twice might not be worth it).
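As a sketch of what such support could look like inside a tool (the `open_fastq` helper is hypothetical, not from bwa or bowtie): sniff the two-byte gzip magic number rather than trusting the file extension, and fall back to plain text.

```python
import gzip
import os
import tempfile

def open_fastq(path):
    """Open a FASTQ file in text mode, transparently handling gzip.
    Detects compression from the gzip magic bytes (0x1f 0x8b)."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")

# Demo: write a gzipped FASTQ record and read it back transparently.
record = "@read1\nACGTN\n+\nIIIII\n"
tmp = os.path.join(tempfile.mkdtemp(), "reads.fastq.gz")
with gzip.open(tmp, "wt") as out:
    out.write(record)
with open_fastq(tmp) as fh:
    print(fh.read() == record)  # True
```

Because the result is an ordinary file-like object, downstream parsing code needs no changes; the cost is only the decompression CPU time, which for a second pass over the reads may or may not be worth it, as noted above.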
Old 08-10-2010, 07:23 AM   #4
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

The bsc tool is also parallel: both multi-threaded and MPI-capable. I disabled that for the purposes of benchmarking, though, to be fair. See http://libbsc.com for more details. I've been quite impressed with it so far.

James
Old 08-10-2010, 12:37 PM   #5
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

bwa has supported gzip'ed fastq for nearly two years. A minor modification could make it work with bzip2'ed or bsc'ed fastq files, although by design bwa cannot support multiple compression algorithms at the same time. Maq's gzip support came later and is only available in SVN. Bowtie accepts piped input, so whether it supports compression directly does not matter too much.

BTW, I did not know bsc before, but it looks very impressive to me, too.

EDIT: a lot of free compressors (e.g. quicklz, bsc and the rangecoder) are licensed under the GPL or LGPL. This becomes annoying when we want to release source code under a permissive open-source license (e.g. BSD or MIT/X11) so that everyone can use the library/tool freely. Another similar practical issue is the availability of bindings for other languages. gzip is by far the most widely supported library.

Last edited by lh3; 08-10-2010 at 01:00 PM.
Old 08-10-2010, 01:45 PM   #6
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Bfast has also supported gzip and bzip2 for a long time.
__________________
-drd
Old 08-10-2010, 04:12 PM   #7
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Yeah GPL can be a pain like that at times.

For what it's worth, I'm happy to release fastq2fqz and fqz2fastq under BSD. It's a fairly trivial mix of zlib and Staden io_lib anyway, both of which are already BSD.

The fqzcomp code was based on GPL code, although the basic design of what it does is trivial enough to rewrite using a more free library. (Hah! "more free" - that'll wind up the GPL crowd.) I doubt I'd ever get the time, though.

James

PS. I'm totally with you on gzip being ubiquitous in language bindings. It's also incredibly fast at decompression compared to most, so it's ideal for a lot of our use cases. It's good to see many tools using at least some sort of on-the-fly compression.