07-14-2010, 02:34 AM  #1
KevinLam

G-SQZ: new tool to compress and index NGS reads

http://www.ncbi.nlm.nih.gov/pubmed/20605925
Bioinformatics. 2010 Jul 6. [Epub ahead of print]
G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

Tembe W, Lowey J, Suh E.

Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.
Abstract

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: http://public.tgen.org/sqz. Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site. CONTACT: Waibhav Tembe ([email protected]).

I am not affiliated with the authors, btw.
07-14-2010, 08:37 AM  #2
nilshomer

Quote:
Originally Posted by KevinLam
[the G-SQZ announcement and abstract, quoted in full above]
Here's how I would use it; how would you?

This could be extremely useful as datasets get larger. For example, I could first index (and compress) my reads; then, if I parallelize my alignment across many jobs, each job can read its own range of non-overlapping sequence reads. The index would allow a quick jump to the start of that range instead of reading through from the beginning (or something like that).
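
A minimal sketch of that in Python, assuming a hypothetical index format (one little-endian uint64 byte offset per record, plus a final offset equal to the file size); G-SQZ's real index layout may well differ:

import struct

def load_index(path):
    """Load record byte offsets: n_records + 1 entries, last = file size."""
    with open(path, "rb") as fh:
        raw = fh.read()
    return [struct.unpack_from("<Q", raw, i)[0] for i in range(0, len(raw), 8)]

def records_for_job(reads_path, offsets, job_id, n_jobs):
    """Yield the raw records assigned to one of n_jobs parallel workers."""
    n_records = len(offsets) - 1
    start = job_id * n_records // n_jobs
    end = (job_id + 1) * n_records // n_jobs
    with open(reads_path, "rb") as fh:
        fh.seek(offsets[start])              # jump straight to our range
        for i in range(start, end):
            yield fh.read(offsets[i + 1] - offsets[i])

# Job 3 of 16 would then read only its own slice of the file:
#   for rec in records_for_job("reads.sqz", load_index("reads.sqz.idx"), 3, 16):
#       ...feed rec to the aligner...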
07-14-2010, 07:04 PM  #3
KevinLam

One main problem I have with NGS reads is de novo assembly.
In a mixed environmental sample, I could potentially 'reduce' the reads to those from my organism of interest by aligning them to its genome with BFAST/BWA, then retrieve the mapped reads and run the de novo assembly on just those.
If the reads are sufficiently small in number, I could even use applications like Phrap to do the assembly.
07-15-2010, 02:12 AM  #4
jkbonfield

Did you experiment with range coders instead of Huffman? I've had some good results with those.

I have a couple of homebrew, noddy FASTQ compressors here, neither of which we actually use in anger. I don't have time to support them, but I can put the code on an FTP site if people want. Here's an example on SRR013951_2.fastq.gz downloaded from the EBI (I couldn't even figure out how to get a download from NCBI, as the link is permanently greyed out).


$ gzip -cd ~/scratch/data/SRR013951_2.fastq.gz | head -10000000 > 1mill.fastq
$ ls -l 1mill.fastq
-rw-rw-r-- 1 jkb team117 499441178 Jul 15 09:57 1mill.fastq

$ time gzip < 1mill.fastq | wc -c
216248482
real    1m13.814s
user    1m13.169s
sys     0m0.764s

$ time bzip2 < 1mill.fastq | wc -c
182653950
real    1m35.292s
user    1m34.382s
sys     0m0.920s

$ time ~/work/solexa/fastq2fqz < 1mill.fastq > 1mill.fqz_h; wc -c 1mill.fqz_h
real    0m37.521s
user    0m24.106s
sys     0m1.784s
170126704 1mill.fqz_h

$ time ~/ftp/rngcod13/fqzcomp < 1mill.fastq > 1mill.fqz_r; wc -c 1mill.fqz_r
real    0m12.377s
user    0m11.805s
sys     0m0.544s
167735961 1mill.fqz_r

Here fastq2fqz uses Huffman encoding on blocks of 1 million sequences at a time. Each block is encoded in three parts: all the sequence names, then all the sequences, then all the qualities. I use my Huffman implementation from io_lib (compatible with deflate, except without any LZ component).
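
As a toy illustration of that layout (nothing here matches the real on-disk format), split a block of records into the three streams and Huffman-code each stream separately, so every stream gets a code tuned to its own symbol skew:

import heapq
from collections import Counter

def huffman_code(stream):
    """Build a {symbol: bit-string} prefix code from symbol frequencies."""
    freq = Counter(stream)
    if len(freq) == 1:                        # degenerate one-symbol stream
        return {next(iter(freq)): "0"}
    heap = [[count, [sym, ""]] for sym, count in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]           # left branch
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]           # right branch
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: bits for sym, bits in heapq.heappop(heap)[1:]}

def block_stream_sizes(records):
    """records: (name, seq, qual) tuples; returns per-stream coded bytes."""
    streams = {"names": "\n".join(n for n, _, _ in records),
               "seqs":  "".join(s for _, s, _ in records),
               "quals": "".join(q for _, _, q in records)}
    sizes = {}
    for label, data in streams.items():
        code = huffman_code(data)             # one code per stream, per block
        sizes[label] = sum(len(code[ch]) for ch in data) // 8
    return sizes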

fqzcomp is built on top of Michael Schindler's range coder package. It uses a block size of 1 million lines and order-1 statistics for encoding, again with separate streams for names, sequences and qualities (and also for sequence lengths).
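
The range coder itself is too much to paste here, but the order-1 modelling half is tiny. A static two-pass sketch (the real coder adapts as it goes) that estimates the coded size by summing -log2 P(symbol | previous symbol):

import math
from collections import Counter, defaultdict

def order1_bits(stream):
    """Estimated size, in bits, of coding stream under a static order-1 model."""
    counts = defaultdict(Counter)
    prev = "\0"                               # start-of-stream context
    for sym in stream:
        counts[prev][sym] += 1
        prev = sym
    bits, prev = 0.0, "\0"
    for sym in stream:
        total = sum(counts[prev].values())
        bits += -math.log2(counts[prev][sym] / total)
        prev = sym
    return bits

# Quality strings gain the most from this: adjacent qualities are strongly
# correlated, so P(q | previous q) is much more peaked than P(q) alone.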

Huffman is a lot quicker at decompression than a range coder though, so it wins out there. That small set took 5.5s with gunzip, 10.2s with fqz2fastq (Huffman) and 28.1s with defqzcomp (range coder).

07-15-2010, 09:20 AM  #5
nilshomer

@jkbonfield

Can you quickly get a range of reads from your compressed file (à la BAM)?
07-15-2010, 09:27 AM  #6
jkbonfield

This was purely FASTQ compression, so unaligned data. However, in theory it could still be made randomly accessible by file offset; I just never implemented it (and probably won't, as it was nothing more than an intellectual exercise to see what's achievable).

The idea of reading in a block of data, compressing it independently, and then outputting it is that you can trivially keep a map of compressed-to-uncompressed block coordinates (exactly like BGZF does); any random position is then just load_block + decompress_block + a seek into the decompressed data.
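
In code the whole scheme is little more than this (formats assumed for the sketch: independent zlib blocks plus an in-memory list of offset pairs; BGZF stores its mapping differently):

import bisect
import zlib

def compress_blocks(data, block_size=65536):
    """Compress data in independent blocks; return (blob, offset index)."""
    blob, index = b"", []
    for u in range(0, len(data), block_size):
        index.append((u, len(blob)))          # (uncompressed, compressed) start
        blob += zlib.compress(data[u:u + block_size])
    index.append((len(data), len(blob)))      # sentinel marking end of file
    return blob, index

def read_at(blob, index, pos, length):
    """Random access within a block: locate it, decompress it, seek into it."""
    i = bisect.bisect_right([u for u, _ in index], pos) - 1
    u_start, c_start = index[i]
    block = zlib.decompress(blob[c_start:index[i + 1][1]])
    return block[pos - u_start : pos - u_start + length]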

My block sizes were somewhat huge, although even with 10k sequences per block it doesn't do too badly. I should probably tidy it up and punt it to an FTP site so at least people can experiment with it and pick it up to work on if they're bothered.
07-15-2010, 09:41 AM  #7
jkbonfield

I obtained both the G-SQZ software and one of their data files, and what the paper manages to gloss over is speed. It's SLOOOW. So much so that I feel somewhat mean publishing these results, especially somewhere the authors may not be reading!

However, I did wonder about this when reading the paper: by their own tests it's better than gzip and almost as good as bzip2. That left me wondering why not just add an index to bzip2 (it's block-based already) and use that instead?

1 million sequences from SRR013951_2.filt.fastq

Orig:     198Mb
Gzip:      85Mb,  23.3s,   3.0s
Bzip2:     71Mb,  28.2s,  14.1s
gsqz:      71Mb, 465.9s, 223.9s
fqzcomp:   66Mb,   3.7s,   8.8s

Times are compression followed by decompression. Gzip is clearly still the winner for sheer decompression speed, which is something LZ encoders have always excelled at.

Where gsqz wins is that it can do semi-random access by seeking to a block of data, as it keeps an index. Personally, I think it would be better to add that feature to another format.

Combining sequence and quality together to form a single symbol is an interesting concept, although now I think about it, I suspect it amounts to MAQ's compression, given that MAQ used 2 bits of base and 6 bits of quality and shoved the lot through zlib. (gsqz can cope with more than 2 bits of base, though, including SOLiD and ambiguity codes.)
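
For concreteness, the MAQ-style fused symbol is just one byte per call, quality in six bits and base in two (exactly which bits hold which, and what happens to Ns, I'm glossing over here), with the byte stream then deflated:

import zlib

BASE2BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS2BASE = "ACGT"

def pack(seq, quals):
    """Fuse each (base, quality) pair into a single byte; quality capped at 63."""
    return bytes((min(q, 63) << 2) | BASE2BITS[b] for b, q in zip(seq, quals))

def unpack(packed):
    """Recover the base string and quality list from the fused bytes."""
    return ("".join(BITS2BASE[byte & 3] for byte in packed),
            [byte >> 2 for byte in packed])

# zlib.compress(pack(seq, quals)) then codes a single stream in which every
# symbol carries both base and quality, which is what gsqz generalises.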
07-15-2010, 01:45 PM  #8
todd

Instead of adding an index to bzip2, one could also put bzip2 in front of BioHDF.
07-16-2010, 02:14 AM  #9
jkbonfield

Well, as you mention BioHDF, I gave it a whirl.

I had to hack the FASTQ format a bit: to make it comparable to gsqz, which keeps both the sequence name and the comment/remark data after it, I pasted those together separated by an underscore instead.

h5_import_read:  74Mb, 20.1s
gap5:            74Mb, 23.1s
bam:             84Mb, 15.8s

I couldn't be bothered testing decode times, so that's just encoding.
For gap5 I produced faked coordinates too, with each read placed 1bp further along than the last, so it offers true random access as well. BAM sort of handles this via the BGZF library, but faking up positions to allow samtools view to extract arbitrary chunks is also possible (and that boosted it to 87Mb).

All use zlib-style compression and, hardly surprisingly, BAM is comparable in size to gzip, while h5 and gap5 are comparable to each other (and close to gsqz) thanks to compressing in a more column-oriented fashion instead of mixing data types together.
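
The effect is easy to demonstrate on toy data: deflate the same records once interleaved per record and once concatenated per column.

import zlib

def interleaved_size(records):
    """Records mixed together record-by-record: name, seq, qual, name, ..."""
    return len(zlib.compress(b"".join(n + s + q for n, s, q in records)))

def columnar_size(records):
    """Each column (all names, all seqs, all quals) deflated as its own stream."""
    return sum(len(zlib.compress(b"".join(col))) for col in zip(*records))

# On realistic reads columnar_size() comes out noticeably smaller, because
# names, bases and qualities each have very different symbol statistics.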

I.e. nothing unexpected really: all seem workable solutions, although I'd like a BAM version 2 to consider column-oriented data compression, as it costs no CPU but saves considerable disc space.

James

PS. It's worth saying I do NOT feel this is an appropriate use of gap5, though! It's designed for editing aligned assemblies, not storing unaligned data. :-)
07-16-2010, 12:02 PM  #10
derobins

As far as BioHDF goes, compression-format indexes for random access don't really matter. HDF5 (the underlying storage of BioHDF) has its own internal indexing and data-chunking scheme, and its chunks are compressed independently, so any extra index maintained by the compression scheme is just wasted overhead in time and space. Most compression schemes can be plugged in with a little glue code, though we use zlib-style compression by default.
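
To make the chunking point concrete with plain h5py rather than BioHDF's own API (the dataset name and sizes below are made up): chunks are compressed independently, and a slice read decompresses only the chunks it touches.

import h5py

with h5py.File("reads.h5", "w") as f:
    f.create_dataset(
        "sequences",
        shape=(1000000,),
        dtype=h5py.string_dtype(),
        chunks=(10000,),          # 10k reads per independently-deflated chunk
        compression="gzip",       # zlib-style, the default filter we use
    )
    f["sequences"][0:3] = ["ACGT", "GGCA", "TTAA"]

with h5py.File("reads.h5", "r") as f:
    window = f["sequences"][500000:500100]    # touches a single chunk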

Regardless of any file-size savings, the slow decompression is troubling, especially if you are writing data-browser software, where slow seeks will be highly visible to the user.