  • G-SQZ: new tool to compress and index NGS reads

http://www.ncbi.nlm.nih.gov/pubmed/20605925

    Bioinformatics. 2010 Jul 6. [Epub ahead of print]
    G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

    Tembe W, Lowey J, Suh E.

    Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.
    Abstract

    SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: http://public.tgen.org/sqz. Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site. CONTACT: Waibhav Tembe ([email protected]).
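    To make the core idea concrete (my own rough Python sketch, not the authors' code): treat each (base, quality) pair as a single symbol and Huffman-code the pair frequencies, so common pairs get the shortest bit strings.

    import heapq
    from collections import Counter

    def huffman_codes(freqs):
        # Build {symbol: bitstring}. Internal nodes are lists; leaves are
        # the (base, quality) tuples themselves.
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, tiebreak, [left, right]))
            tiebreak += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, list):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    reads = [("ACGT", "IIH#"), ("ACGA", "IIG#")]  # (sequence, quality)
    freqs = Counter(p for s, q in reads for p in zip(s, q))
    codes = huffman_codes(freqs)
    bits = "".join(codes[p] for p in zip(*reads[0]))  # first read, as bits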

    I am not affiliated with the author btw
    http://kevin-gattaca.blogspot.com/

  • #2
    Originally posted by KevinLam View Post
    http://www.ncbi.nlm.nih.gov/pubmed/20605925
    Bioinformatics. 2010 Jul 6. [Epub ahead of print]
    G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.
    [...]
    Here's how I would use it, how would you?

    This could be extremely useful as datasets get larger. For example, I first index (and compress) my reads. Then if I parallelize my alignment, say across many jobs, each job can read a non-overlapping range of sequence reads. The index scheme would allow a quick jump to the start of that range instead of reading through from the beginning (something like the sketch below).
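    A hypothetical sketch of that in Python (over an uncompressed fastq for simplicity; with a compressed format the index would map read numbers to compressed block offsets instead, and the granularity and function names here are made up):

    INDEX_EVERY = 1000  # one index entry per 1000 reads

    def build_offset_index(fastq_path):
        """Record the byte offset of every 1000th read."""
        index = []
        with open(fastq_path, "rb") as f:
            read_no = 0
            while True:
                pos = f.tell()
                if not f.readline():          # @name line; EOF if empty
                    break
                if read_no % INDEX_EVERY == 0:
                    index.append(pos)
                for _ in range(3):            # skip seq, '+', qual lines
                    f.readline()
                read_no += 1
        return index

    def reads_in_range(fastq_path, index, first, count):
        """Yield `count` 4-line records starting at read number `first`."""
        with open(fastq_path, "rb") as f:
            f.seek(index[first // INDEX_EVERY])         # jump near the range
            for _ in range((first % INDEX_EVERY) * 4):  # then walk to it
                f.readline()
            for _ in range(count):
                yield [f.readline() for _ in range(4)]

    Each parallel job would then just call reads_in_range with its own (first, count).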



    • #3
      One main problem I have with NGS reads is de novo assembly.
      I could potentially 'reduce' a mixed environmental sample to the reads from my organism of interest, by bfast / bwa alignment to the genome, then retrieve those reads to do de novo assembly again.
      If the reads are sufficiently small in number I could even use applications like phrap to do the assembly (roughly as in the sketch below).
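      A rough sketch of that read-reduction step in Python (plain-text SAM parsing, no external libraries; the file names are hypothetical):

      def mapped_read_names(sam_path):
          """Names of reads that aligned (SAM flag bit 4 clear = mapped)."""
          names = set()
          with open(sam_path) as sam:
              for line in sam:
                  if line.startswith("@"):              # skip headers
                      continue
                  qname, flag = line.split("\t", 2)[:2]
                  if not int(flag) & 4:
                      names.add(qname)
          return names

      def filter_fastq(fastq_path, names, out_path):
          """Write only the reads whose names were seen in the SAM."""
          with open(fastq_path) as fq, open(out_path, "w") as out:
              while True:
                  rec = [fq.readline() for _ in range(4)]
                  if not rec[0]:
                      break
                  if rec[0][1:].split()[0] in names:    # '@name ...' header
                      out.writelines(rec)

      filter_fastq("all.fastq", mapped_read_names("vs_genome.sam"), "subset.fastq")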
      http://kevin-gattaca.blogspot.com/



      • #4
        Did you experiment with range coders instead of Huffman? I've had some good results with these.

        I have a couple of homebrew noddy fastq compressors here, neither of which we actually use in anger. I don't have time to support this, but I can put the code on an ftp site if people want. Here's an example using SRR013951_2.fastq.gz downloaded from the EBI (I couldn't even figure out how to get a download from NCBI as it's permanently greyed out).


        jkb@seq1d[scratch/data] gzip -cd ~/scratch/data/SRR013951_2.fastq.gz | head -10000000 > 1mill.fastq
        jkb@seq1d[scratch/data] ls -l 1mill.fastq
        -rw-rw-r-- 1 jkb team117 499441178 Jul 15 09:57 1mill.fastq

        jkb@seq1d[scratch/data] time gzip < 1mill.fastq |wc -c
        216248482
        real 1m13.814s
        user 1m13.169s
        sys 0m0.764s

        jkb@seq1d[scratch/data] time bzip2 < 1mill.fastq
        182653950
        real 1m35.292s
        user 1m34.382s
        sys 0m0.920s

        jkb@seq1d[scratch/data] time ~/work/solexa/fastq2fqz < 1mill.fastq > 1mill.fqz_h;wc -c 1mill.fqz_h
        real 0m37.521s
        user 0m24.106s
        sys 0m1.784s
        170126704 1mill.fqz_h

        jkb@seq1d[scratch/data] time ~/ftp/rngcod13/fqzcomp < 1mill.fastq > 1mill.fqz_r; wc -c 1mill.fqz_r
        real 0m12.377s
        user 0m11.805s
        sys 0m0.544s
        167735961 1mill.fqz_r

        Here fastq2fqz uses Huffman encoding in blocks of 1 million sequences at a time. It encodes 3 parts per block: all the sequence names, all the sequences, and then all the qualities. I use my Huffman implementation from io_lib (compatible with deflate except without any LZ component).

        fqzcomp is built on top of Michael Schindler's range coder package. It had a block size of 1 million lines. It uses order-1 stats for encoding, and again separate streams for names, sequences and qualities (also for sequence length).
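        The block layout is roughly this, if anyone wants the gist of it (a toy Python version, with zlib standing in for the actual Huffman / range coders):

        import struct, zlib

        def compress_block(reads):
            """reads: list of (name, seq, qual) tuples for one block."""
            out = struct.pack("<I", len(reads))
            for i in range(3):  # stream 0 = names, 1 = seqs, 2 = quals
                z = zlib.compress("\n".join(r[i] for r in reads).encode())
                out += struct.pack("<I", len(z)) + z
            return out

        def decompress_block(buf):
            off, streams = 4, []  # skip the 4-byte read count
            for _ in range(3):
                n, = struct.unpack_from("<I", buf, off)
                off += 4
                streams.append(zlib.decompress(buf[off:off + n]).decode().split("\n"))
                off += n
            return list(zip(*streams))  # back to (name, seq, qual) tuples

        Keeping each data type in its own stream is what buys the compression: the entropy coder's stats aren't polluted by mixing names, bases and qualities.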

        Huffman is a lot quicker at decompression than a range coder though, so it wins out there. That small set took 5.5s to gunzip, 10.2s for fqz2fastq (Huffman) and 28.1s for defqzcomp (range coder).
        Last edited by jkbonfield; 07-15-2010, 01:16 AM.



        • #5
          @jkbonfield

          Can you quickly get a range of reads from your compressed file (a la bam)?



          • #6
            This was purely fastq compression - so unaligned data. However the theory was that it could still be made random-access by file offset. I just never implemented it (and probably won't, as it was nothing more than an intellectual exercise to see what's achievable).

            The idea of reading in a block of data, compressing it independently, and then outputting it again is that you can trivially keep a map of compressed to uncompressed block coordinates (exactly like bgzf does), and then any random position is a combination of load_block + decompress_block + seek into the decompressed data.
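            In sketch form (zlib as the codec and a plain list as the map; real bgzf packs this into virtual file offsets, and the details here are my own):

            import bisect, zlib

            def compress_with_map(data, bsize=65536):
                """Compress data in independent blocks, recording each block's
                (uncompressed start, compressed offset)."""
                comp, index, coff, upos = b"", [], 0, 0
                for i in range(0, len(data), bsize):
                    chunk = zlib.compress(data[i:i + bsize])
                    index.append((upos, coff))
                    comp += chunk
                    coff += len(chunk)
                    upos += len(data[i:i + bsize])
                return comp, index

            def read_at(comp, index, pos, length):
                """Random access: load_block + decompress_block + seek."""
                starts = [u for u, _ in index]
                blk = bisect.bisect_right(starts, pos) - 1  # block holding pos
                out = b""
                while len(out) < length and blk < len(index):
                    ustart, coff = index[blk]
                    cend = index[blk + 1][1] if blk + 1 < len(index) else len(comp)
                    chunk = zlib.decompress(comp[coff:cend])
                    out += chunk[pos + len(out) - ustart:][:length - len(out)]
                    blk += 1
                return out

            So only the blocks overlapping the requested range ever get decompressed.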

            My block sizes were somewhat huge, although even with 10k seqs per block it doesn't do that badly. I should probably tidy it up and punt it to an ftp site so at least people can experiment with it and pick it up to work on if they're bothered.



            • #7
              I obtained both the g-squeeze software and one of their data files, and what the paper manages to gloss over is speed. It's SLOOOW. So much so that I feel somewhat mean publishing these results, especially somewhere the authors may not be reading!

              However I did wonder about this when reading the paper - by their own tests they show it's better than gzip and almost as good as bzip2. That left me wondering: why not just add an index to bzip2 (it's block-based already) and use that instead?

              1 million sequences from SRR013951_2.filt.fastq

              Orig: 198Mb
              Gzip: 85Mb, 23.3s, 3.0s
              Bzip2: 71Mb, 28.2s, 14.1s
              gsqz: 71Mb, 465.9s, 223.9s
              fqzcomp: 66Mb, 3.7s, 8.8s

              Times there are compression time followed by decompression time. Gzip is clearly still the winner for sheer speed of decompression, which is something LZ encoders have always excelled at.

              Where gsqz wins is that it can do semi-random access by seeking to a block of data as it keeps an index. Personally I think it would be better adding that feature to another format.

              Combining sequence and quality together to form a single symbol is an interesting concept, although now I think about it I suspect it amounts to maq compression, given that maq used 2 bits of base, 6 bits of quality, and shoved the lot through zlib. (gsqz can cope with more than 2 bits of base though, including SOLiD and ambiguity codes.)
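              i.e. something like this (rough Python; the exact bit layout maq uses may differ):

              import zlib

              BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

              def pack(seq, quals):
                  # one byte per call: 2 bits of base (high) + 6 bits of quality
                  return bytes((BASE[b] << 6) | min(q, 63)
                               for b, q in zip(seq, quals))

              reads = [("ACGT", [30, 30, 28, 2]), ("ACGA", [30, 29, 28, 2])]
              packed = zlib.compress(b"".join(pack(s, q) for s, q in reads))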



              • #8
                Instead of adding an index to bzip2, one could also put bzip2 in front of BioHDF.



                • #9
                  Well, as you mention bioHDF, I gave it a whirl.

                  I had to hack the fastq format a bit: to make it comparable to gsqz, which keeps both the sequence name and the comment/remark data after it, I pasted those together separated by an underscore instead.

                  h5_import_read, 74Mb, 20.1s
                  gap5, 74Mb, 23.1s
                  bam, 84Mb, 15.8s

                  I couldn't be bothered testing decode times, so that's just encoding.
                  For gap5 I produced faked coordinates too, with each read being 1bp further along than the last. Therefore it offers true random access as well. Bam sort of handles this via the bgzf library, but faking up positions to allow samtools view to extract arbitrary chunks is also possible (and boosted it to 87Mb).

                  All use zlib-style compression and, hardly surprisingly, bam is comparable in size to gzip while h5 and gap5 are comparable to each other (and close to gsqz) due to more column-oriented compression instead of mixing data types together.

                  Ie nothing unexpected really - all seem workable solutions, although I'd like a Bam version 2 to consider column-oriented data compression instead, as it costs no CPU but saves considerable disc space.

                  James

                  PS. It's worth saying I do NOT feel this is appropriate use of gap5 though! It's designed for editing aligned assemblies, not storing unaligned data. :-)



                  • #10
                    As far as BioHDF goes, compression-format indexes for random access don't really matter. HDF5 (the underlying storage of BioHDF) has its own internal indexing and data chunking scheme, and those chunks are compressed independently, so any extra index maintained by the compression scheme is just wasted overhead in time and space. Most compression schemes can be used with a little bit of glue code, though we use zlib-style compression by default.
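                    For instance, in h5py terms (a hypothetical dataset, real API; BioHDF's actual schema will differ):

                    import h5py
                    import numpy as np

                    quals = np.random.randint(0, 40, size=1_000_000).astype("u1")
                    with h5py.File("reads.h5", "w") as f:
                        # each 64k-element chunk is deflated independently; HDF5's
                        # own B-tree maps dataset coordinates to chunks on disk
                        f.create_dataset("qual", data=quals, chunks=(65536,),
                                         compression="gzip", compression_opts=6)

                    with h5py.File("reads.h5", "r") as f:
                        # reading a slice decompresses only the chunks it overlaps
                        window = f["qual"][500_000:500_100]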

                    Regardless of any file size savings, the slow decompression is troubling, especially if you are writing data browser software where slow seeks will be highly visible to the user.

