Old 07-23-2021, 08:13 PM   #19

Thanks for the question, Andrey. Here are a few points where I think Genozip offers benefits over CRAM:

1. Like CRAM, Genozip compresses each field of the SAM/BAM data separately, applying the codec best suited to that field's type of data. Genozip goes further, however, and also exploits correlations *between* fields to eliminate more redundancy. As a result, the compressed file is about 20% smaller than CRAM (according to the benchmark in our paper).

2. Genozip is not specific to SAM data - it can compress FASTQ, VCF and other genomic formats.

3. It can compress and archive whole directories directly into a tar file, e.g.: genozip *.bam --tar mydata.tar

4. It scales well with the number of cores: it has been tested to scale up to 100+ cores.

5. Genozip can compress BAM with or without a reference file, while CRAM requires a reference file. Compressing with a reference file in Genozip improves the compression ratio, in particular for low-coverage data, but for high-coverage data (e.g. 30x) Genozip can reach almost the same compression ratio without a reference file.

6. Genozip, through the genocat command, provides some interesting capabilities. Some are similar to samtools and some are unique, for example filtering contamination out of a BAM file directly using kraken2 data.
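
To make point 1 a bit more concrete, here is a toy Python sketch of the general idea of exploiting a correlation between fields. It is only an illustration, not Genozip's actual algorithm: the synthetic POS/PNEXT/TLEN records, the fixed read length of 150, the use of zlib, and all function names are assumptions made up for this example. It compresses each field on its own, then compresses TLEN as the difference from a value predicted from POS and PNEXT, and compares the sizes.

```python
import random
import zlib

READ_LEN = 150  # assumed fixed read length, for illustration only


def make_records(n, seed=0):
    """Generate toy (POS, PNEXT, TLEN) triples loosely resembling paired-end reads."""
    rng = random.Random(seed)
    pos = 10_000
    records = []
    for _ in range(n):
        pos += rng.randint(1, 50)             # coordinate-sorted positions
        pnext = pos + rng.randint(100, 400)   # mate starts somewhere downstream
        tlen = pnext - pos + READ_LEN         # TLEN is implied by the other two fields
        if rng.random() < 0.02:               # a few records deviate from the rule
            tlen += rng.randint(-5, 5)
        records.append((pos, pnext, tlen))
    return records


def pack(values):
    """Serialize integers as text, one per line, and compress with zlib."""
    text = "\n".join(str(v) for v in values).encode("ascii")
    return zlib.compress(text, 9)


def tlen_residuals(records):
    """Difference between the stored TLEN and the TLEN predicted from POS and PNEXT."""
    return [t - (pn - p + READ_LEN) for p, pn, t in records]


def restore_tlen(pos, pnext, residuals):
    """Invert tlen_residuals, showing that the transformation is lossless."""
    return [r + (pn - p + READ_LEN) for p, pn, r in zip(pos, pnext, residuals)]


def independent_size(records):
    """Total compressed size when each field is compressed on its own."""
    return sum(len(pack(column)) for column in zip(*records))


def correlated_size(records):
    """Total compressed size when TLEN is stored as a residual instead."""
    pos, pnext, _ = zip(*records)
    return len(pack(pos)) + len(pack(pnext)) + len(pack(tlen_residuals(records)))


if __name__ == "__main__":
    recs = make_records(5000)
    print("fields compressed independently:", independent_size(recs), "bytes")
    print("TLEN stored as residual:        ", correlated_size(recs), "bytes")
```

In this toy data the residual column is almost all zeros, so it compresses to very little, while the raw TLEN column looks close to random to a general-purpose codec. Real data and real codecs are of course more involved.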

See the publication here:

And the software documentation here:

Last edited by divon; 07-23-2021 at 08:27 PM.