SEQanswers > General > FastQ/BAM compression

tir_al 01-09-2013 08:05 AM

FastQ/BAM compression
Does anybody know of a more recent comparison of FASTQ/BAM compression algorithms than this thread?


GenoMax 01-09-2013 08:10 AM

CRAM is reference-based compression, so it may or may not be of interest to you.
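For anyone finding this thread later: reference-based means the encoder stores reads as differences against a reference FASTA, so you need that exact reference to decompress. The @SQ header lines tell you which reference the BAM was aligned against (the file name here is just an example):

```shell
# List the reference sequences (and, if present, UR:/M5: fields)
# that a reference-based compressor like CRAM would need
samtools view -H sample.bam | grep '^@SQ'
```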

tir_al 01-09-2013 08:34 AM

Thanks for the prompt answer!

I just tried CRAM today. The compression ratio is extremely impressive, but it's too slow for my needs.

winsettz 01-09-2013 02:20 PM


What is the compression ratio? Curious to see before I take a dive with my own data.

tir_al 01-09-2013 02:26 PM

I tried it on a paired-end BAM file of roughly 80 million 75 bp reads, and it crammed the 3.6 GB file into a 257 MB archive.

winsettz 01-09-2013 04:23 PM


Originally Posted by tir_al (Post 93300)
I tried it on a paired-end BAM file of roughly 80 million 75 bp reads, and it crammed the 3.6 GB file into a 257 MB archive.

Sounds like a promising way to store genomic data in the long run if indexed to hg19?

tir_al 01-09-2013 04:25 PM

Yeah. Preferably for storing old projects.

GenoMax 01-10-2013 06:18 AM


Originally Posted by tir_al (Post 93313)
Yeah. Preferably for storing old projects.

Are you sure the effort is going to be worthwhile compared to a plain old tar/gzip combination?

If you are looking at thousands of samples a year then perhaps it may be.

tir_al 01-10-2013 06:50 AM

I currently have no other option, and no room for more disks :)

bruce01 01-11-2013 02:41 AM

Hi all, I can't figure out how to specify lossless compression using cramtools (i.e. retain all quality score info); can someone help me out? In the NGC paper they state a few flags which seem to have been discontinued in 1.0. Presumably I specify it using --lossy-quality-score-spec, but I can't figure out how to set it to 'any/all'. I appreciate any help/ideas. Also, if I am missing the point and the compression inherently removes quality scores, I apologise in advance; I am new to the area =P

priesgo 01-16-2013 01:46 AM


I'm in the same situation as Bruce. I want to compress keeping the base call qualities but can't figure out how...

Just did a naive try:

--lossy-quality-score-spec all
and got:

Exception in thread "main" java.lang.RuntimeException: Uknown read or base category: a
        at net.sf.cram.lossy.QualityScorePreservation.parseSinglePolicy(

So apparently you can specify a read name for which to keep the qualities (not of use in my case, as I want to keep all of them) and a base category. But what is the base category? I also tried a numeric value in case it referred to an index in the read, but got a similar result.


bruce01 01-16-2013 02:44 AM

Hi Pablo,

Found it in the archives of the CRAM mailing list: the call includes -L m999 (the -L flag is your --lossy-quality-score-spec above). All reads are retained, but you lose columns 12+ ('info'). This isn't an issue for me, and the compressed CRAM file for a 1 GB BAM is 600 MB, which is pretty good!
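For later readers, the full invocation would look roughly like this. The jar name, file names, and long-option spellings are assumptions based on cramtools ~1.0; only the -L m999 part comes from the mailing-list answer above, so check the built-in help before relying on it:

```shell
# Hypothetical example: compress a BAM to CRAM while preserving
# quality scores for all reads (-L m999, per the CRAM mailing list)
java -jar cramtools-1.0.jar cram \
    --input-bam-file sample.bam \
    --reference-fasta-file hg19.fa \
    -L m999 \
    --output-cram-file sample.cram
```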

priesgo 01-16-2013 02:59 AM

Thanks Bruce,

It's running!
For columns 12+, I guess with the option --capture-tags you can keep tags such as the read group, which is usually important.


jkbonfield 02-04-2013 01:38 AM

I just noticed this thread, rather late.

There is CRAM from EBI, which has long-term support and handles random access. It's the most direct competitor to BAM, I would guess.

Alternatives are Goby (similar ratios, but even slower in my experience), Quip (faster encoding, great compression ratio, but no(?) random access) and SamComp1/2 (faster encoding, great compression ratio, no random access, and it doesn't really implement the full SAM spec - it's more of a FASTQ compressor). Finally, on that topic there are tools like Quip again, fqzcomp and fastqz for compression of FASTQ data. [All three of these were SequenceSqueeze competition entries.]

narain 05-14-2013 11:53 AM

But as I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM and introduces false positives and false negatives? Is CRAM mature enough now to do lossless compression of FASTQ and BAM files, with random access such as BAM files give?

jkbonfield 05-15-2013 01:33 AM

CRAM has both lossy and lossless modes. My own C library currently only supports lossless encoding (but can handle decoding of lossily encoded CRAM files). Vadim's Java implementation provides options for both lossy and lossless encoding.

As for maturity - I'd say it's pretty close now with CRAM v2.0. I'm biased of course[1], but try the latest Staden io_lib package and run the "scramble" command once built:

Approx 1Gb bam file:
jkb[/tmp] ls -l 6714_6#1.bam
-rw-r--r-- 1 jkb team117 977124408 Apr 23 10:20 6714_6#1.bam

Locally specified reference (scramble will use the UR:file: field or access the EBI's MD5 server to pull down the reference automatically; otherwise use -r to specify the .fa location). Redacted slightly because I've no idea if this is public data or not.
jkb[/tmp] samtools view -H 6714_6#1.bam | egrep '^@SQ'
@SQ SN:<...> LN:2892523 UR:file:/nfs/srpipe_references/references/<...> M5:76f500<...>

Convert to CRAM losslessly, 38% less disk space used:
jkb[/tmp] time ./io_lib-1.13.1/progs/scramble 6714_6#1.bam 6714_6#1.cram
real 2m37.763s
user 2m31.753s
sys 0m3.564s
jkb@deskpro102485[/tmp] ls -l 6714_6#1.cram
-rw-r--r-- 1 jkb team117 608320844 May 15 09:23 6714_6#1.cram

Convert back to BAM again. "-m" indicates to generate MD and NM tags:
jkb@deskpro102485[/tmp] time ./io_lib-1.13.1/progs/scramble -m 6714_6#1.cram 6714_6#1.cram.bam
real 3m10.728s
user 3m3.043s
sys 0m4.652s

I then compared the differences. There *are* some, but these are restricted to nonsensical things (CIGAR strings for unmapped data) or ambiguities in the SAM specification (what exactly does TLEN really mean? everyone deals with it differently - leftmost/rightmost vs 5' ends).

There's a comparison script in the io_lib tests subdirectory. It's not expected to be an end-user program, so it lacks documentation, but feel free to look at the source for the command line options. It needs SAM, not BAM, input.

Edit: running the script with -notemplate 6714_6#1.sam 6714_6#1.cram.sam got 9,899,053 lines into the SAM files before detecting the first difference (ignoring TLEN diffs), which was an unmapped read having MD:Z:72T2 NM:i:1 tags. The .cram.sam file didn't have these, as we auto-generate MD and NM on extraction but obviously cannot do this for unmapped reads. The difference was therefore due to a bug in the original aligner's output.

[1] Obviously Vadim's Java (and the original) CRAM implementation is available at

jkbonfield 05-15-2013 01:38 AM


Originally Posted by narain (Post 104645)
But as I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM and introduces false positives and false negatives? Is CRAM mature enough now to do lossless compression of FASTQ and BAM files, with random access such as BAM files give?

I forgot to add, CRAM supports random access too. I have a cram_index program to create .crai files, and scramble can then use these for random access. On a recent test, the total number of seek and read system calls for random access within a CRAM file actually turned out to be fewer than for the analogous BAM file.

This random access code hasn't been extensively tested yet, but it looks to be working in principle and is demonstrably efficient.
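Based on the description above, building the index should look like this (the file name is carried over from the earlier example; the .crai naming is an assumption from the post):

```shell
# Create the CRAM index; per the post above this writes a .crai file
# (presumably 6714_6#1.cram.crai) that scramble uses for ranged reads
cram_index 6714_6#1.cram
```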

Finally, in the long term my C CRAM implementation will end up in samtools and/or HTSlib. I already have a fork of samtools that provides CRAM reading and writing support, but only via the unified samopen() interface rather than the SAM-specific sam_open() or BAM-specific bam_open() calls. Practically speaking this means samtools view works, but samtools pileup does not (as pileup won't work on SAM either). These are issues that we will be addressing over the summer.

divon 07-22-2021 05:50 AM

You might want to try my program Genozip. It is often better than CRAM.

divon 07-23-2021 08:13 PM

Thanks Andrey for the question. A few points where I think Genozip provides some benefits over CRAM:

1. Similar to CRAM, Genozip compresses each field of the SAM/BAM data separately, with the best codec for the particular type of data applied to each field. However, Genozip goes beyond that, and also leverages correlations *between* fields to further eliminate information redundancies. As a result, the compressed file is about 20% smaller than CRAM (according to our benchmark in the paper).

2. Genozip is not specific to SAM data - it can compress FASTQ, VCF and other genomic formats.

3. It is able to compress & archive whole directories directly into a tar file, eg: genozip *.bam --tar mydata.tar

4. It is highly scalable with cores - it has been tested to scale up to 100+ cores.

5. Genozip can compress BAM with or without a reference file, while CRAM requires a reference file. Compressing with a reference file in Genozip improves the compression ratio, in particular for low-coverage data, but for high-coverage data (eg 30x) Genozip can reach almost the same compression ratio without a reference file.

6. Genozip, through the command genocat, provides some interesting capabilities. Some of them are similar to samtools, and some are unique - for example, directly filtering contamination out of a BAM file using kraken2 data.
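A quick sketch of the workflow described above. File names are examples, and the option spellings are my recollection of the Genozip CLI (only the --tar form comes from the post itself), so consult the documentation before use:

```shell
# Basic compression and round trip
genozip sample.bam                 # produces sample.bam.genozip
genounzip sample.bam.genozip       # recovers the original BAM

# Optional reference-based compression (better ratio for low coverage)
genozip --make-reference hg19.fa                  # build hg19.ref.genozip once
genozip --reference hg19.ref.genozip sample.bam

# Archive a whole directory into a tar file (point 3 above)
genozip *.bam --tar mydata.tar
```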

See the publication here:

And the software documentation here:



Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.