
**jkbonfield** · 05-15-2013, 12:33 AM

CRAM has both lossy and lossless modes. My own C library currently supports only lossless encoding (but it can decode lossily encoded CRAM files). Vadim's Java implementation provides options for both lossy and lossless encoding.

As for maturity - I'd say it's pretty close now with CRAM v2.0. I'm biased of course[1], but try the latest Staden io_lib package and run the "scramble" command once built:


https://sourceforge.net/projects/staden/files/io_lib/1.13.1/


An approximately 1 GB BAM file:
jkb[/tmp] ls -l 6714_6#1.bam
-rw-r--r-- 1 jkb team117 977124408 Apr 23 10:20 6714_6#1.bam

The reference is specified locally. Scramble will use the UR:file: field, or access the EBI's MD5 server, to pull down the reference automatically; otherwise use -r to specify the .fa location. The output is redacted slightly because I've no idea whether this is public data or not.
jkb[/tmp] samtools view -H 6714_6#1.bam | egrep '^@SQ'
@SQ SN:<...> LN:2892523 UR:file:/nfs/srpipe_references/references/<...> M5:76f500<...>
<...>
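For anyone wondering what the M5 field is: as I understand the SAM specification, it is the MD5 digest of the reference sequence after removing whitespace and converting to upper case, which is what lets a tool look a reference up by checksum. A minimal Python sketch of that calculation follows; the function name `sam_m5` and the toy sequence are assumptions for illustration, not part of io_lib or the redacted reference above.

```python
import hashlib

def sam_m5(sequence: str) -> str:
    """MD5 digest in the style of the SAM @SQ M5 tag.

    Characters outside the printable ASCII range '!'..'~' (spaces,
    newlines and so on) are dropped, the rest is upper-cased, and the
    result is hashed.
    """
    cleaned = "".join(c for c in sequence if "!" <= c <= "~").upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()

# Toy sequence, purely illustrative.
print(sam_m5("acgt\nACGT\n"))
print(sam_m5("ACGTACGT"))  # same digest: case and line breaks are ignored
```

Because case and line wrapping are ignored, the same reference gives the same checksum however the FASTA file happens to be formatted.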

Convert to CRAM losslessly, using 38% less disk space:
jkb[/tmp] time ./io_lib-1.13.1/progs/scramble 6714_6#1.bam 6714_6#1.cram
real 2m37.763s
user 2m31.753s
sys 0m3.564s
jkb@deskpro102485[/tmp] ls -l 6714_6#1.cram
-rw-r--r-- 1 jkb team117 608320844 May 15 09:23 6714_6#1.cram
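The 38% figure quoted above follows directly from the two file sizes in the `ls -l` listings. A quick check in Python, using only the byte counts shown in this post:

```python
# File sizes in bytes, copied from the two `ls -l` listings above.
bam_size = 977124408
cram_size = 608320844

# Fraction of disk space saved by the CRAM file relative to the BAM file.
saving = 1 - cram_size / bam_size
print(f"CRAM is {saving:.1%} smaller than the BAM")
```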

Convert back to BAM again. The "-m" option tells scramble to generate MD and NM tags:
jkb@deskpro102485[/tmp] time ./io_lib-1.13.1/progs/scramble -m 6714_6#1.cram 6714_6#1.cram.bam
real 3m10.728s
user 3m3.043s
sys 0m4.652s

I then compared the differences. There *are* some, but these are restricted to nonsensical things (CIGAR strings for unmapped data) or ambiguities in the SAM specification (what exactly does TLEN really mean? everyone deals with it differently - leftmost/rightmost vs 5' ends).
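To make the TLEN ambiguity concrete, here is a small Python sketch of the two conventions mentioned above. It is only an illustration under my own assumptions: the `Read` class and both function names are hypothetical, coordinates are 1-based inclusive, and the sign of TLEN is ignored.

```python
from dataclasses import dataclass

@dataclass
class Read:
    start: int     # leftmost mapped base, 1-based inclusive
    end: int       # rightmost mapped base, 1-based inclusive
    reverse: bool  # True if aligned to the reverse strand

    @property
    def five_prime(self) -> int:
        # The 5' end of a reverse-strand read is its rightmost base.
        return self.end if self.reverse else self.start

def tlen_outer(a: Read, b: Read) -> int:
    """Unsigned span from the leftmost to the rightmost mapped base."""
    return max(a.end, b.end) - min(a.start, b.start) + 1

def tlen_five_prime(a: Read, b: Read) -> int:
    """Unsigned span between the 5' ends of the two reads."""
    return abs(a.five_prime - b.five_prime) + 1

# Typical inward-facing pair: both conventions agree.
fwd, rev = Read(100, 149, False), Read(200, 249, True)
print(tlen_outer(fwd, rev), tlen_five_prime(fwd, rev))  # 150 150

# Reverse read starting to the left of the forward read: they disagree.
fwd, rev = Read(100, 149, False), Read(90, 139, True)
print(tlen_outer(fwd, rev), tlen_five_prime(fwd, rev))  # 60 40
```

For ordinary inward-facing pairs the two definitions give the same number, so the disagreement only shows up on overlapping or oddly oriented pairs, which is why a round-trip comparison has to decide how to treat TLEN.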

There's a compare_sam.pl script in the io_lib tests subdirectory. It's not expected to be an end-user program so lacks documentation, but feel free to look at the source for the command line options. It needs SAM and not BAM.

Edit: running compare_sam.pl -notemplate 6714_6#1.sam 6714_6#1.cram.sam got 9899053 lines into the SAM files before detecting the first difference (ignoring TLEN diffs), which was an unmapped read carrying MD:Z:72T2 NM:i:1 tags. The .cram.sam file didn't have these because we auto-generate MD and NM on extraction, and obviously cannot do this for unmapped reads. The difference was therefore due to a bug in the original aligner output.
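The kind of comparison described here can be sketched in a few lines of Python. This is not how compare_sam.pl works internally; it is a minimal illustration, with hypothetical function names and made-up records, of ignoring TLEN and of treating MD/NM tags on unmapped reads as noise.

```python
TLEN_COL = 8         # 0-based index of TLEN among the 11 mandatory SAM columns
FLAG_UNMAPPED = 0x4  # SAM flag bit meaning "segment unmapped"

def normalise(sam_line: str, ignore_tlen: bool = True):
    """Split a SAM record into (mandatory fields, sorted tags) for comparison."""
    fields = sam_line.rstrip("\n").split("\t")
    core, tags = fields[:11], fields[11:]
    if ignore_tlen:
        core[TLEN_COL] = "*"
    if int(core[1]) & FLAG_UNMAPPED:
        # MD and NM describe an alignment, so they carry no meaning here.
        tags = [t for t in tags if not t.startswith(("MD:", "NM:"))]
    return core, sorted(tags)

def same_record(a: str, b: str) -> bool:
    return normalise(a) == normalise(b)

# Made-up unmapped read (flag 4), before and after a round trip.
original = "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tMD:Z:72T2\tNM:i:1"
roundtrip = "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
print(same_record(original, roundtrip))  # True
```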

[1] Obviously Vadim's Java (and the original) CRAM implementation is available at http://www.ebi.ac.uk/ena/about/cram_toolkit

**jkbonfield** · 05-15-2013, 12:38 AM

Originally posted by narain

But as I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM, and introduces false positives and false negatives? Is CRAM mature enough now to do lossless compression of FASTQ and BAM files, with random access such as BAM files give?

I forgot to add, CRAM supports random access too. I have a cram_index program to create .crai files, and scramble can then use these for random access. On a test I did recently, the total number of seek and read system calls for random access within a CRAM file turned out to be lower than on the analogous BAM file.

This random access code hasn't been extensively tested yet, but it looks to be working in principle and is demonstrably efficient.
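To give a feel for how an index makes random access cheap, here is a hedged Python sketch. It assumes the six-column, tab-separated index layout described in the CRAM specification (reference id, alignment start, alignment span, container byte offset, slice offset, slice size); a real .crai file is gzip-compressed, and the names and numbers below are invented for illustration rather than taken from cram_index.

```python
from typing import NamedTuple

class CraiEntry(NamedTuple):
    ref_id: int
    start: int             # 1-based alignment start covered by the slice
    span: int              # number of reference bases covered
    container_offset: int  # absolute byte offset of the container
    slice_offset: int      # slice offset relative to the container data
    slice_size: int        # slice size in bytes

def parse_crai(text: str) -> list:
    """Parse already-decompressed index text into entries."""
    return [CraiEntry(*map(int, line.split("\t")))
            for line in text.splitlines() if line.strip()]

def slices_overlapping(entries, ref_id: int, beg: int, end: int) -> list:
    """Entries whose covered range overlaps [beg, end] on ref_id."""
    return [e for e in entries
            if e.ref_id == ref_id
            and e.start <= end
            and e.start + e.span - 1 >= beg]

# Invented index contents: two slices on reference 0, one on reference 1.
index_text = (
    "0\t1\t10000\t1000\t200\t50000\n"
    "0\t10001\t10000\t51500\t200\t48000\n"
    "1\t1\t8000\t100000\t180\t30000\n"
)
hits = slices_overlapping(parse_crai(index_text), ref_id=0, beg=9500, end=10500)
print([e.container_offset for e in hits])  # [1000, 51500]
```

A reader only needs to seek to the listed container offsets and decode those slices, which is the reason a region query can get away with few seek and read calls.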

Finally, in the long term my C CRAM implementation will end up in samtools and/or HTSlib. I already have a fork of samtools that provides CRAM reading and writing support, but only via the samopen() unified interface rather than the SAM-specific sam_open() call or the BAM-specific bam_open() call. Practically speaking, this means samtools view works but samtools pileup does not (pileup won't work on SAM either). These are the issues that we will be addressing over the summer.

**divon** · 07-22-2021, 04:50 AM

You might want to try my program Genozip (www.genozip.com). It is often better than CRAM.

**divon** · 07-23-2021, 07:13 PM

Thanks Andrey for the question. A few points where I think Genozip provides some benefits over CRAM:

1. Similar to CRAM, Genozip compresses each field of the SAM/BAM data separately, with the best codec for the particular type of data applied to each field. However, Genozip goes beyond that, and also leverages correlations *between* fields to further eliminate information redundancies. As a result, the compressed file is about 20% smaller than CRAM (according to our benchmark in the paper).

2. Genozip is not specific to SAM data - it can compress FASTQ, VCF and other genomic formats.

3. It is able to compress & archive whole directories directly into a tar file, eg: genozip *.bam --tar mydata.tar

4. It is highly scalable with cores - it has been tested to scale up to 100+ cores.

5. Genozip can compress BAM with or without a reference file, while CRAM requires a reference file. Compressing with a reference file in Genozip improves the compression ratio, in particular for low-coverage data, but for high-coverage data (eg 30x) Genozip can reach almost the same compression ratio without a reference file.

6. Genozip, through the command genocat, provides some interesting capabilities. Some of them similar to samtools, and some unique - for example, directly filtering out contamination from a BAM file using kraken2 data.
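The idea in point 1, compressing each field separately, can be illustrated with a toy Python sketch. This is not Genozip's or CRAM's actual format: zlib stands in for the specialised codecs, the records are synthetic, and all function names are hypothetical.

```python
import random
import zlib

def compress_rows(records) -> bytes:
    """Compress the records as ordinary tab-separated text."""
    text = "\n".join("\t".join(r) for r in records)
    return zlib.compress(text.encode())

def compress_columns(records) -> list:
    """Compress each field (column) as its own stream."""
    return [zlib.compress("\n".join(col).encode()) for col in zip(*records)]

def decompress_columns(blobs) -> list:
    """Rebuild the original records from the per-field streams."""
    columns = [zlib.decompress(b).decode().split("\n") for b in blobs]
    return [list(row) for row in zip(*columns)]

# Synthetic SAM-like records: name, flag, reference, position, MAPQ, CIGAR, sequence.
random.seed(0)
records = [
    [f"read{i}", "99", "chr1", str(1000 + 3 * i), "60", "72M",
     "".join(random.choice("ACGT") for _ in range(72))]
    for i in range(2000)
]

blobs = compress_columns(records)
assert decompress_columns(blobs) == records  # lossless round trip
print("row-wise bytes:   ", len(compress_rows(records)))
print("column-wise bytes:", sum(len(b) for b in blobs))
```

Grouping like with like gives a general-purpose compressor more regular input, and a real format can go further by picking a different codec per field, or, as described above for Genozip, by modelling correlations between fields.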

See the publication here: https://www.researchgate.net/publica...ata_Compressor

And the software documentation here: https://www.genozip.com

**jindalashu434** · 08-02-2021, 04:48 AM

Thanks for the prompt answer!

**divon** · 12-08-2021, 03:59 AM

Some new Genozip benchmarks:


https://genozip.com/benchmarks.html

