View Single Post
Old 05-15-2013, 01:33 AM   #16
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

CRAM has both lossy and lossless modes. My own C library currently only supports lossless encoding (but can handle decoding of lossily encoded CRAM files). Vadim's Java provides options for both lossy and lossless encoding.

As for maturity - I'd say it's pretty close now with CRAM v2.0. I'm biased of course[1], but try the latest Staden io_lib package and run the "scramble" command once built:

https://sourceforge.net/projects/sta...io_lib/1.13.1/

Approx 1Gb bam file:
jkb[/tmp] ls -l 6714_6#1.bam
-rw-r--r-- 1 jkb team117 977124408 Apr 23 10:20 6714_6#1.bam

Locally specified reference (scramble will use the UR:file: field or access the EBI's MD5 server to pull down the reference automatically; otherwise use -r to specify the .fa location). Redacted slightly because I've no idea if this is public data or not.
jkb[/tmp] samtools view -H 6714_6#1.bam | egrep '^@SQ'
@SQ SN:<...> LN:2892523 UR:file:/nfs/srpipe_references/references/<...> M5:76f500<...>
<...>

Convert to CRAM losslessly, 38% less disk space used:
jkb[/tmp] time ./io_lib-1.13.1/progs/scramble 6714_6#1.bam 6714_6#1.cram
real 2m37.763s
user 2m31.753s
sys 0m3.564s
jkb@deskpro102485[/tmp] ls -l 6714_6#1.cram
-rw-r--r-- 1 jkb team117 608320844 May 15 09:23 6714_6#1.cram

Convert back to BAM again. "-m" indicates to generate MD and NM tags:
jkb@deskpro102485[/tmp] time ./io_lib-1.13.1/progs/scramble -m 6714_6#1.cram 6714_6#1.cram.bam
real 3m10.728s
user 3m3.043s
sys 0m4.652s

I then compared the differences. There *are* some, but these are restricted to nonsensical things (CIGAR strings for unmapped data) or ambiguities in the SAM specification (what exactly does TLEN really mean? everyone deals with it differently - leftmost/rightmost vs 5' ends).

There's a compare_sam.pl script in the io_lib tests subdirectory. It's not expected to be an end-user program so lacks documentation, but feel free to look at the source for the command line options. It needs SAM and not BAM.

Edit: running compare_sam.pl -notemplate 6714_6#1.sam 6714_6#1.cram.sam got 9899053 lines into the SAM files before detecting the first difference (ignoring TLEN diffs), which was an unmapped read having MD:Z:72T2 NM:i:1 tags. The .cram.sam file didn't have these as we auto-generate MD and NM on extraction, but obviously cannot do this for unmapped files. The difference was therefore due to a bug in the original aligner output.

[1] Obviously Vadim's Java (and the original) CRAM implementation is available at http://www.ebi.ac.uk/ena/about/cram_toolkit

Last edited by jkbonfield; 05-15-2013 at 01:42 AM.
jkbonfield is offline   Reply With Quote