Old 01-09-2013, 07:05 AM   #1
tir_al
Member
 
Location: Croatia

Join Date: Sep 2010
Posts: 22
Default FastQ/BAM compression

Does anybody know of a more recent comparison of algorithms for FASTQ/BAM compression than this thread?
http://seqanswers.com/forums/showthread.php?t=6349

Best
Old 01-09-2013, 07:10 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,125
Default

CRAM is reference-based compression, so it may or may not be of interest to you. http://www.ebi.ac.uk/ena/about/cram_toolkit/
Old 01-09-2013, 07:34 AM   #3
tir_al
Member
 
Location: Croatia

Join Date: Sep 2010
Posts: 22
Default

Thanks for the prompt answer!

I just tried CRAM today. The compression ratio is extremely impressive, but it's too slow for my needs.
Old 01-09-2013, 01:20 PM   #4
winsettz
Member
 
Location: US

Join Date: Sep 2012
Posts: 91
Default

tir_al,

What is the compression ratio? Curious to see before I take a dive with my own data.
Old 01-09-2013, 01:26 PM   #5
tir_al
Member
 
Location: Croatia

Join Date: Sep 2010
Posts: 22
Default

I tried it on a paired-end BAM file of circa 80 million 75 bp reads, and it crammed the 3.6 GB file into a 257 MB archive (roughly 14:1).
Old 01-09-2013, 03:23 PM   #6
winsettz
Member
 
Location: US

Join Date: Sep 2012
Posts: 91
Default

Quote:
Originally Posted by tir_al
I tried it on a paired-end BAM file of circa 80 million 75 bp reads, and it crammed the 3.6 GB file into a 257 MB archive (roughly 14:1).
Sounds like a promising way to store genomic data in the long run, if compressed against hg19?
Old 01-09-2013, 03:25 PM   #7
tir_al
Member
 
Location: Croatia

Join Date: Sep 2010
Posts: 22
Default

Yeah. Preferably for storing old projects.
Old 01-10-2013, 05:18 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,125
Default

Quote:
Originally Posted by tir_al
Yeah. Preferably for storing old projects.
Are you sure the effort is going to be worthwhile compared with the plain old tar/gzip combination?

If you are looking at thousands of samples a year, then perhaps it may be.
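By that I mean a baseline of something like this (a sketch; the directory name is a placeholder):
Code:
# Plain tar/gzip baseline for archiving a finished project directory.
tar -czf old_project.tar.gz old_project/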
Old 01-10-2013, 05:50 AM   #9
tir_al
Member
 
Location: Croatia

Join Date: Sep 2010
Posts: 22
Default

I currently have no other option, and no room for new disk space.
Old 01-11-2013, 01:41 AM   #10
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi all, I can't figure out how to specify lossless compression using cramtools (i.e. retain all quality score info); can someone help me out? In the NGC paper they state a few flags which seem to have been discontinued in 1.0. Presumably I specify it using --lossy-quality-score-spec, but I can't figure out how to set it to 'any/all'. I appreciate any help/ideas on this. Also, if I am missing the point and the compression inherently removes quality scores, I apologise in advance - I'm a n00b to the area =P

Last edited by bruce01; 01-11-2013 at 01:55 AM.
Old 01-16-2013, 12:46 AM   #11
priesgo
Member
 
Location: Spain

Join Date: Aug 2012
Posts: 22
Default

Hi,

I'm in the same situation as Bruce. I want to compress while keeping the base-call qualities, but can't figure out how...

Just made a naive attempt:
Code:
--lossy-quality-score-spec all
and got:
Code:
Exception in thread "main" java.lang.RuntimeException: Uknown read or base category: a
        at net.sf.cram.lossy.QualityScorePreservation.parseSinglePolicy(QualityScorePreservation.java:138)
So apparently you can specify a read name to keep the quality for (not of use in my case, as I want to keep all of them) and a base category. But what is the base category? I also tried a numeric value, in case it referred to an index in the read, but with a similar result.


Thanks!
Pablo.
Old 01-16-2013, 01:44 AM   #12
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi Pablo,

Found it in the archives of the CRAM mailing list; the call includes -L m999 (the -L flag is your --lossy-quality-score-spec above). All reads are retained, but you lose columns 12+ (the 'info' tags). This isn't an issue for me, and the compressed CRAM file for a 1 GB BAM is 600 MB, which is pretty good!
Old 01-16-2013, 01:59 AM   #13
priesgo
Member
 
Location: Spain

Join Date: Aug 2012
Posts: 22
Default

Thanks Bruce,

It's running!
For columns 12+, I guess the --capture-tags option lets you keep tags such as the read group, which is usually important.
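Putting the two together, the full call would look something like this (a sketch only: the jar name, file names and tag list are placeholders, so check the usage output of your cramtools version):
Code:
# Sketch: BAM -> CRAM keeping all quality scores (-L m999) and the
# read-group tag (--capture-tags RG). Names are placeholders.
java -jar cramtools-1.0.jar cram \
    --input-bam-file sample.bam \
    --reference-fasta-file hg19.fa \
    --output-cram-file sample.cram \
    -L m999 \
    --capture-tags RG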


Regards,
Pablo.
Old 02-04-2013, 12:38 AM   #14
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

I just noticed this thread, rather late.

There is CRAM from EBI, which has long-term support and handles random access. It's the most direct competitor to BAM, I would guess.

Alternatives are Goby (similar ratios, but even slower in my experience), Quip (faster encoding, great compression ratio, but no(?) random access) and SamComp1/2 (faster encoding, great compression ratio, no random access, and it doesn't really implement the full SAM spec - it's more of a FASTQ compressor). Finally, on that topic there are tools like Quip again, fqzcomp and fastqz for compression of FASTQ data. [All three of these were SequenceSqueeze competition entries.]
Old 05-14-2013, 10:53 AM   #15
narain
Member
 
Location: Washington DC

Join Date: Aug 2011
Posts: 78
Default

As I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM and introduces false positives and false negatives? Is CRAM now mature enough to do lossless compression of FASTQ and BAM files, with random access such as BAM files give?
Old 05-15-2013, 12:33 AM   #16
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

CRAM has both lossy and lossless modes. My own C library currently only supports lossless encoding (but can handle decoding of lossily encoded CRAM files). Vadim's Java implementation provides options for both lossy and lossless encoding.

As for maturity, I'd say it's pretty close now with CRAM v2.0. I'm biased of course [1], but try the latest Staden io_lib package and run the "scramble" command once built:

https://sourceforge.net/projects/sta...io_lib/1.13.1/

An approx 1 GB BAM file:
Code:
jkb[/tmp] ls -l 6714_6#1.bam
-rw-r--r-- 1 jkb team117 977124408 Apr 23 10:20 6714_6#1.bam

Locally specified reference (scramble will use the UR:file: field or access the EBI's MD5 server to pull down the reference automatically; otherwise use -r to specify the .fa location). Redacted slightly because I've no idea whether this is public data or not.
Code:
jkb[/tmp] samtools view -H 6714_6#1.bam | egrep '^@SQ'
@SQ SN:<...> LN:2892523 UR:file:/nfs/srpipe_references/references/<...> M5:76f500<...>
<...>

Convert to CRAM losslessly; 38% less disk space used:
Code:
jkb[/tmp] time ./io_lib-1.13.1/progs/scramble 6714_6#1.bam 6714_6#1.cram
real 2m37.763s
user 2m31.753s
sys 0m3.564s
jkb@deskpro102485[/tmp] ls -l 6714_6#1.cram
-rw-r--r-- 1 jkb team117 608320844 May 15 09:23 6714_6#1.cram

Convert back to BAM again; "-m" tells scramble to generate MD and NM tags:
Code:
jkb@deskpro102485[/tmp] time ./io_lib-1.13.1/progs/scramble -m 6714_6#1.cram 6714_6#1.cram.bam
real 3m10.728s
user 3m3.043s
sys 0m4.652s

I then compared the two BAMs. There *are* some differences, but they are restricted to nonsensical things (CIGAR strings for unmapped data) or ambiguities in the SAM specification (what exactly does TLEN really mean? Everyone deals with it differently: leftmost/rightmost vs 5' ends).

There's a compare_sam.pl script in the io_lib tests subdirectory. It's not expected to be an end-user program so lacks documentation, but feel free to look at the source for the command line options. It needs SAM and not BAM.
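In case it's useful, the comparison workflow is roughly this (a sketch; the tests path is assumed from the description above):
Code:
# compare_sam.pl needs SAM, not BAM, so convert both files first.
samtools view -h 6714_6#1.bam > 6714_6#1.sam
samtools view -h 6714_6#1.cram.bam > 6714_6#1.cram.sam
perl io_lib-1.13.1/tests/compare_sam.pl -notemplate 6714_6#1.sam 6714_6#1.cram.sam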

Edit: running compare_sam.pl -notemplate 6714_6#1.sam 6714_6#1.cram.sam got 9899053 lines into the SAM files before detecting the first difference (ignoring TLEN diffs), which was an unmapped read having MD:Z:72T2 NM:i:1 tags. The .cram.sam file didn't have these, as we auto-generate MD and NM on extraction but obviously cannot do this for unmapped reads. The difference was therefore due to a bug in the original aligner's output.

[1] Obviously Vadim's Java (and the original) CRAM implementation is available at http://www.ebi.ac.uk/ena/about/cram_toolkit

Last edited by jkbonfield; 05-15-2013 at 12:42 AM.
Old 05-15-2013, 12:38 AM   #17
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Quote:
Originally Posted by narain
As I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM and introduces false positives and false negatives? Is CRAM now mature enough to do lossless compression of FASTQ and BAM files, with random access such as BAM files give?
I forgot to add: CRAM supports random access too. I have a cram_index program to create .crai files, and scramble can then use these for random access. On a test I did recently, the total number of seek and read system calls for random access within a CRAM file turned out to be fewer than for the analogous BAM file.
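For example, with the file from my previous post (a sketch):
Code:
# Build the .crai index; scramble can then use it for random access.
cram_index 6714_6#1.cram     # writes 6714_6#1.cram.crai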

This random access code hasn't been extensively tested yet, but it looks to be working in principle and is demonstrably efficient.

Finally, long term, my C CRAM implementation will end up in samtools and/or HTSlib. I already have a fork of samtools that provides CRAM reading and writing support, but only via the samopen() unified interface rather than the SAM-specific sam_open() call or the BAM-specific bam_open() call. Practically speaking, this means samtools view works, but samtools pileup does not (as pileup won't work on SAM either). These are the issues that we will be addressing over the summer.
Old 07-22-2021, 04:50 AM   #18
divon
Junior Member
 
Location: Australia

Join Date: Jul 2021
Posts: 8
Default

You might want to try my program Genozip (www.genozip.com). It often compresses better than CRAM does.
Old 07-23-2021, 07:13 PM   #19
divon
Junior Member
 
Location: Australia

Join Date: Jul 2021
Posts: 8
Default

Thanks, Andrey, for the question. A few points where I think Genozip provides benefits over CRAM:

1. Similar to CRAM, Genozip compresses each field of the SAM/BAM data separately, with the best codec for that particular type of data applied to each field. However, Genozip goes beyond that and also leverages correlations *between* fields to further eliminate redundant information. As a result, the compressed file is about 20% smaller than CRAM's (according to the benchmark in our paper).

2. Genozip is not specific to SAM data - it can compress FASTQ, VCF and other genomic formats.

3. It can compress and archive whole directories directly into a tar file, e.g.: genozip *.bam --tar mydata.tar

4. It is highly scalable with cores - it has been tested to scale up to 100+ cores.

5. Genozip can compress BAM with or without a reference file, while CRAM requires one. Compressing with a reference improves Genozip's compression ratio, particularly for low-coverage data, but for high-coverage data (e.g. 30x) Genozip reaches almost the same ratio without a reference (see the sketch after this list).

6. Genozip, through the genocat command, provides some interesting capabilities. Some are similar to samtools, and some are unique - for example, directly filtering contamination out of a BAM file using kraken2 data.
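For illustration, a minimal session looks something like this (a sketch based on the documentation; exact flags and file names may differ between versions):
Code:
# Sketch: compress a BAM with and without a reference; names are placeholders.
genozip sample.bam                                # -> sample.bam.genozip
genozip --make-reference hg19.fa                  # one-time: builds hg19.ref.genozip
genozip --reference hg19.ref.genozip sample.bam   # reference-based compression
genounzip sample.bam.genozip                      # reconstructs sample.bam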

See the publication here: https://www.researchgate.net/publica...ata_Compressor

And the software documentation here: https://www.genozip.com

Last edited by divon; 07-23-2021 at 07:27 PM.