12-07-2016, 11:35 AM   #12
Brian Bushnell

I ran some benchmarks on 100x NextSeq E. coli data to compare file sizes under various conditions (see the attached clump_size.png).

The chart shows file size in bytes. Clumpified data is almost as small as mapped, sorted data, but takes much less time to produce. The exact sizes were:
Code:
100x.fq.gz	360829483
clumped.fq.gz	251014934
That's a 30.4% reduction in file size. Note that this was NextSeq data without binned quality scores. When the quality scores are binned (as is the default for NextSeq), the reduction is even greater:

Code:
100x_binned.fq.gz	267955329
clumped_binned.fq.gz	161766626
...a 39.6% reduction. I don't recommend quality-score binning, though Clumpify does have the option of doing so (with the quantize flag).
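
The two percentages above can be rechecked from the byte counts. This is a minimal sketch, not part of the original benchmark; pct_reduction is a hypothetical helper name and is not a BBTools command.

```shell
# pct_reduction (hypothetical helper): percent size reduction from BEFORE to AFTER,
# both given in bytes, printed to one decimal place.
pct_reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_reduction 360829483 251014934   # 100x.fq.gz vs clumped.fq.gz
pct_reduction 267955329 161766626   # 100x_binned.fq.gz vs clumped_binned.fq.gz
```

With the sizes listed above, this should print 30.4 and 39.6.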



This is the script I used to generate these sizes and times (see also the attached clump_time.png):
Code:
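# Clumpify without reordering
time clumpify.sh in=100x.fq.gz out=clumped_noreorder.fq.gz
# Clumpify with the reorder flag
time clumpify.sh in=100x.fq.gz out=clumped.fq.gz reorder
# Clumpify with the Java heap limited to 1 GB
time clumpify.sh in=100x.fq.gz out=clumped_lowram.fq.gz -Xmx1g
# Clumpify with the reorder flag, bzip2 output
time clumpify.sh in=100x.fq.gz out=clumped.fq.bz2 reorder
# Recompress the original reads as bzip2 for comparison
time reformat.sh in=100x.fq.gz out=100x.fq.bz2
# Map to the reference, then run the generated bs.sh to produce the sorted BAM
time bbmap.sh in=100x.fq.gz ref=ecoli_K12.fa.gz out=mapped.bam bs=bs.sh; time sh bs.sh
# Convert the sorted BAM to each output format
reformat.sh in=mapped_sorted.bam out=sorted.fq.gz zl=6
reformat.sh in=mapped_sorted.bam out=sorted.sam.gz zl=6
reformat.sh in=mapped_sorted.bam out=sorted.fq.bz2 zl=6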
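
The commands above produce the files but do not print their sizes. One way to get a name-and-bytes listing in the same layout as the size tables earlier in this post is sketched below. It is an assumption about how such a listing could be made, not the method actually used here; list_sizes is a hypothetical helper.

```shell
# list_sizes (hypothetical helper): print "name<TAB>bytes" for each existing file.
list_sizes() {
    for f in "$@"; do
        [ -f "$f" ] || continue   # skip outputs that were not produced
        # wc -c pads its output with spaces on some systems, so strip them
        printf '%s\t%s\n' "$f" "$(wc -c < "$f" | tr -d ' ')"
    done
}

list_sizes 100x.fq.gz clumped.fq.gz
```

Counting bytes with wc -c avoids the differences in stat options between Linux and macOS.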
Attached Images
clump_size.png (15.5 KB)
clump_time.png (9.6 KB)