SEQanswers

Old 02-18-2010, 06:49 AM   #1
jorgebm
Member
 
Location: Spain

Join Date: Feb 2010
Posts: 18
Default Recommendations for compressing images (TIFF)

Hello,

I have to store (back up) images from a GAIIx on an external device (for the moment a USB HD; in the future probably an LTO tape). The GAIIx produces uncompressed TIFF images, and I'm thinking of using the tiffcp Linux tool to perform LZW (lossless) compression first.
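
The kind of command I have in mind (if I'm reading the tiffcp man page correctly; the file names are just examples) is:

$ tiffcp -c lzw s_1_1_0001.tif s_1_1_0001_lzw.tif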

I'm an absolute newbie, so does anyone have any recommendations on this matter?

Thanks in advance.
jorgebm is offline   Reply With Quote
Old 02-18-2010, 07:24 AM   #2
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Put the images directory in a tarball and then use bzip2 (you'll get better compression ratios at a CPU cost). You can use pbzip2, which is a multithreaded implementation.

Something like this:

$ tar cf images.tar.bz2 --use-compress-program=pbzip2 images_dir_to_compress/

Also, you may want to compute the MD5 hash of the tarball and copy it over to your drive so you can later check for the integrity of the file.
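
For example (assuming GNU coreutils is available; the file name matches the tarball above):

$ md5sum images.tar.bz2 > images.tar.bz2.md5
$ # later, on the backup copy:
$ md5sum -c images.tar.bz2.md5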
__________________
-drd
drio is offline   Reply With Quote
Old 02-18-2010, 11:35 PM   #3
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

We looked at doing this, but compressing that much data takes a lot of compute power (especially for high compression levels) and the space savings aren't all that spectacular.

Highest level zip compression reduced file size by just under 30%. Bzip2 is better and reduced file size by 45%. I'd suggest compressing one lane to see how much of an impact it has on your server and then weigh that against the space savings.
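
A rough way to benchmark a single lane (the directory and file names below are just placeholders for however your run folder is laid out):

$ time tar cf - Images/L001 | bzip2 -9 > L001_images.tar.bz2
$ du -sh Images/L001 L001_images.tar.bz2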

If you're transferring data to an external drive you'll also need to look at what filesystem it's using. Many portable drives still use FAT32 for compatibility. If you create large archives you'll soon hit the 4GB limit on the size of individual files. Using a better filesystem will fix this, but you may have problems moving the data between different OSs.
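
If you are stuck with FAT32, one workaround (just a sketch using the standard split/cat tools) is to cut the archive into sub-4GB chunks and re-join them when you need to restore:

$ split -b 3900M images.tar.bz2 images.tar.bz2.part_
$ cat images.tar.bz2.part_* > images.tar.bz2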
simonandrews is offline   Reply With Quote
Old 02-19-2010, 09:51 AM   #4
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by simonandrews View Post
We looked at doing this, but compressing that much data takes a lot of compute power (especially for high compression levels) and the space savings aren't all that spectacular.

Highest level zip compression reduced file size by just under 30%. Bzip2 is better and reduced file size by 45%. I'd suggest compressing one lane to see how much of an impact it has on your server and then weigh that against the space savings.

If you're transferring data to an external drive you'll also need to look at what filesystem it's using. Many portable drives still use FAT32 for compatibility. If you create large archives you'll soon hit the 4GB limit on the size of individual files. Using a better filesystem will fix this, but you may have problems moving the data between different OSs.
Also consider the scenario where you must re-analyze the data. Some software tools (aligners etc.) can natively read in bz2 or gz files while others cannot. Decompression may be a pain if the compression format is not supported by your tools.
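
If a tool only accepts uncompressed input, you can usually stream the decompression on the fly instead of keeping an uncompressed copy on disk. A sketch (the aligner name and its option are hypothetical placeholders):

$ # "some_aligner" and its --reads option are made up for illustration
$ bzcat s_1_sequence.txt.bz2 | some_aligner --reads /dev/stdin > s_1_aligned.sam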
nilshomer is offline   Reply With Quote
Old 02-22-2010, 07:56 PM   #5
spenthil
Member
 
Location: San Francisco

Join Date: Sep 2009
Posts: 44
Default

Ocarina/NetApp (dedupe'd storage), if you have the money to throw around.
__________________
--
Senthil Palanisami
spenthil is offline   Reply With Quote
Old 02-23-2010, 12:09 AM   #6
jorgebm
Member
 
Location: Spain

Join Date: Feb 2010
Posts: 18
Default

I've picked up some useful ideas from your posts. Thank you.

To keep this thread clean, I'm going to open another one asking which storage technologies (tape, SAN, etc.) sequencing labs are using for image backup.
jorgebm is offline   Reply With Quote
Old 02-23-2010, 12:15 AM   #7
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by jorgebm View Post
I've picked up some useful ideas from your posts. Thank you.

To keep this thread clean, I'm going to open another one asking which storage technologies (tape, SAN, etc.) sequencing labs are using for image backup.
Have you considered deleting the images? SRA does not require them for submission, and the only program I can think of that uses images (beyond Illumina's basecaller) is the bisulfite alignment tool (Cokus et al., Nature 2008). It might be a good idea to buffer the data, for example by storing the last month's images, but beyond that is there a specific purpose?

Last edited by nilshomer; 02-23-2010 at 12:15 AM. Reason: clarity
nilshomer is offline   Reply With Quote
Old 02-23-2010, 12:32 AM   #8
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

I'd second the thought of just having a pipeline with just-in-time deletion. As you run out of space, remove the oldest images. That way you can keep data for as long as is feasible, pruning it only when you absolutely have to.
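
A minimal sketch of that idea (the path and the 30-day cut-off are purely illustrative; you would hook this into whatever monitors your free space):

$ # remove run image folders more than 30 days old
$ find /data/runs/*/Images -maxdepth 0 -type d -mtime +30 -exec rm -rf {} +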

If you really want to archive them, then you need to weigh up the cost of CPU time and/or the cost of commercial solutions (e.g. Ocarina) against tape costs. The images don't compress well simply because they contain a lot of noise. While it's possible to write dedicated compressors that model the signal versus the noise to try to improve the compression ratio, you still won't get particularly good compression unless you are willing to move to a lossy compression scheme, as the amount of noise is quite high.

The "ideal" lossy compression would start with a system that extracts as much of the signal as it can while leaving as much of the noise behind as possible, and then compresses that signal using the usual techniques. One could argue that a decent first stab at this tool already exists: the Illumina Firecrest program coupled with compression of its output (e.g. gzip, bzip2, or SRF). Furthermore, it is this data that the trace archives, SRA and ERA, want to receive.

James
jkbonfield is offline   Reply With Quote
Old 02-25-2010, 03:26 AM   #9
james hadfield
Moderator
Cambridge, UK
Community Forum
 
Location: Cambridge, UK

Join Date: Feb 2008
Posts: 221
Default

Hi Jorge,
I'd agree with the above posts about deleting images. Do you want to keep them for further analysis?
The latest SCS allows you to save all, or a subset of, tiles as thumbnails. These are really useful images for troubleshooting, and we have stopped keeping full images even for a few days. RTA deletes them once it has finished analysing the intensities.
James.
james hadfield is offline   Reply With Quote
Old 03-02-2010, 06:42 AM   #10
jorgebm
Member
 
Location: Spain

Join Date: Feb 2010
Posts: 18
Default

Quote:
Originally Posted by james hadfield View Post
Hi Jorge,
I'd agree with the above posts about deleting images. Do you want to keep them for further analysis?
The latest SCS allows you to save all, or a subset of, tiles as thumbnails. These are really useful images for troubleshooting, and we have stopped keeping full images even for a few days. RTA deletes them once it has finished analysing the intensities.
James.
Quote:
Originally Posted by nilshomer View Post
Have you considered deleting the images? SRA does not require them for submission, and the only program I can think of that uses images (beyond Illumina's basecaller) is the bisulfite alignment tool (Cokus et al., Nature 2008). It might be a good idea to buffer the data, for example by storing the last month's images, but beyond that is there a specific purpose?
Actually, the keep-images policy isn't my personal decision. We have only just started sequencing (we're about to perform our first run), and I think it's a conservative position. I suppose reality (time and money) will force us to adapt this policy to the available resources and needs.
jorgebm is offline   Reply With Quote
Old 03-02-2010, 07:15 AM   #11
jorgebm
Member
 
Location: Spain

Join Date: Feb 2010
Posts: 18
Default

Quote:
Originally Posted by jkbonfield View Post
I'd second the thought of just having a pipeline with just-in-time deletion. As you run out of space, remove the oldest images. That way you can keep data for as long as is feasible, pruning it only when you absolutely have to.

If you really want to archive them, then you need to weigh up the cost of CPU time and/or the cost of commercial solutions (e.g. Ocarina) against tape costs. The images don't compress well simply because they contain a lot of noise. While it's possible to write dedicated compressors that model the signal versus the noise to try to improve the compression ratio, you still won't get particularly good compression unless you are willing to move to a lossy compression scheme, as the amount of noise is quite high.

The "ideal" lossy compression would start with a system that extracts as much of the signal as it can while leaving as much of the noise behind as possible, and then compresses that signal using the usual techniques. One could argue that a decent first stab at this tool already exists: the Illumina Firecrest program coupled with compression of its output (e.g. gzip, bzip2, or SRF). Furthermore, it is this data that the trace archives, SRA and ERA, want to receive.

James
Your suggestion to delete old images when we run out of space is probably (really, surely) the most cost-effective solution, but as I said above, it's not my decision. I can make suggestions upwards, but I can't set the policy entirely on my own. For the moment, keeping images is a must; later we will probably have to adopt a more realistic (and pragmatic) policy.

In any case, all your comments are useful to me, as they help me learn the practices of more experienced groups.

Thanks to all of you.
jorgebm is offline   Reply With Quote
Old 03-02-2010, 11:55 AM   #12
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Quote:
Originally Posted by jorgebm View Post
Your suggestion to delete old images when we run out of space is probably (really, surely) the most cost-effective solution, but as I said above, it's not my decision. I can make suggestions upwards, but I can't set the policy entirely on my own. For the moment, keeping images is a must; later we will probably have to adopt a more realistic (and pragmatic) policy.

In any case, all your comments are useful to me, as they help me learn the practices of more experienced groups.

Thanks to all of you.
Keeping images, IMO, is not an option.
You should keep all the intensities (much smaller) until you have QC'd your run. After that you should remove them and keep only the raw reads + qualities in a single BAM file. You can also keep the Summary.(html|xml) files and the associated PNG plots.
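
For the reads + qualities, one option is Picard's FastqToSam to build an unaligned BAM (the exact invocation depends on your Picard version; file and sample names here are just examples):

$ java -jar picard.jar FastqToSam \
      FASTQ=s_1_1_sequence.fastq FASTQ2=s_1_2_sequence.fastq \
      OUTPUT=lane1_unaligned.bam SAMPLE_NAME=sample1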
__________________
-drd
drio is offline   Reply With Quote
Old 03-03-2010, 04:03 AM   #13
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Keeping images only makes sense if you plan to reanalyze them with other software (SWIFT or the upcoming next phred).
NGSfan is offline   Reply With Quote
Tags
backup, images, tiff
