  • Recommendations for compressing images (TIFF)

    Hello,

    I have to store (back up) images from a GAIIx on an external device (for the moment a USB HD, in future probably LTO tape). The GAIIx produces uncompressed TIFF images, and I'm thinking of using the tiffcp Linux tool to apply LZW (lossless) compression first.

    I'm an absolute newbie, so does anyone have any recommendations for me on this matter?

    Thanks in advance.

  • #2
    Group the images dir into a tarball and then use bzip2 (you'll get better compression ratios, at the cost of CPU time). You can use pbzip2, which is a multithreaded implementation.

    Something like this:

    $ tar cf images.tar.bz2 --use-compress-program=pbzip2 images_dir_to_compress/

    Also, you may want to compute the MD5 hash of the tarball and copy it over to your drive so you can later check for the integrity of the file.
    -drd
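A minimal sketch of the checksum step, assuming bash and GNU coreutils `md5sum`. The helper names `write_md5` and `verify_md5` are made up for illustration, and `images.tar.bz2` is just the tarball name from the command above.

```shell
#!/usr/bin/env bash
# write_md5 FILE: store the MD5 checksum of FILE next to it as FILE.md5.
# The checksum is recorded with a bare file name so that it still verifies
# after both files have been copied to another drive.
write_md5() {
    ( cd "$(dirname "$1")" && md5sum "$(basename "$1")" > "$(basename "$1").md5" )
}

# verify_md5 FILE: re-check FILE against FILE.md5.
# Returns non-zero if the file no longer matches the recorded checksum.
verify_md5() {
    ( cd "$(dirname "$1")" && md5sum -c --quiet "$(basename "$1").md5" )
}
```

Typical use would be `write_md5 images.tar.bz2` before the copy, then `verify_md5 /path/on/backup/images.tar.bz2` once both files are on the backup drive (the backup path is a placeholder).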


    • #3
      We looked at doing this, but compressing that much data takes a lot of compute power (especially for high compression levels) and the space savings aren't all that spectacular.

      Highest level zip compression reduced file size by just under 30%. Bzip2 is better and reduced file size by 45%. I'd suggest compressing one lane to see how much of an impact it has on your server and then weigh that against the space savings.
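One way to run that single-lane trial, as a rough sketch only: it assumes bash with `gzip` and `bzip2` installed, and the function name `report_savings` is invented here. It does not reproduce the figures quoted above; it just measures your own data.

```shell
#!/usr/bin/env bash
# report_savings FILE: print the percentage size reduction that gzip -9 and
# bzip2 -9 achieve on FILE. FILE is read only and must not be empty.
report_savings() {
    local file=$1 orig size tool
    orig=$(wc -c < "$file")
    for tool in gzip bzip2; do
        # Compress to stdout and count bytes, so nothing is written to disk.
        size=$("$tool" -9 -c "$file" | wc -c)
        echo "$tool: $(( 100 - 100 * size / orig ))% smaller"
    done
}
```

For example, tar one lane's image directory into an uncompressed tarball and run `time report_savings lane1.tar` (a placeholder name) to see both the CPU cost and the savings on your own server.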

      If you're transferring data to an external drive you'll also need to look at what filesystem it's using. Many portable drives still use fat32 for compatibility. If you create large archives you'll soon hit the 4GB limit for the size of individual files. Using a better filesystem will fix this, but you may have problems moving the data between different OSs.
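If you have to stay on FAT32, one workaround is to split the archive into pieces below the 4GB limit and join them again when restoring. A small sketch, assuming bash and the standard `split` and `cat` tools; the helper names and the 3900m default chunk size are my own choices, not anything from the posts above.

```shell
#!/usr/bin/env bash
# split_archive FILE [CHUNK]: cut FILE into FILE.part_aa, FILE.part_ab, ...
# CHUNK is passed to `split -b`; the default keeps every piece under 4GB.
split_archive() {
    split -b "${2:-3900m}" "$1" "$1.part_"
}

# join_archive FILE: rebuild FILE from its FILE.part_* pieces.
# The shell expands the glob in sorted order, which matches split's suffixes.
join_archive() {
    cat "$1".part_* > "$1"
}
```

Running `split_archive images.tar.bz2` before copying and `join_archive images.tar.bz2` after copying the pieces back should give a byte-identical file, which a checksum can confirm.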


      • #4
        Originally posted by simonandrews
        We looked at doing this, but compressing that much data takes a lot of compute power (especially for high compression levels) and the space savings aren't all that spectacular.

        Highest level zip compression reduced file size by just under 30%. Bzip2 is better and reduced file size by 45%. I'd suggest compressing one lane to see how much of an impact it has on your server and then weigh that against the space savings.

        If you're transferring data to an external drive you'll also need to look at what filesystem it's using. Many portable drives still use fat32 for compatibility. If you create large archives you'll soon hit the 4GB limit for the size of individual files. Using a better filesystem will fix this, but you may have problems moving the data between different OSs.
        Also consider the scenario where you must re-analyze the data. Some software tools (aligners etc.) can natively read in bz2 or gz files while others cannot. Decompression may be a pain if the compression format is not supported by your tools.
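For tools that cannot open bz2 files themselves but can read from standard input, decompressing on the fly avoids writing an uncompressed copy to disk. A minimal sketch, assuming bash and `bzip2`; `stream_bz2` is a hypothetical helper name, and whether a given tool accepts input on stdin is something to check in its own documentation.

```shell
#!/usr/bin/env bash
# stream_bz2 FILE CMD [ARGS...]: decompress FILE to stdout and pipe the
# result into CMD, leaving the compressed file untouched.
stream_bz2() {
    local file=$1
    shift
    bzip2 -dc "$file" | "$@"
}
```

For instance, `stream_bz2 reads.txt.bz2 wc -l` counts the lines of a compressed file (the file name is a placeholder).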


        • #5
          Ocarina/NetApp (dedupe'd storage), if you have the money to throw around.
          --
          Senthil Palanisami


          • #6
            I've fetched some useful ideas from your posts. Thank you.

            To keep this thread clean, I'm going to open another one asking which storage technologies (tape, SAN, etc.) sequencing labs are using for image backup.


            • #7
              Originally posted by jorgebm
              I've fetched some useful ideas from your posts. Thank you.

              To keep this thread clean, I'm going to open another one asking which storage technologies (tape, SAN, etc.) sequencing labs are using for image backup.
              Have you considered deleting the images? SRA does not require them for submission, and the only program I can think of that uses images (beyond Illumina's basecaller) is the bisulfite alignment tool (Cokus et al., Nature 2008). It might be a good idea to buffer the data, for example by storing the last month's images, but beyond that is there a specific purpose?
              Last edited by nilshomer; 02-23-2010, 01:15 AM. Reason: clarity


              • #8
                I'd second the thought of just having a pipeline with just-in-time deletion. As you run out of space, remove the oldest images. That way you can keep data for as long as is feasible, pruning it only when you absolutely have to.
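A rough sketch of what such just-in-time pruning could look like, assuming bash, one sub-directory per run under a single images directory, and directory names without newlines. The function name `prune_images` and the size budget in KB are illustrative choices, not part of any Illumina pipeline. It deletes data, so try it on a scratch copy first.

```shell
#!/usr/bin/env bash
# prune_images DIR MAX_KB: delete the oldest entries of DIR (by modification
# time) until the total size of DIR is MAX_KB kilobytes or less.
prune_images() {
    local dir=$1 max_kb=$2 used oldest
    [ -d "$dir" ] || return 1
    while :; do
        used=$(du -sk "$dir" | cut -f1)
        [ "$used" -le "$max_kb" ] && break
        # ls -tr lists the oldest entry first.
        oldest=$(ls -1tr "$dir" | head -n 1)
        [ -z "$oldest" ] && break
        echo "removing $dir/$oldest"
        # ${dir:?} aborts instead of expanding to "/" if dir is ever empty.
        rm -rf -- "${dir:?}/$oldest" || return 1
    done
}
```

Called from cron as something like `prune_images /data/images 2000000000` (path and budget are placeholders), this keeps images for as long as the budget allows and removes only the oldest runs.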

                If you really want to archive them, then you need to weigh up the cost of CPU time and/or the cost of commercial solutions (e.g. Ocarina) vs tape costs. The images don't compress well simply because they have a lot of noise. While it's possible to write dedicated compressors to model the signal vs the noise to try and improve on compression ratios, you still won't get particularly good compression ratios unless you want to move to a lossy compression system, as the amount of noise is quite high.
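The effect of noise on lossless compression is easy to see with synthetic data. This is only an illustration, assuming bash, `gzip`, and the `/dev/urandom` and `/dev/zero` devices: random bytes stand in for pure noise and zeros for a perfectly flat image, so real TIFFs will fall somewhere in between. `compressed_size` is a made-up helper name.

```shell
#!/usr/bin/env bash
# compressed_size SOURCE N: number of bytes left after gzip -9 is applied to
# the first N bytes of SOURCE.
compressed_size() {
    echo $(( $(head -c "$2" "$1" | gzip -9 | wc -c) ))
}

noise=$(compressed_size /dev/urandom 1000000)   # pure noise
flat=$(compressed_size /dev/zero 1000000)       # no noise at all
echo "1000000 bytes of noise compress to $noise bytes"
echo "1000000 bytes of zeros compress to $flat bytes"
```

The noisy input should come out at roughly its original size, while the flat input shrinks to a tiny fraction of it.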

                The "ideal" lossy compression would start with a system which extracts as much of the signal as it can while leaving as much of the noise behind, and then compresses that signal using the usual techniques. One could argue that a decent first stab at this tool already exists: it's the Illumina Firecrest program coupled to compression of the output (e.g. gzip, bzip2, or SRF). Furthermore, it is this data that the trace archives, SRA and ERA, want to receive.

                James


                • #9
                  Hi Jorge,
                  I'd agree with the above posts about deleting images. Do you want to keep them for further analysis?
                  The latest SCS allows you to save all, or a subset of, tiles as thumbnails. These are really nice images for troubleshooting, and we have stopped keeping images even for a few days. The RTA deletes them once it has completed analysing intensities.
                  James.


                  • #10
                    Originally posted by james hadfield
                    Hi Jorge,
                    I'd agree with the above posts about deleting images. Do you want to keep them for further analysis?
                    The latest SCS allows you to save all, or a subset of, tiles as thumbnails. These are really nice images for troubleshooting, and we have stopped keeping images even for a few days. The RTA deletes them once it has completed analysing intensities.
                    James.
                    Originally posted by nilshomer
                    Have you considered deleting the images? SRA does not require them for submission, and the only program I can think of that uses images (beyond Illumina's basecaller) is the bisulfite alignment tool (Cokus et al., Nature 2008). It might be a good idea to buffer the data, for example by storing the last month's images, but beyond that is there a specific purpose?
                    Actually, the keep-the-images policy isn't my personal decision. We started sequencing only recently (we're about to perform our first run) and I think it's a conservative position. I suppose reality (the cost in time and money) will force us to adapt this policy to the available resources and needs.


                    • #11
                      Originally posted by jkbonfield
                      I'd second the thought of just having a pipeline with just-in-time deletion. As you run out of space, remove the oldest images. That way you can keep data for as long as is feasible, pruning it only when you absolutely have to.

                      If you really want to archive them, then you need to weigh up the cost of CPU time and/or the cost of commercial solutions (e.g. Ocarina) vs tape costs. The images don't compress well simply because they have a lot of noise. While it's possible to write dedicated compressors to model the signal vs the noise to try and improve on compression ratios, you still won't get particularly good compression ratios unless you want to move to a lossy compression system, as the amount of noise is quite high.

                      The "ideal" lossy compression would start with a system which extracts as much of the signal as it can while leaving as much of the noise behind, and then compresses that signal using the usual techniques. One could argue that a decent first stab at this tool already exists: it's the Illumina Firecrest program coupled to compression of the output (e.g. gzip, bzip2, or SRF). Furthermore, it is this data that the trace archives, SRA and ERA, want to receive.

                      James
                      Your suggestion to delete old images when space runs out is probably (really, surely) the most cost-effective solution but, as I said above, it's not my decision. I can make suggestions upwards, but I can't set the policy entirely on my own. For the moment, keeping the images is a must; later, we will probably have to adopt a more realistic (and pragmatic) policy.

                      However, all your comments are useful to me, as they help me collect the practices of more experienced groups.

                      Thanks to all of you.


                      • #12
                        Originally posted by jorgebm
                        Your suggestion to delete old images when space runs out is probably (really, surely) the most cost-effective solution but, as I said above, it's not my decision. I can make suggestions upwards, but I can't set the policy entirely on my own. For the moment, keeping the images is a must; later, we will probably have to adopt a more realistic (and pragmatic) policy.

                        However, all your comments are useful to me, as they help me collect the practices of more experienced groups.

                        Thanks to all of you.
                        Keeping images, IMHO, is not an option.
                        You should keep all the intensities (much smaller) until you have QA'd your run. After that you should remove them and keep only the raw reads + qualities in a single BAM file. You can also keep the summary.(html|xml) and the associated PNG plots.
                        -drd


                        • #13
                          Keeping images only makes sense if you plan to reanalyze them with other software (SWIFT or the upcoming next phred).
