  • Illumina MiSeq file size/downstream analysis question

    We are starting to organize the infrastructure our lab will need to bring in NGS. We will be running a 15 kb panel on the MiSeq using v3 reagents, generating ~10-15 Gb of sequence per run.

    Our downstream analysis will be in CLC Genomics Workbench. My understanding is that we will demultiplex our MiSeq files, import them into the Workbench software on our custom tower, and process from there.

    Does anyone have experience with the CLC Genomics Workbench workflow for Illumina platforms? Our analysis computer will have ~4 TB of storage, and we were thinking of adding ~10-15 TB of network storage.

    In addition to the ~15 Gb of MiSeq data, is there any way to estimate the size and number of files we will generate in CLC while working towards the final VCF?

    Sorry for the long-winded question; any information will help greatly.

  • #2
    How many runs do you expect to do each month/over a year? Have you thought about a long-term archival storage solution (or do you not expect a need for that)? Are you going to use the on-board software on the MiSeq to do the demultiplexing, or would BaseSpace be in play?



    • #3
      We will probably be performing 4-5 runs a month, with room for growth. We have not really thought about archival storage yet, but the MiSeq itself has a 750 GB hard drive. I don't feel comfortable keeping all of the data for a given month or few months on that, so I imagine we will clean it out periodically and move the data onto network storage. How much extra data is produced downstream, and how large are those files? I imagine we won't duplicate the MiSeq files before moving them to the analysis environment.

      Thanks for the insight.



      • #4
        Depending on how you schedule the runs (number of cycles, SE vs. PE, etc.), the size of the original data folder will vary, but you can expect it to fall somewhere between these values (e.g. 50x7 ~12 GB to 300x8x8x300 ~60 GB). After demultiplexing (bcl2fastq) the size will increase by about 50%, so the data folders would become ~18 to ~80 GB in the example above. We don't use the on-board MiSeq software, but if you did, I expect the folder sizes would be similar to the final sizes above.
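
        To put those per-run figures against the 4-5 runs/month mentioned earlier in the thread, here is a quick back-of-the-envelope projection. This is a sketch only: the per-run sizes are the rough post-demultiplexing range quoted above, and the run count is an assumption taken from post #3, not a measured value.

```python
# Rough storage projection from the figures above. PER_RUN_GB is the
# post-demultiplexing folder size range (~18-80 GB) quoted in this post;
# RUNS_PER_MONTH is the 4-5 runs/month figure from earlier in the thread.
PER_RUN_GB = (18, 80)     # low / high estimate per demultiplexed run
RUNS_PER_MONTH = 5

for per_run in PER_RUN_GB:
    per_month = per_run * RUNS_PER_MONTH
    per_year_tb = per_month * 12 / 1000
    print(f"{per_run} GB/run -> {per_month} GB/month, ~{per_year_tb:.1f} TB/year")
```

        At the high end that works out to roughly 5 TB a year, so the ~10-15 TB of network storage mentioned in the original post would cover about two to three years before archiving or pruning becomes necessary.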



        • #5
          Originally posted by flyinglotus:
          We will probably be performing 4-5 runs a month, with room for growth. We have not really thought about archival storage yet, but the MiSeq itself has a 750 GB hard drive. I don't feel comfortable keeping all of the data for a given month or few months on that, so I imagine we will clean it out periodically and move the data onto network storage. How much extra data is produced downstream, and how large are those files? I imagine we won't duplicate the MiSeq files before moving them to the analysis environment.

          Thanks for the insight.
          The MiSeq software can, and SHOULD, be configured to copy its data to a network storage device as it is collected; there is no need to move the data manually. The network storage device you set up to receive the data should be fault tolerant (i.e. some type of RAID configuration), and ideally a second, archival copy is made from there immediately after the run.
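
          A minimal sketch of that second, archival copy step is below: copy a finished run folder to a second location and verify every file with an MD5 checksum. The paths are hypothetical placeholders, and the instrument-to-network copy itself is handled by the MiSeq software, not by a script like this.

```python
# Archive a finished run folder and checksum-verify the copy.
import hashlib
import shutil
from pathlib import Path

def md5sum(path: Path) -> str:
    """Return the MD5 digest of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def archive_run(run_dir: Path, archive_root: Path) -> None:
    """Copy run_dir under archive_root, then verify each file's checksum."""
    dest = archive_root / run_dir.name
    shutil.copytree(run_dir, dest)
    for src in run_dir.rglob("*"):
        if src.is_file():
            copied = dest / src.relative_to(run_dir)
            if md5sum(src) != md5sum(copied):
                raise IOError(f"checksum mismatch: {copied}")

# Hypothetical mount points -- substitute your own:
# archive_run(Path("/mnt/runs/miseq_run_folder"), Path("/mnt/archive"))
```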



          • #6
            As well, you should consider what data you actually need to keep. If you set up your analyses well, with an actual software-defined pipeline of some sort that you version (along with all software components used in the pipeline), then you can recreate downstream files. Meaning you generally keep/archive:

            1) Raw input data (this could be BCL files, but you may reasonably opt to keep just the demultiplexed FASTQ files). This is generally quite a bit smaller than the complete run output from a MiSeq.

            2) Detailed documentation of the workflow that was run on the data. You separately archive all your software, pipelines, databases, etc. in a versioned manner (see the sketch below).

            3) Your final results (and even this isn't absolutely required, particularly for archiving).

            You should structure everything so you can recreate your analysis and all downstream result files, exactly, at any time. Granted, this is harder when you are using commercial software and have little control over version changes and updates, in terms of keeping old copies around. But you should still strive towards reproducibility.
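
            A minimal sketch of the workflow record described in point 2) might look like the following: a JSON manifest of tool versions and input checksums written alongside each analysis, so the run can be reconstructed later. The tool names and version strings here are illustrative placeholders, not CLC's actual versioning.

```python
# Write a provenance manifest recording when an analysis ran,
# with which tool versions, and on which input files.
import hashlib
import json
import time
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(inputs, tools, out_path):
    """Record timestamp, tool versions, and input checksums as JSON."""
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "tools": tools,  # e.g. {"CLC Genomics Workbench": "x.y"}
        "inputs": [{"file": str(p), "sha256": sha256sum(Path(p))}
                   for p in inputs],
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2))

# write_manifest(["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"],
#                {"CLC Genomics Workbench": "x.y"},
#                "analysis_manifest.json")
```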

            Otherwise, everything you have set up seems on the right track. The exact specs of your workstation depend on the analyses you will do within CLC Workbench; I would go with at least a few TB of RAIDed storage on the workstation itself. If you haven't already bought it, Qiagen/CLC bio has a collaboration with PSSC Labs: PSSC builds a configurable workstation themselves, and I believe you can still order the whole thing as a turn-key solution from CLC bio.



            • #7
              Thank you all for your responses. We are looking into our options for the downstream analysis and feel we will most likely keep only the FASTQ files and potentially the BAM files. All of the intermediate files (generated in CLC, most likely) are probably discardable.
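
              A hypothetical dry-run of that retention policy is sketched below: report which files in an analysis directory would be kept (FASTQ/BAM, plus the final VCF per the advice above) versus discarded as intermediates. It deletes nothing; the directory path is a placeholder.

```python
# Dry-run retention report: classify files by suffix, delete nothing.
from pathlib import Path

KEEP_SUFFIXES = (".fastq", ".fastq.gz", ".bam", ".bai", ".vcf", ".vcf.gz")

def retention_report(analysis_dir: Path) -> None:
    """Print which files would be kept vs. discarded under the policy."""
    keep, discard = [], []
    for f in sorted(analysis_dir.rglob("*")):
        if f.is_file():
            (keep if f.name.endswith(KEEP_SUFFIXES) else discard).append(f)
    print(f"keep ({len(keep)} files):")
    for f in keep:
        print(f"  {f}")
    print(f"discard ({len(discard)} files):")
    for f in discard:
        print(f"  {f}")

# retention_report(Path("/mnt/analysis/some_run"))
```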

              Will update when we have started generating data.
