**GenoMax** · 07-15-2015, 08:45 AM

How many runs do you expect to do each month, or over a year? Have you thought about a long-term archival storage solution (or do you not expect a need for that)? Are you going to use the on-board software on the MiSeq to do the demultiplexing, or would BaseSpace be in play?

**flyinglotus** · 07-15-2015, 09:31 AM

We will probably be performing 4-5 runs a month, with room for growth. We have not really thought about archival storage yet, but the MiSeq itself has a 750 GB hard drive. I don't feel comfortable keeping all of the data for a given month or few months on that, so I imagine we will clean it out periodically and move the data to network storage. How much extra data is produced downstream, and how large are the files? I imagine we won't duplicate the MiSeq files before moving them to the analysis environment.

Thanks for the insight.

**GenoMax** · 07-15-2015, 09:59 AM

Depending on how you set up the runs (number of cycles, SE vs PE, etc.), the size of the original data folder will vary, but you can expect it to fall somewhere between these values: ~12 GB for a 50x7 run to ~60 GB for a 300x8x8x300 run. After demultiplexing (bcl2fastq) the size will increase by about 50% (so the data folders would become ~18 to ~80 GB in the example above). We don't use the on-board MiSeq software, but I expect that if you did, the folder sizes would be similar to the final sizes above.
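To turn those per-run figures into a yearly budget, here is a rough back-of-the-envelope sketch. It assumes a flat 50% increase after demultiplexing and uses the 4-5 runs a month mentioned earlier in the thread; the function and parameter names are made up for illustration. Note that a flat 50% puts the largest run at ~90 GB, a little above the ~80 GB quoted, so treat the upper bound as conservative.

```python
def estimate_storage_gb(runs_per_month, raw_run_gb, demux_overhead=0.5, months=12):
    """Rough storage estimate (GB) for run folders after demultiplexing.

    demux_overhead is the assumed fractional size increase from bcl2fastq
    output (0.5 means the run folder grows by about 50%).
    """
    per_run_gb = raw_run_gb * (1 + demux_overhead)
    return runs_per_month * months * per_run_gb


# Smallest case: 4 runs a month of ~12 GB runs.
low = estimate_storage_gb(runs_per_month=4, raw_run_gb=12)
# Largest case: 5 runs a month of ~60 GB runs.
high = estimate_storage_gb(runs_per_month=5, raw_run_gb=60)

print(f"Estimated yearly storage: {low:.0f} GB to {high:.0f} GB")
```

Under these assumptions the yearly total lands between roughly 0.9 TB and 5.4 TB, before any archival copies or downstream analysis files are counted.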

**kmcarr** · 07-16-2015, 05:00 AM

Originally posted by flyinglotus

We will probably be performing 4-5 runs a month, with room for growth. We have not really thought about archival storage yet, but the MiSeq itself has a 750 GB hard drive. I don't feel comfortable keeping all of the data for a given month or few months on that, so I imagine we will clean it out periodically and move the data to network storage. How much extra data is produced downstream, and how large are the files? I imagine we won't duplicate the MiSeq files before moving them to the analysis environment.

Thanks for the insight.

The MiSeq software can, and SHOULD, be configured to copy its data to a network storage device as it is collected, so there is no need to move the data manually. The network storage device you set up to receive the data should be fault tolerant (i.e. some type of RAID configuration), and ideally a second, archival copy is made from there immediately after the run.
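For the second, archival copy, something along these lines could work. This is only a sketch using the Python standard library, not anything provided by the MiSeq software; `archive_run` and the paths you pass to it are hypothetical. It copies a finished run folder to an archive location and then compares checksums so a silently corrupted copy gets noticed.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_run(run_dir, archive_root):
    """Copy a run folder into archive_root and verify every file.

    Returns the destination path and a list of relative paths whose
    archived copy is missing or differs from the original.
    """
    run_dir = Path(run_dir)
    dest = Path(archive_root) / run_dir.name
    # copytree raises FileExistsError if the run was already archived.
    shutil.copytree(run_dir, dest)

    mismatched = []
    for src in sorted(p for p in run_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(run_dir)
        copy = dest / rel
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            mismatched.append(str(rel))
    return dest, mismatched
```

An empty `mismatched` list is the signal that the archived copy matches the original; only then would it make sense to clear the run off the instrument or the primary share.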

**dgaston** · 07-16-2015, 05:53 AM

As well, you should consider what data you actually need to keep. If you set up your analyses well, with an actual software-defined pipeline of some sort that you version (along with all software components used in the pipeline), then you can recreate downstream files. That means you generally keep/archive:

1) Raw input data (this could be BCL files, but you may reasonably opt to keep just the de-multiplexed FASTQ files). This is generally quite a bit smaller than the complete run output from a MiSeq.

2) Detailed documentation of the workflow that was run on the data. You separately archive all your software, pipelines, databases, etc. (in a versioned manner).

3) Your final results (and even this isn't absolutely required, particularly for archiving).

You should structure everything so you can recreate your analysis and all downstream results files, exactly, at any time. Granted, this is harder when you are using commercial software, since you have little control over version changes and updates and often cannot keep old copies around. But you still want to strive for reproducibility.
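One lightweight way to document what was run is a small manifest stored next to the archived FASTQ files. The sketch below is just one possible layout: the field names, the `*.fastq.gz` pattern, and the tool names and versions in the usage lines are placeholders, not anything CLC or Illumina produces.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(fastq_dir, pipeline_version, tool_versions):
    """Record input checksums plus the pipeline and tool versions used."""
    inputs = {
        fastq.name: file_sha256(fastq)
        for fastq in sorted(Path(fastq_dir).glob("*.fastq.gz"))
    }
    return {
        "pipeline_version": pipeline_version,
        "tools": dict(tool_versions),
        "inputs": inputs,
    }


def write_manifest(manifest, out_path):
    """Write the manifest as stable, human-readable JSON."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Usage might look like this, with made-up version strings:

```python
manifest = build_manifest("fastq", "v1.2.0", {"example_aligner": "0.0.1"})
write_manifest(manifest, "fastq/manifest.json")
```

With the checksums recorded, you can later confirm that the archived inputs are the same files the results were generated from, even if the results themselves were not kept.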

Otherwise everything you have set up seems to be on the right track. The exact specs of your workstation depend on the analyses you will do within CLC Workbench. I would go with at least a few TB of RAIDed storage on the workstation itself. If you haven't already bought it, Qiagen/CLCBio has a collaboration with PSSC Labs. PSSC builds a configurable workstation, but I believe you can also still order the whole thing as a turn-key solution from CLC Bio.

**flyinglotus** · 07-28-2015, 04:46 PM

Thank you all for your responses. We are looking into our options for the downstream analysis, and most likely we will keep only the FASTQ files and potentially the BAM files. We feel that all of the intermediate files (most likely generated by CLC) can probably be discarded.
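A retention rule like that can be checked with a dry run before anything is deleted. The sketch below only sorts files into two lists and removes nothing; the suffix list (including the `.bai` index files) and the function name are assumptions for illustration.

```python
from pathlib import Path

# Assumed suffixes worth keeping: FASTQ, BAM, and BAM index files.
KEEP_SUFFIXES = (".fastq", ".fastq.gz", ".fq.gz", ".bam", ".bai")


def split_keep_discard(project_dir, keep_suffixes=KEEP_SUFFIXES):
    """Partition files under project_dir into (keep, discard) lists.

    Nothing is deleted; the lists are meant to be reviewed first.
    """
    keep, discard = [], []
    for path in sorted(Path(project_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.name.lower().endswith(keep_suffixes):
            keep.append(path)
        else:
            discard.append(path)
    return keep, discard
```

Reviewing the `discard` list by hand a few times is a cheap way to find out whether any intermediate files are worth keeping after all.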

Will update when we have started generating data.


Illumina MiSeq file size/downstream analysis question
