Similar Threads:

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Question: What size should I expect from a SFF file? | create.share | Bioinformatics | 1 | 05-18-2014 11:00 AM |
| How does insert size affect downstream bioinfo analysis? | ymc | Bioinformatics | 0 | 02-10-2014 03:26 AM |
| Illumina Miseq bam file | Pennaki | Bioinformatics | 7 | 03-30-2013 02:14 AM |
| Filtering Illumina data to reduce file size | Mona | Bioinformatics | 5 | 10-11-2012 05:19 PM |
| Illumina file format question | marcela5555 | General | 1 | 11-12-2010 08:44 AM |
#1
Junior Member | Location: USA | Join Date: Jul 2015 | Posts: 3
At my lab we are starting to organize all of the infrastructure we will need for bringing NGS in-house. We will be running a 15 kb panel on the MiSeq using v3 reagents, generating ~10-15 Gb of sequence per run.

Our downstream analysis will be in CLC Genomics Workbench. My understanding is that we will demultiplex our MiSeq files, import them into the Workbench software on our custom tower, and process from there. Does anyone have experience with the CLC Genomics Workbench workflow for the Illumina platforms? Our analysis computer will have ~4 TB of storage, and we were thinking of obtaining ~10-15 TB of network storage. Beyond the ~15 Gb of MiSeq data per run, is there any way to estimate the size and number of files we will generate in CLC while we work towards the final VCF? Sorry for the long-winded question; any information will help greatly.
#2
Senior Member | Location: East Coast USA | Join Date: Feb 2008 | Posts: 7,088
How many runs do you expect to do each month and over a year? Have you thought about a long-term archival storage solution (or do you not expect to need one)? Are you going to use the on-board MiSeq software to do the demultiplexing, or would BaseSpace be in play?
#3
Junior Member | Location: USA | Join Date: Jul 2015 | Posts: 3
We will probably be performing 4-5 runs a month, with room for growth. We have not really thought about archive storage yet, but the MiSeq itself has a 750 GB hard drive. I don't feel comfortable keeping several months' worth of data on that, so I imagine we will clean it out periodically and move the data to network storage. How much extra data, and how large are the files, produced downstream? I imagine we won't duplicate the MiSeq files before moving them to the analysis environment.

Thanks for the insight.
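As a quick sanity check on that 750 GB drive, here is a minimal sketch of the arithmetic; the run-folder size used below is an assumption (raw MiSeq folders typically land in the tens of GB, and the exact figure depends on your cycle/read configuration):

```python
# How quickly the MiSeq's on-board 750 GB drive fills up.
# The run-folder size passed in is a ballpark assumption, not a measurement.

DISK_GB = 750  # on-board MiSeq storage mentioned above

def runs_until_full(run_folder_gb, disk_gb=DISK_GB):
    """Number of whole run folders that fit on the instrument drive."""
    return disk_gb // run_folder_gb

print(runs_until_full(50))  # 15 runs, i.e. about 3 months at 4-5 runs/month
```

So at 4-5 runs a month, cleaning the instrument out quarterly (or sooner) looks about right under these assumptions.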
#4
Senior Member | Location: East Coast USA | Join Date: Feb 2008 | Posts: 7,088
Depending on how you schedule the runs (number of cycles, SE vs. PE, etc.), the size of the original data folder will vary, but you can expect it to fall somewhere between these values: e.g., a 50x7 run is ~12 GB and a 300x8x8x300 run is ~60 GB. After demultiplexing (bcl2fastq) the size will increase by about 50%, so the data folders in the example above would grow to roughly ~18 GB to ~80 GB. We don't use the on-board MiSeq software, but if you did, I expect the folder sizes would be similar to the final sizes above.
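The ballpark arithmetic in this post can be sketched as follows; the ~50% demultiplexing overhead is the estimate quoted above, and the runs-per-month figure is just a placeholder:

```python
# Rough MiSeq storage estimator based on the figures quoted in this post.
# All sizes in GB; these are ballpark assumptions, not measurements.

def monthly_storage_gb(raw_run_gb, runs_per_month, demux_overhead=0.5):
    """Estimated GB consumed per month once runs are demultiplexed.

    raw_run_gb     -- raw run folder size (~12 GB for a short SE run,
                      up to ~60 GB for a 300x8x8x300 PE run)
    demux_overhead -- bcl2fastq adds roughly 50% on top of the raw folder
    """
    per_run = raw_run_gb * (1 + demux_overhead)
    return per_run * runs_per_month

# Worst case from this thread: ~60 GB raw folders, 5 runs/month.
print(monthly_storage_gb(60, 5))  # 450.0 GB/month, i.e. ~5.4 TB/year
```

Under those worst-case assumptions, the ~10-15 TB of network storage mentioned earlier covers roughly two to three years of raw plus demultiplexed data.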
#5
Senior Member | Location: USA, Midwest | Join Date: May 2008 | Posts: 1,178
Quote:
#6
Junior Member | Location: Halifax, NS Canada | Join Date: Dec 2012 | Posts: 4
As well, you should consider what data you actually need to keep. If you set up your analyses well, with an actual software-defined pipeline of some sort that you version (along with all software components used in the pipeline), then you can recreate the downstream files. That means you generally keep/archive:

1) The raw input data. This could be the BCL files, but you may reasonably opt to keep just the demultiplexed FASTQ files, which are generally quite a bit smaller than the complete run output from a MiSeq.
2) Detailed documentation of the workflow that was applied to the data. You separately archive all your software, pipelines, databases, etc. in a versioned manner.
3) Your final results (and even this isn't absolutely required, particularly for archiving).

You should structure everything so you can recreate your analysis, and all downstream result files, exactly, at any time. Granted, this is harder when you are using commercial software, since you have little control over version changes and updates and over keeping old copies around. But you should still strive for reproducibility.

Otherwise, everything you have set up seems on the right track. The exact specs of your workstation depend on the analyses you will do within CLC Workbench; I would go with at least a few TB of RAIDed storage on the workstation itself. If you haven't already bought it, Qiagen/CLC bio has a collaboration with PSSC Labs: PSSC builds a configurable workstation themselves, and I believe you can still order the whole thing as a turn-key solution from CLC bio.
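A minimal sketch of the "archive the raw FASTQs with enough metadata to verify and reproduce them" idea from the list above. The file layout and the pipeline-version string are hypothetical examples, not anything mandated by CLC or Illumina:

```python
# Archive-manifest sketch: checksum each FASTQ in a run folder and record
# the pipeline version alongside, so an archived run can be verified later
# and the analysis re-created. Paths and the version string are examples.
import hashlib
import json
import os

def file_md5(path, chunk=1 << 20):
    """MD5 of a file, read incrementally so large FASTQs fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def write_manifest(fastq_dir, pipeline_version, out="manifest.json"):
    """Checksum every FASTQ in fastq_dir and write a JSON manifest."""
    entries = {
        name: file_md5(os.path.join(fastq_dir, name))
        for name in sorted(os.listdir(fastq_dir))
        if name.endswith((".fastq", ".fastq.gz"))
    }
    manifest = {"pipeline_version": pipeline_version, "files": entries}
    with open(out, "w") as fh:
        json.dump(manifest, fh, indent=2)
    return manifest
```

Re-running `file_md5` on the archived copies and comparing against the manifest is then enough to confirm the raw inputs survived the move to network storage intact.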
#7
Junior Member | Location: USA | Join Date: Jul 2015 | Posts: 3
Thank you all for your responses. We are looking into our options for downstream analysis, and we feel we will most likely keep only the FASTQ files, and potentially the BAM files. All of the intermediate files (most likely generated in CLC) are probably discardable.

Will update when we have started generating data.