SEQanswers

Old 07-15-2015, 09:39 AM   #1
flyinglotus
Junior Member
 
Location: USA

Join Date: Jul 2015
Posts: 3
Illumina MiSeq file size/downstream analysis question

We are starting to organize the infrastructure our lab will need to bring in NGS. We will be running a 15 kb panel on the MiSeq using v3 reagents, generating ~10-15 Gb of sequence per run.

Our downstream analysis will be in CLC Genomics Workbench. It is my understanding that we will demultiplex our MiSeq files, import them into the Workbench software on our custom tower, and process from there.

Does anyone have experience with the CLC Genomics Workbench workflow starting from the Illumina platforms? Our analysis computer will have ~4 TB of storage, and we were thinking of obtaining ~10-15 TB of network storage.

In addition to the ~15 Gb of MiSeq data, is there any way to estimate the size and number of files that we will generate in CLC while we work towards the final VCF?

Sorry for the long-winded question. Any information will help greatly.
Old 07-15-2015, 09:45 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,989

How many runs do you expect to do each month and over a year? Have you thought about a long-term archival storage solution (or do you not expect to need one)? Are you going to use the on-board software on the MiSeq to do the demultiplexing, or would BaseSpace be in play?
Old 07-15-2015, 10:31 AM   #3
flyinglotus
Junior Member
 
Location: USA

Join Date: Jul 2015
Posts: 3

We will probably be performing 4-5 runs a month, with room for growth. We have not really thought about archive storage yet, but the MiSeq itself has a 750 GB HD. I don't feel comfortable keeping all of the data for a given month or few months on that, so I imagine we will clean it out periodically and move the data to network storage. How much extra data is produced downstream, and how large are those files? I imagine we won't duplicate the MiSeq files before moving them to the analysis environment.

Thanks for the insight.
Old 07-15-2015, 10:59 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,989

Depending on how you set up the runs (number of cycles, SE vs. PE, etc.), the size of the original data folder will vary, but you can expect it to fall somewhere between about 12 GB (e.g. 50x7) and about 60 GB (e.g. 300x8x8x300). After demultiplexing (bcl2fastq) the size will increase by about 50% (so the data folders would become ~18 to ~80 GB in the examples above). We don't use the on-board MiSeq software, but I expect that if you did, the folder sizes would be similar to the final sizes above.
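
For planning purposes, the figures in this thread can be turned into a rough yearly estimate. The sketch below is only back-of-the-envelope arithmetic: it assumes 4-5 runs a month, raw run folders of 12-60 GB, and a flat 50% increase after demultiplexing (which gives 90 GB for the largest run, slightly above the ~80 GB quoted above). The function name and parameters are made up for illustration, and it does not count any intermediate files created in CLC.

```python
# Rough yearly storage estimate from the numbers quoted in this thread.
# These are approximations from the posts, not measurements.

def yearly_storage_gb(runs_per_month, run_folder_gb, demux_overhead=0.5, copies=1):
    """Estimated GB per year of run folders after demultiplexing.

    demux_overhead: fractional size increase after bcl2fastq (assumed 50%).
    copies: number of copies kept (e.g. 2 for working + archival copy).
    """
    per_run_gb = run_folder_gb * (1 + demux_overhead)
    return runs_per_month * 12 * per_run_gb * copies

low = yearly_storage_gb(runs_per_month=4, run_folder_gb=12)
high = yearly_storage_gb(runs_per_month=5, run_folder_gb=60)
print(f"~{low / 1000:.1f} TB to ~{high / 1000:.1f} TB per year")
# prints "~0.9 TB to ~5.4 TB per year"
```

Setting copies=2 doubles both ends of the range, which is worth checking against the planned network storage if a second archival copy is kept.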
Old 07-16-2015, 06:00 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,170

Quote:
Originally Posted by flyinglotus View Post
We will probably be performing 4-5 runs a month, with room for growth. We have not really thought about archive storage yet, but the MiSeq itself has a 750 GB HD. I don't feel comfortable keeping all of the data for a given month or few months on that, so I imagine we will clean it out periodically and move the data to network storage. How much extra data is produced downstream, and how large are those files? I imagine we won't duplicate the MiSeq files before moving them to the analysis environment.

Thanks for the insight.
The MiSeq software can, and SHOULD, be configured to copy its data to a network storage device as it is collected, so there is no need to move the data manually. The network storage device you set up to receive the data should be fault tolerant (i.e. some type of RAID configuration), and ideally a second, archival copy is made from there immediately after the run.
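
The second, archival copy could be scripted in many ways. As one possible sketch (not something the MiSeq software provides), the snippet below copies a finished run folder to an archive location and compares SHA-256 checksums of every file. The function names and the example paths in the comment are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def archive_run(run_dir, archive_root):
    """Copy run_dir into archive_root and verify every file by checksum.

    Returns (destination, list of relative paths whose checksums differ).
    shutil.copytree raises FileExistsError if the destination already exists,
    which guards against overwriting an earlier archive.
    """
    run_dir = Path(run_dir)
    dest = Path(archive_root) / run_dir.name
    shutil.copytree(run_dir, dest)
    mismatched = []
    for src in run_dir.rglob("*"):
        if src.is_file():
            rel = src.relative_to(run_dir)
            if sha256sum(src) != sha256sum(dest / rel):
                mismatched.append(str(rel))
    return dest, mismatched

# Example call (hypothetical paths):
# dest, bad = archive_run("/mnt/miseq_runs/run_folder", "/mnt/archive")
```

An empty list of mismatches is a reasonable signal that the run folder on the primary storage can later be cleaned up.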
Old 07-16-2015, 06:53 AM   #6
dgaston
Junior Member
 
Location: Halifax, NS Canada

Join Date: Dec 2012
Posts: 4

As well, you should consider what data you actually need to keep. If you set up your analyses well, with an actual software-defined pipeline of some sort that you version (along with all software components used in the pipeline), then you can recreate downstream files. That means you generally keep/archive:

1) Raw input data (this could be BCL files, but you may reasonably opt to keep just the demultiplexed FASTQ files). This is generally quite a bit smaller than the complete run output from a MiSeq.

2) Detailed documentation of the workflow that was run on the data. You separately archive all your software, pipelines, databases, etc. (in a versioned manner).

3) Your final results (and even this isn't absolutely required, particularly for archiving).

You should structure everything so you can recreate your analysis and all downstream result files, exactly, at any time. Granted, this is harder when you are using commercial software, since you often have little control over version changes and updates or over keeping old copies around. But you should still strive for reproducibility.
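
One lightweight way to document what went into an analysis is a small manifest stored next to the archived FASTQ files. The sketch below is only an illustration under my own assumptions: the manifest layout, the function names, and the example version strings are invented, and the tool versions have to be filled in by hand because nothing here queries CLC or any other software.

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(fastq_dir, pipeline_version, tool_versions):
    """Collect checksums of the input FASTQ files plus version information.

    pipeline_version and tool_versions are supplied by the caller.
    """
    inputs = {
        fq.name: file_sha256(fq)
        for fq in sorted(Path(fastq_dir).glob("*.fastq.gz"))
    }
    return {
        "pipeline_version": pipeline_version,
        "tool_versions": tool_versions,
        "inputs_sha256": inputs,
    }

def write_manifest(manifest, out_path):
    """Write the manifest as sorted, indented JSON so it diffs cleanly."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))

# Example call (hypothetical values):
# m = build_manifest("fastq/", "panel-pipeline 1.0", {"workbench": "x.y.z"})
# write_manifest(m, "fastq/manifest.json")
```

With checksums of the inputs and the versions recorded, a later re-run can at least confirm it started from the same files and the same documented setup.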

Otherwise everything you have set up seems to be on the right track. The exact specs of your workstation depend on the analyses you will do within CLC Workbench. I would go with at least a few TB of RAIDed storage on the workstation itself. If you haven't already bought it, Qiagen/CLC Bio has a collaboration with PSSC Labs. PSSC builds a configurable workstation that you can order from them directly, but I believe you can also still order the whole thing as a turn-key solution from CLC Bio.
Old 07-28-2015, 05:46 PM   #7
flyinglotus
Junior Member
 
Location: USA

Join Date: Jul 2015
Posts: 3

Thank you all for your responses. We are looking into our options for the downstream analysis, and most likely we will keep only the FASTQ files and potentially the BAM files. All of the intermediate files (most likely generated in CLC) are probably discardable.

Will update when we have started generating data.
All times are GMT -8.