SEQanswers

Old 01-04-2012, 09:38 PM   #1
gendxdoc
Member
 
Location: Seattle

Join Date: Mar 2008
Posts: 12
Default Long Term Data Storage

I am in the process of setting up an NGS core facility. I will be starting with a single HiSeq 1000 with an IlluminaCompute Tier0 analysis server. In a past life, I ran an NGS facility in which we had a "medium-term" storage server and a long-term tape backup system. File sizes have gotten so large that I'm not sure how practical it is to back up data on tape, or to deal with the hassle of putting data on tape and retrieving it if it is needed again in the future.

A few questions for all of you:
1. What data are you keeping?
-- keeping BCL = 330 GB
-- keeping BAM = 330 GB
-- total = 660 GB per run (paired end, 2 x 101 bp)
2. What long-term data storage media are you using?
3. I am a geneticist/biologist, not an IT professional. What would be the easiest solution for me? (At some point, I will be hiring an informaticist/computational biologist.)
4. Would it be easier to store on external drives?
5. Do any of you back up data and send it to another facility for storage, such as Iron Mountain?

Any advice you can give would be appreciated.
Thank you,
Michael
Old 01-04-2012, 10:14 PM   #2
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

In a very few years you will save only the DNA libraries.
Old 01-04-2012, 10:48 PM   #3
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

It seems a pragmatic solution to cost in a terabyte disk per sequencing run and use that as backup, assuming you have a place to store the disks.

You might look into BaseSpace (Illumina's cloud solution), which I understand should be available for the HiSeq.
Old 01-09-2012, 07:00 AM   #4
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 401
Default

We decided against external drives because of
a) space
b) organisation
c) lack of mirroring (RAID)

The last point is the most critical because we are required to save data for 10 years at the University. This can (hopefully) be guaranteed by tape and (maintained) RAID backups, but not by off-the-shelf external HDs.

We also have tape and spatially separate hard drive backups in case the server room burns down.
Old 01-09-2012, 07:07 AM   #5
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Two (bare) disks, two separate locations?
Old 01-09-2012, 07:08 AM   #6
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Also, these kinds of blanket University data policies don't make sense in the context of sequencing. They should understand the problem first, then make a data retention policy.
Old 01-09-2012, 08:01 AM   #7
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699
Default The new floppy disk ...

These are the new keychain USBs for large data:

http://www.newegg.com/Product/Produc...20and%20higher

.... >1TB portable hard drives.

Just buy enough to make 2 or 3 backups. Keep the backups in separate locations and verify each copy.

This is labor intensive.
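
The "verify" step can be partly automated. Below is a minimal sketch, assuming the run folder and the backup disk are both mounted as ordinary directories; the function names are hypothetical and not part of any Illumina tool. It hashes every file on both sides and reports anything missing or different on the backup.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(source: Path, backup: Path) -> list:
    """Return relative paths that are missing or differ on the backup."""
    expected = build_manifest(source)
    actual = build_manifest(backup)
    return [rel for rel, digest in expected.items() if actual.get(rel) != digest]
```

Calling `verify_copy(Path("/data/run_folder"), Path("/mnt/backup1/run_folder"))` (both paths are placeholders) returns an empty list when the copy is intact. Hashing several hundred GB takes a while, so this reduces the manual checking but not the waiting.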
Old 01-09-2012, 08:05 AM   #8
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Richard - not sure I understood your message. You seem to be suggesting these are USB flash solutions, but you actually linked to regular hard disks with USB interfaces. It is true that there are 1TB flash disks, but they are currently about $2000.
Old 01-09-2012, 08:08 AM   #9
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699
Default

Yep. The greater than 1TB portable hard drive is the new floppy disk.
Old 01-09-2012, 08:11 AM   #10
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Ah right, got confused by the term "keychain" which made me think of flash disks. But yes, I agree, and judicious use of USB disks is a very cost-effective storage solution in my opinion. I am certainly never going back to tape backup!
Old 01-09-2012, 08:12 AM   #11
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

The nice thing about USB disks is that if your sequencer dumps out 1TB of data per run, then cost in 2 x 1TB USB disks per run and you have a resilient backup solution. Given that a HiSeq run might be $10,000 of consumables, $200 more for the disks can be absorbed easily.
Old 01-09-2012, 08:13 AM   #12
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Contrast that with enterprise-grade solutions and you are talking more like $1000/TB plus all the administrative overhead of keeping these solutions going. Amazon S3 is another option but costs can mount up over time.
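
To put the figures quoted in this thread side by side, here is a rough sketch. The dollar amounts are the ones mentioned above (2012 prices); the 1 TB per run is an assumption, the enterprise figure ignores administrative overhead, and it covers one copy rather than two.

```python
# Figures quoted in this thread (2012 prices); adjust for your own site.
RUN_CONSUMABLES_USD = 10_000   # approximate consumables for one HiSeq run
USB_PAIR_USD = 200             # two 1 TB USB disks, i.e. two copies
ENTERPRISE_USD_PER_TB = 1_000  # enterprise-grade storage, hardware only
RUN_SIZE_TB = 1.0              # assumed data volume per run


def share_of_run_cost(storage_usd: float, run_usd: float = RUN_CONSUMABLES_USD) -> float:
    """Storage cost as a fraction of the run's consumables cost."""
    return storage_usd / run_usd


usb_share = share_of_run_cost(USB_PAIR_USD)
enterprise_share = share_of_run_cost(ENTERPRISE_USD_PER_TB * RUN_SIZE_TB)

print(f"Two USB disks:     {usb_share:.0%} of run cost")
print(f"Enterprise (1 TB): {enterprise_share:.0%} of run cost")
```

With these inputs the two disks come to about 2% of the run cost and the enterprise option to about 10%, before any staff time is counted.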
Old 01-09-2012, 04:39 PM   #13
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 823
Default

With a 2.5" hard drive as your file backup, storing the samples may almost end up taking more room than storing the data.

I agree with the purchase of 2 hard drives for each run. The university then has a visual idea of how their 10-year policy is working out, and the hard drives won't use any power when they're not plugged into anything (unlike a dedicated network backup, which will consume power on the off chance that you'll want a 5kb file from your 7-year-old sequencing data with latency of less than a second).
Old 01-09-2012, 10:31 PM   #14
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 620
Default

What about the CIFs and corresponding files? There are situations where there is a need to externally re-basecall the data with Bustard. With BCLs alone this is not possible.
Storing CIFs plus corresponding files takes up to 3.5 TB per HiSeq flowcell ...

IMHO USB disks are not suited to such an amount of data (especially when you are running more than one machine).
Old 01-09-2012, 11:33 PM   #15
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

I've heard that the HiSeq automatically discards the image files. Isn't that true?

Anyway, I don't think it makes sense to save both .bcl and .fastq files, as they can easily be converted (at least from bcl to fastq; I don't know about the other way round).

And 330 GB could easily be saved on a hard disk (that would be 3 runs per TB, right?).
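
The runs-per-disk arithmetic can be sketched as below. It assumes runs are never split across disks and treats a "1 TB" disk as 1000 GB of usable space, which is slightly optimistic once the disk is formatted; the function name and the 30-run example are hypothetical.

```python
import math


def disks_needed(n_runs: int, run_gb: float, disk_gb: float = 1000, copies: int = 2) -> int:
    """Disks required for n_runs, keeping `copies` copies of each run."""
    runs_per_disk = int(disk_gb // run_gb)  # whole runs that fit on one disk
    if runs_per_disk < 1:
        raise ValueError("a single run does not fit on one disk")
    return math.ceil(n_runs / runs_per_disk) * copies


# 330 GB per run (one file type only): 3 runs fit on each 1 TB disk.
print(disks_needed(30, 330))
# 660 GB per run (BCL plus BAM): only 1 run fits on each 1 TB disk.
print(disks_needed(30, 660))
```

For 30 runs with two copies each, that is 20 disks at 330 GB per run and 60 disks at 660 GB per run.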
Old 01-10-2012, 12:16 AM   #16
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 823
Default

Quote:
What about the CIFs and corresponding files? There are situations where there is a need to externally re-basecall the data with Bustard. With BCLs alone this is not possible.
Storing CIFs plus corresponding files takes up to 3.5 TB per HiSeq flowcell ...
Both our HiScanSQ and SOLiD 4 chuck out the image files by default once they've been converted, but store the intensity files for re-calling bases. The solutions here are for data backup, and if you haven't worked out the most likely sequence for analysis (from the intensity files) before you do the backup, there's unlikely to be any need for that post-backup.

Quote:
IMHO USB disks are not suited to such an amount of data (especially when you are running more than one machine).
You can get hot-swap SATA drive bays, which make the process easier than using a USB enclosure. There's no need to involve the slower USB connection:

http://www.pc-pitstop.com/sata_enclosures/

Last edited by gringer; 01-10-2012 at 12:24 AM. Reason: added link to hot-swap SATA enclosures
Old 01-10-2012, 12:45 AM   #17
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 620
Default

Quote:
Originally Posted by gringer View Post
Both our HiScanSQ and SOLiD 4 chuck out the image files by default once they've been converted, but store the intensity files for re-calling bases. The solutions here are for data backup, and if you haven't worked out the most likely sequence for analysis (from the intensity files) before you do the backup, there's unlikely to be any need for that post-backup.
Perfectly true if it is your own data; if you provide data as a sequencing core facility for "external" people, this is not necessarily true :-(

Quote:
You can get hot-swap SATA drive bays, which make the process easier than using a USB enclosure. There's no need to involve the slower USB connection:
http://www.pc-pitstop.com/sata_enclosures/
Yes, we are not using external USB disks ... we are using SATA JBODs.