SEQanswers

Old 02-14-2011, 05:56 AM   #1
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 353
Short Read Archive Canned

More details here:
http://pathogenomics.bham.ac.uk/blog...rchive-canned/

Where will you submit your data now?
Old 02-14-2011, 07:20 AM   #2
Joann
Senior Member
 
Location: Woodbridge CT

Join Date: Oct 2008
Posts: 218
Omg!

Taking on the ad hoc centralization of such an important shared database resource, at the forefront of a major developing scientific field, and then just dropping it from open-access sight is the pits! NCBI, I am pointing fingers at you. At the very least, local institutional science libraries and infrastructure should have been primed to develop their own SRA capacities (staffed by their own employees, of course), since your SRA curation was clearly not founded upon reliable scientific funding commitments.
Old 02-14-2011, 07:42 AM   #3
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,290

A great idea for a community undertaking...
Old 02-14-2011, 07:44 AM   #4
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,290

Wow, didn't realize we were reposting an anonymous comment from a blog...
Old 02-14-2011, 07:55 AM   #5
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 353

I did wonder about that myself, but then decided: who has the time to fake official NCBI communication?

But anyhow, I've had independent confirmation from several sources that it is true.
Old 02-14-2011, 07:57 AM   #6
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 353

According to the email it will be around for some months yet ...

It's not yet clear what will happen to the already-submitted data.
Old 02-14-2011, 08:08 AM   #7
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 353

If you think about it rationally, there's no way you can have a single centralised resource for sequence data volumes that are doubling every year or so.
Old 02-14-2011, 08:25 AM   #8
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,290

Why not? I can't easily find how much data is in the SRA as of now...

It might be expensive to do from scratch, but it's the type of effort that, with the right pitch, someone like Google could be persuaded to host, for humanitarian reasons and the tax write-off.
Old 02-14-2011, 08:28 AM   #9
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 353

OK, it's *possible*. But it's going to be very expensive.

There are the networking costs and limits to think about, as well as storage.

Amazon might be a good choice to step in! A great way of attracting people to their cloud computing services.

Of course there needs to be some degree of replication so we are not dependent on a single organisation.
Old 02-14-2011, 08:30 AM   #10
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,290

Right. If all the data is IN Amazon, the worldwide bandwidth requirements are much lower if you're using Amazon's tools.
Old 02-14-2011, 08:32 AM   #11
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74

Another possibility to consider would be to share only certain variation files. But that depends on what defines a variant and how variants are characterized, and it is sort of confined to DNA applications. For expression-level data, perhaps some standardized format could come along as well.
Old 02-14-2011, 08:34 AM   #12
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 326

PacBio should fold it into their mega New Biology thingy.
Old 02-14-2011, 10:31 AM   #13
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 1,841

Quote:
Originally Posted by nickloman View Post
If you think about it rationally, there's no way you can have a single centralised resource for sequence data volumes that are doubling every year or so.
Doubling would be okay. That is close enough to Moore's law that investing the same amount of money in storage each year would suffice. The problem is that next-gen sequencing is expanding at hyper-Moore's-law rates. See:

http://www.economist.com/node/16349358

(Figure 1)



Around 2005-2006 you see an inflection point. Before that point, Moore's law roughly kept pace with sequencing cost. But since then (at least at the Broad) the semi-log slope tips downward for sequencing. That means you need to increase your expenditure on sequence storage exponentially if you plan to keep spending the same amount on sequencing. Alternatively, you can come up with specialized storage solutions, etc.

But, ultimately one of two things happens:

(1) Front-end computational cost de facto limits the drop in sequencing costs -- at which point sequencing costs lock at Moore's law rates.

(2) "Sequencing" reaches fruition -- reading DNA sequences costs no more than storing them. Congratulations your new storage medium is DNA.

--
Phillip

Old 02-14-2011, 11:18 AM   #14
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 522
Big BAMs and BitTorrents

Perhaps using a subset of the BitTorrent protocol might be an answer. I guess there would have to be a "you must have served up half as much as you've downloaded" rule or something to prevent taking without giving.
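
A toy sketch of how a tracker might enforce such a rule, in Python (the byte counters and the grace allowance are hypothetical choices, not any real tracker's policy):

Code:
# Toy share-ratio gate: a peer may keep downloading only if it has
# served at least half as much as it has fetched.
MIN_RATIO = 0.5
GRACE_BYTES = 50 * 1024**3  # let new peers bootstrap before the rule bites

def may_download(uploaded: int, downloaded: int) -> bool:
    """Return True if the peer is still allowed to request pieces."""
    if downloaded <= GRACE_BYTES:
        return True  # nothing to seed yet; don't lock out newcomers
    return uploaded / downloaded >= MIN_RATIO

# A peer that fetched 200 GiB but served only 60 GiB gets cut off:
print(may_download(60 * 1024**3, 200 * 1024**3))  # False (ratio 0.3)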

Security's a beach.

Old 02-14-2011, 11:22 AM   #15
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213

It's kind of funny how the Science articles about the data deluge basically precipitated this announcement. There's been a lot of blogosphere buzz about the data deluge, with more than a couple of posts mentioning, in passing, SRA and its attempt to handle it.

So far this is a rumor. It happens to be a very believable rumor given the funding issues and the ever-increasing need for storage, but let's not say it's canned before we're sure.

I think that while the intent of the SRA was good, the execution was not. Anyone who's dealt with it can tell you how much extra work it was to get data into their formats and upload it, not to mention the effort involved in retrieving data from it.

It's also just not a very sustainable thing for the government to sponsor this way. Transferring giant data sets over the net is time- and bandwidth-consuming, not to mention the upkeep of an ever-expanding storage space.

All that said, I don't like the whole "cloud" solution very much either. The major reason is the lack of control over privacy. At the very least, SRA did a good job protecting privacy (although their mechanism for doing so was quite clunky). Storing personal genetic data on a computer system owned by a third party simply does not sit well with me. It's kind of a funny idea to be "sharing" personal genetic data anyway, but at the very least, attempts to protect privacy need to be made and it's hard to envision how that's accomplished when the data itself is on a third party computer.

Perhaps a Biotorrent type solution is the best way to share this type of data. Something that can be reasonably secure while not consuming massive bandwidth on both ends.

I'm also not convinced about simply sharing variants. While it's true that it would save a lot of storage space, variants are not inherently comparable. Sequencing platform plays a role, but even more important are the significant improvements in alignment and variant detection over the past few years. Realign and re-call variants on the Watson genome and I bet you'll end up with vastly different numbers from what was reported, for example. But if you just have the variants, you can't realign and re-call, and therefore you can't really use that data for a true comparison.

I proposed in a recent blog article that someone should try to create a project where all the world's public sequence data is kept continually updated to modern standards. Would it be expensive? You betcha. But it would also be a very powerful resource, while avoiding the whole shoehorning problem that the SRA ran into with its formatting issues.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Old 02-14-2011, 03:47 PM   #16
GERALD
Member
 
Location: Houston

Join Date: Jun 2010
Posts: 17

Are we sure this is real? I hope that one or more private companies have the foresight to step up to the plate on this. The commercial potential would be enormous. They just have to be big enough to cover the huge overhead of hosting the data. The ad revenue alone would be incentive enough. Can we make a collective appeal to, say... Google?
Old 02-14-2011, 08:54 PM   #17
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 393

Quote:
Originally Posted by nickloman View Post
Where will you submit your data now?
The European nucleotide archive?
http://www.ebi.ac.uk/ena/about/page....ra_submissions
Old 02-16-2011, 12:00 AM   #18
mwatson
Member
 
Location: Roslin, UK

Join Date: Aug 2010
Posts: 11

Hmmm, I wonder how this sits with the following article though?

President Obama Proposes Budget Increases for NIH, CDC, NSF, and FDA

For me this is very worrying, as it represents a big change in the way biodata is managed. NCBI, EBI and DDBJ have *always* managed public biological data. That's what they do, and we love them for it. If the NCBI pulls out of it now, even if it is just the SRA (just? just the largest collection of data from one of the most exciting technologies on the planet right now...), it's a worrying development.
Old 02-16-2011, 12:39 AM   #19
mwatson
Member
 
Location: Roslin, UK

Join Date: Aug 2010
Posts: 11

Quote:
Originally Posted by Michael.James.Clark View Post
I'm also not convinced about simply sharing variants. While it's true that it would save a lot of storage space, variants are not inherently comparable. Sequencing platform plays a role, but even more important are the significant improvements in alignment and variant detection over the past few years. Realign and re-call variants on the Watson genome and I bet you'll end up with vastly different numbers from what was reported, for example. But if you just have the variants, you can't realign and re-call, and therefore you can't really use that data for a true comparison.
Isn't the proposal to store variants in such a way that the original read can be reconstructed?
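
That is the idea behind reference-based compression: keep only an alignment position plus the differences, and regenerate the read on demand. A toy illustration in Python (the reference string and the edit encoding are invented for the example):

Code:
# Toy reference-based storage: keep a read as (position, length,
# substitutions) against a shared reference; rebuild it on demand.
reference = "ACGTACGTACGTACGTACGT"  # stand-in for a real reference genome

def compress(read, pos):
    """Record where the read maps and how it differs from the reference."""
    subs = [(i, b) for i, b in enumerate(read) if b != reference[pos + i]]
    return pos, len(read), subs

def reconstruct(pos, length, subs):
    """Rebuild the original read from the reference plus the stored edits."""
    bases = list(reference[pos:pos + length])
    for i, b in subs:
        bases[i] = b
    return "".join(bases)

read = "ACGTTCGT"            # differs from the reference at one base
record = compress(read, 4)   # maps at reference position 4
assert reconstruct(*record) == read
print(record)                # (4, 8, [(4, 'T')])

The catch in practice is the per-base quality scores, which don't compress against a reference this way and tend to dominate the stored size.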
Old 02-16-2011, 04:25 AM   #20
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 175

I never liked the SRA. It was incredibly difficult to get data out of it - not to mention knowing what datasets you were getting!

Another thing - why this bloated SRF format?

Why aren't we just uploading BAM files?

They already come with the read quality scores, aligned and compressed. You can then load one into a viewer and easily see what the authors saw in their results.

And if you like, you can extract the sequences (BAM to FASTQ) and realign them yourself with your favorite aligner.
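
For instance, a minimal BAM-to-FASTQ sketch using pysam (assuming pysam is installed; file names are placeholders). Reverse-strand alignments store the reverse-complemented sequence, so the sketch flips them back to the original read:

Code:
import pysam

COMP = str.maketrans("ACGTN", "TGCAN")

with pysam.AlignmentFile("example.bam", "rb") as bam, \
        open("example.fastq", "w") as fq:
    for read in bam:
        if read.is_secondary or read.is_supplementary:
            continue  # emit each read once
        seq = read.query_sequence
        qual = "".join(chr(q + 33) for q in read.query_qualities)
        if read.is_reverse:  # undo the reverse complement stored in the BAM
            seq = seq.translate(COMP)[::-1]
            qual = qual[::-1]
        fq.write(f"@{read.query_name}\n{seq}\n+\n{qual}\n")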

Saving variants is a good idea - but not now when the methodology for variant detection is so volatile.