SEQanswers

Old 06-05-2014, 10:05 AM   #1
JamieWizard
Member
 
Location: London

Join Date: Sep 2013
Posts: 10
Default SRA top level studies counts, why is 2013 so low?

Hi all,

I've been looking at the Sequence Read Archive (SRA) short-read metadata using the SQLite data extracted by the Bioconductor SRAdb package, available from
http://www.bioconductor.org/packages...tml/SRAdb.html

One thing is quite puzzling: out of all the top-level studies, why are there so few for 2013?

SQL queries against the Bioconductor data extracted from the SRA as of December 2013 show the following top-level study counts:

2005|64
2006|38
2007|94
2008|269
2009|893
2010|2631
2011|4077
2012|5208
2013|724

One can see the increasing trend up to 2012 and then the fall-off in 2013. Does anyone have any idea why this might be?
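For anyone wanting to reproduce this kind of per-year count, here is a minimal sketch. It runs against a toy in-memory table rather than the real SRAdb file, and the table name `study_dates` and the columns `study_accession` and `submission_date` are assumptions made for illustration, not the confirmed SRAdb schema. The accessions and dates are invented.

```python
import sqlite3

# Toy stand-in for the SRAdb SQLite file. Table and column names are
# hypothetical; adjust them to whatever the real schema uses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE study_dates (study_accession TEXT, submission_date TEXT)")
conn.executemany(
    "INSERT INTO study_dates VALUES (?, ?)",
    [
        ("SRP000001", "2011-03-14"),
        ("SRP000002", "2012-07-01"),
        ("SRP000003", "2012-11-30"),
        ("SRP000004", "2013-02-09"),
    ],
)


def studies_per_year(connection):
    """Return [(year, count), ...] for rows that carry a submission date."""
    query = """
        SELECT substr(submission_date, 1, 4) AS year,
               COUNT(DISTINCT study_accession) AS n
        FROM study_dates
        WHERE submission_date IS NOT NULL
        GROUP BY year
        ORDER BY year
    """
    return connection.execute(query).fetchall()


# Print in the same year|count layout used in the post.
for year, n in studies_per_year(conn):
    print(f"{year}|{n}")
```

Note that a query of this shape silently ignores any row without a date, so the yearly totals only cover dated records.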

Best regards,
Jamie
Old 06-05-2014, 10:17 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,748
Default

Following is speculation.

For a while many people (including me) were under the impression that the SRA was closing down due to lack of funding. That is NOT the case. Apparently various NIH Institute Directors got together and decided that the SRA was important and had to be kept going.

This fact has not been widely publicized (unlike the original closure announcement, which was). Perhaps this is reflected in the numbers you are seeing.
Old 06-05-2014, 10:46 AM   #3
Bukowski
Senior Member
 
Location: Aberdeen, Scotland

Join Date: Jan 2010
Posts: 350
Default

The following is also speculation. The drive to deposit data in publicly available archives isn't nearly as strong for NGS data as it was for e.g. microarrays. And there probably isn't as much of an appetite to be the world's dumping ground for terabytes of poorly curated data.

See also: privacy concerns of having people's genetic data splurged all over the internet.
Old 06-06-2014, 07:32 PM   #4
ShaunMahony
Member
 
Location: University Park, PA

Join Date: Apr 2008
Posts: 27
Default

Hi Jamie,

I had a look at "SRP*" entries in the following metadata file:
ftp://ftp-trace.ncbi.nlm.nih.gov/sra...Accessions.tab
Maybe this isn't the same as what you called "top-level" studies, but the SRA project entries should give an idea of distinct project uploads.

Broken down by "received" date, the counts I saw are as follows:
2008| 378
2009| 1129
2010| 3618
2011| 4872
2012| 7697
2013| 17142
2014| 8124

So, an acceleration in submissions in 2013 rather than a drop-off!
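A count like the one above can be sketched roughly as follows. This assumes the accessions file is tab-delimited with a header row containing `Accession` and `Received` columns, and that missing dates appear as `-`; both are assumptions for illustration, and the sample rows are invented rather than taken from the real file.

```python
import csv
import io
from collections import Counter

# Invented excerpt in the assumed layout of the accessions file.
SAMPLE = (
    "Accession\tReceived\tType\n"
    "SRP000101\t2012-05-02T10:11:12Z\tSTUDY\n"
    "SRP000102\t2013-01-20T08:00:00Z\tSTUDY\n"
    "SRP000103\t2013-09-03T16:45:00Z\tSTUDY\n"
    "SRR000900\t2013-09-03T16:45:00Z\tRUN\n"
    "SRP000104\t-\tSTUDY\n"
)


def srp_counts_by_year(handle):
    """Count SRP* accessions per year of the Received date."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        if not row["Accession"].startswith("SRP"):
            continue  # skip runs, experiments, samples, submissions
        year = row["Received"][:4]
        if year.isdigit():  # ignore placeholder or missing dates
            counts[year] += 1
    return dict(sorted(counts.items()))


for year, n in srp_counts_by_year(io.StringIO(SAMPLE)).items():
    print(f"{year}| {n}")
```

For the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle, which keeps memory use flat because rows are read one at a time.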

More speculation, but there could be a couple of things happening here:
- Are you sure that you had a recent update of the SQLite data dump from the SRAdb package?
- Do you know how often that SQLite file is updated by the SRAdb folks? Maybe you could look for the exact date of the last 2013 entry in your copy of the SQLite file?
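The second check could look something like this sketch. It uses a toy in-memory table; the table name `submission` and the column `submission_date` are assumed here and may not match the real SRAdb file, and the rows are invented. It relies on dates being stored as ISO-style text, so that string ordering matches date ordering.

```python
import sqlite3

# Toy stand-in for the SRAdb SQLite file; names and rows are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE submission (submission_accession TEXT, submission_date TEXT)")
db.executemany(
    "INSERT INTO submission VALUES (?, ?)",
    [
        ("SRA000001", "2012-12-30"),
        ("SRA000002", "2013-04-18"),
        ("SRA000003", "2013-11-02"),
        ("SRA000004", None),  # a record with no date at all
    ],
)


def latest_date_in_year(connection, year):
    """Return the latest submission_date within the given year, or None."""
    row = connection.execute(
        "SELECT MAX(submission_date) FROM submission WHERE submission_date LIKE ?",
        (f"{year}-%",),
    ).fetchone()
    return row[0]


print(latest_date_in_year(db, 2013))
```

If the latest date falls well before the end of the year, the copy of the file is probably stale; if it runs right up to the dump date, the low count needs another explanation.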


Finally, I disagree with Bukowski's rationale above... there is just as much of a drive to deposit NGS data as there was for microarrays; you can't publish in most journals without submitting data to the SRA. And poorly curated this type of data may sometimes be, but it's often useful.
Old 06-07-2014, 01:30 PM   #5
Bukowski
Senior Member
 
Location: Aberdeen, Scotland

Join Date: Jan 2010
Posts: 350
Default

Quote:
Originally Posted by ShaunMahony View Post
Finally, I disagree with Bukowski's rationale above... there is just as much of a drive to deposit NGS data as there was for microarrays; you can't publish in most journals without submitting data to the SRA. And poorly curated this type of data may sometimes be, but it's often useful.
I was being a little bit facetious, I must admit, but I do wonder how that rate of deposition compares with the rate at which machines are being deployed, and how far the two diverge.

I don't work in academia, but I did when microarrays were at their peak. Pretty much every paper I'm credited on with arrays is in GEO. None of my NGS papers are in the SRA - but I'm in clinical genomics, so it might be a reflection on the privacy issues - but it is most certainly possible to publish, in high-quality journals, without releasing NGS data.
Old 06-07-2014, 05:15 PM   #6
ShaunMahony
Member
 
Location: University Park, PA

Join Date: Apr 2008
Posts: 27
Default

Ah, I guess privacy issues do complicate things for clinical sequencing data. But are you telling me that you don't even submit variations to dbSNP or dbGaP? I guess I should have clarified my statement to say it's not *supposed* to be possible to publish without submitting any described sequence data to public repositories. I know of very few journals that don't explicitly stipulate exactly this in their author guides. Whether they always enforce the rule is another story.
Old 06-08-2014, 09:17 AM   #7
JamieWizard
Member
 
Location: London

Join Date: Sep 2013
Posts: 10
Default SRA study numbers from 2013 - Bioconductor response

Hi everyone,

Thank you all for your thoughts. My initial thought, based on another query, was that a large proportion of the undated records could potentially be from 2013 (in light of the increasing production of NGS data in spite of the SRA's potential closure a while back).

I posted the question to the Bioconductor forum and have just received this significant reply, which I am sharing below:


MESSAGE BELOW FORWARDED FROM BIOCONDUCTOR FORUM:

Hi all,

Regarding missing studies by submission_date for 2013 and 2014 in the
SRAdb SQLite database, I did some investigation and found the reason.
The metadata in SRAdb is mainly parsed from the XML files of the
SRA submissions, and that holds for the submission table too. But I see
that quite a few submission XML files don't have a submission date, e.g.

ftp://ftp-trace.ncbi.nih.gov/sra/Sub...157/SRA157949/

SRA157949.experiment.xml
SRA157949.submission.xml

So it seems all the study and submission records are there, but some
submission records just don't have a submission date. I am looking into the
possibility of adding dates for those records.

Jamie, thanks for the finding and I will keep you updated.

Jack
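To illustrate the kind of check Jack describes, here is a small sketch that tests whether a submission XML document carries a date. The element name `SUBMISSION`, the attribute name `submission_date`, and the `SUBMISSION_SET` wrapper are assumptions here (the attribute name is borrowed from the column name in the message above), and the XML snippets are invented, not copies of real SRA files.

```python
import xml.etree.ElementTree as ET

# Invented examples: one submission with a date attribute, one without,
# and one nested inside a hypothetical wrapper element.
DATED = '<SUBMISSION accession="SRA000001" submission_date="2013-05-01T12:00:00Z"/>'
UNDATED = '<SUBMISSION accession="SRA000002"/>'
WRAPPED = '<SUBMISSION_SET><SUBMISSION accession="SRA000003"/></SUBMISSION_SET>'


def submission_date(xml_text):
    """Return the submission_date attribute, or None if it is absent."""
    root = ET.fromstring(xml_text)
    # The element may be the document root or sit under a wrapper element.
    node = root if root.tag == "SUBMISSION" else root.find(".//SUBMISSION")
    if node is None:
        return None
    return node.get("submission_date")


for label, text in [("dated", DATED), ("undated", UNDATED), ("wrapped", WRAPPED)]:
    print(label, submission_date(text))
```

A parser that copies this attribute straight into a date column would store nothing for the undated files, which is consistent with records being present but missing from per-year counts.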


Thanks again,
Jamie
Old 06-09-2014, 12:16 AM   #8
Bukowski
Senior Member
 
Location: Aberdeen, Scotland

Join Date: Jan 2010
Posts: 350
Default

Quote:
Originally Posted by ShaunMahony View Post
Ah, I guess privacy issues do complicate things for clinical sequencing data. But are you telling me that you don't even submit variations to dbSNP or dbGaP?
That's an interesting point. I feel that is largely up to my collaborators, as they are the people who are effectively responsible for disseminating the data. I am sure that it ends up in HGMD, but I don't know about dbSNP; really, the variants should be in ClinVar.
Tags
bioconductor, sql, sra, studies, year
