SEQanswers

Old 03-09-2016, 08:53 AM   #1
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Uploading PacBio raw data to ENA SRA

Surprisingly, I was not able to find anything on Google about the details of uploading raw PacBio sequence reads to the European Nucleotide Archive (ENA), the EMBL-EBI twin of the NCBI Sequence Read Archive (SRA).

http://www.ebi.ac.uk/ena/submit/read...bio_hd5_format just says:

Quote:
PacBio format

PacBio data submissions are supported in the platform specific native format.

One run consists of *.bax.h5, *.bas.h5 and xml files. Please note that these files must not be tarred.
In our case I have a zip file with dozens of files under assorted folders. Thankfully https://github.com/PacificBioscience...rvice-provider explains this. Based on their example, I've marked the files I think I need to upload (update - not correct, see later):

Code:
/path/to/secondary/storage/2420294/0011
├── Analysis_Results
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.bax.h5 --> ENA
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.bax.h5 --> ENA
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.bax.h5 --> ENA
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.bas.h5 --> ENA
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.sts.csv
│   └── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.sts.xml --> ENA
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.mcd.h5
└── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.metadata.xml
However, when you click through the ENA Webin forms, finally reach the spreadsheet-like view for uploading the files, and pick PacBio, it says:

Quote:
PacBio HDF5

One PacBio HDF5 file is submitted for each run.
It really only seems to want a single file... which puzzled me. However, the tool tip says:

Quote:
Please choose one of the following manifest files present in your drop box. A manifest file ( *.all ) contains all files ( *bas.h5, *.bax.h5 and *.xml ) and their MD5 checksums associated with a single PacBio run. The format of the manifest file must correspond to the output of the md5sum command.

If your file is not listed below, it was either not found in your drop box or its extension was not recognized.
So, I think that means you can create a plain text "manifest" using the md5sum command line tool, e.g.:

Code:
$ cd /path/to/secondary/storage/2420294/0011/Analysis_Results
$ md5sum *bas.h5 *.bax.h5 *.xml > manifest.all
But that would miss out the *.metadata.xml file, which looks useful (update - yes, they want that XML file in particular - see below). Could anyone who has done this help, or should I email the DataSubs team and report back here? Thanks!

Last edited by maubp; 03-10-2016 at 08:08 AM. Reason: Clarify given later information
Old 03-09-2016, 09:03 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

You should submit the metadata.xml file because, as I remember, it is difficult (or impossible) to recreate, and that file is needed to import/analyze data in SMRTportal.

The *.h5 files you submit become available as is under the "Download" tab so people can get at the raw data. At least that is how things work in SRA.

Last edited by GenoMax; 03-09-2016 at 10:26 AM. Reason: Clarification about exact file
Old 03-09-2016, 09:32 AM   #3
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

When you say *.xml do you mean all of them (at both levels of the directory hierarchy)?
Old 03-09-2016, 09:39 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

I meant to specifically say metadata.xml (details of the files are described here: https://github.com/PacificBioscience...rvice-provider)
Old 03-09-2016, 11:47 AM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

Thanks - ENA are not clear, but I suspect you're right and they want the *.metadata.xml file - and perhaps the *.sts.xml file too (summary statistics).
Old 03-10-2016, 12:20 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

I've emailed the EBI DataSubs team, and will post back once I know the answer.
Old 03-10-2016, 08:06 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

The DataSubs team replied that for each PacBio SMRT cell run they want three *.bax.h5 files, one *.bas.h5 file, and one *.metadata.xml file.

i.e. something like this for the PacBio example above (using made-up checksum values):

Code:
$ cat run_1_manifest.all
7b382592c46607ec0348bf969ed8b01f  m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.bax.h5
2b912a574ad5e264f781ca495b0b5908  m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.bax.h5
6c7c66e4e2aa1e5516f7d7c16b0ef8b2  m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.bax.h5
3f6067c02aa643eb5d609197defc3baa  m140415_143853_42175_c100635972550000001823121909121417_s1_p0.bas.h5
c12eafa8bf1cc3c1548c1625d9edad7c  m140415_143853_42175_c100635972550000001823121909121417_s1_p0.metadata.xml
I've asked if I can share the full email here.
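
For anyone who would rather script this than run md5sum by hand, here is a minimal Python sketch of building such a manifest. It is only an illustration, not anything ENA supplies: the function names are made up, it assumes the folder layout shown earlier (*.bax.h5 and *.bas.h5 under Analysis_Results, *.metadata.xml one level up), and it assumes the manifest should list bare file names because the files sit side by side in the upload area. Each line is the checksum, two spaces, then the name, which is what GNU md5sum prints in text mode.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_run_files(run_dir):
    """Collect the bax.h5, bas.h5 and metadata.xml files of one SMRT cell.

    Assumption: *.bax.h5 and *.bas.h5 live under Analysis_Results and
    *.metadata.xml sits one level up, as in the directory tree above.
    """
    run_dir = Path(run_dir)
    results = run_dir / "Analysis_Results"
    files = sorted(results.glob("*.bax.h5"))
    files += sorted(results.glob("*.bas.h5"))
    files += sorted(run_dir.glob("*.metadata.xml"))
    return files


def write_manifest(files, manifest_path):
    """Write one 'checksum  filename' line per file, md5sum style."""
    with open(manifest_path, "w") as out:
        for path in files:
            # Bare file name only: assumes a flat upload area (hypothetical).
            out.write(f"{md5_of_file(path)}  {Path(path).name}\n")
```

Something like write_manifest(find_run_files("/path/to/secondary/storage/2420294/0011"), "run_1_manifest.all") would then give a file in the shape shown above. Note that the *.sts.xml and *.xfer.xml files are deliberately not picked up.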

Update:

Jeena at the EBI Data Submissions team kindly allowed me to post her advice - note the screenshot shows the expected MD5-based manifest file, on which I based the example above:

Quote:
Dear Peter,

A Pac Bio run normally consists of 5 files. They are 3 bax.h5, 1 bas.h5, and the equally important metadata.xml file. If you use Webin you must create a manifest file as explained here:
http://www.ebi.ac.uk/~mrosello/FAQs/...ac_bio_run.png

If you want to reference each file separately per run then please use the REST submission service:
http://www.ebi.ac.uk/ena/submit/programmatic-submission

Here is a template for a pac bio run.
http://www.ebi.ac.uk/~mrosello/xml_t...ac_bio/run.xml

Please let us know if you require more help. My colleague Marc is currently away but will be back in the office tomorrow and will be able to provide further help if needed.

Kind regards,
Jeena
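
Given that description of a run (three bax.h5, one bas.h5, one metadata.xml), a quick sanity check on a manifest before uploading might look like the sketch below. This is an assumption-laden illustration, not an ENA tool: the function names and the EXPECTED table are invented here, the parser is deliberately lenient about the whitespace between checksum and file name, and the "same movie" test simply compares the file name prefixes.

```python
import re
from collections import Counter

# Suffixes named in the ENA reply, with the count expected per SMRT cell.
EXPECTED = {".bax.h5": 3, ".bas.h5": 1, ".metadata.xml": 1}


def parse_manifest(text):
    """Parse md5sum-style lines into (checksum, filename) pairs."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # 32 hex digits, whitespace, optional '*' (binary mode), file name.
        match = re.fullmatch(r"([0-9a-f]{32})\s+\*?(.+)", line)
        if match is None:
            raise ValueError(f"Not an md5sum-style line: {line!r}")
        entries.append((match.group(1), match.group(2)))
    return entries


def check_run(entries):
    """Return a list of problems; an empty list means the set looks complete."""
    problems = []
    counts = Counter()
    movies = set()
    for _, name in entries:
        for suffix in EXPECTED:
            if name.endswith(suffix):
                counts[suffix] += 1
                stem = name[: -len(suffix)]
                if suffix == ".bax.h5":
                    # Drop the .1/.2/.3 part index of the bax files.
                    stem = stem.rsplit(".", 1)[0]
                movies.add(stem)
                break
        else:
            problems.append(f"unexpected file: {name}")
    for suffix, wanted in EXPECTED.items():
        if counts[suffix] != wanted:
            problems.append(f"expected {wanted} *{suffix}, found {counts[suffix]}")
    if len(movies) > 1:
        problems.append(f"files from more than one movie: {sorted(movies)}")
    return problems
```

Running check_run(parse_manifest(open("run_1_manifest.all").read())) on the example manifest above should return an empty list. It only checks the file names, not that the checksums match the files on disk.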

Last edited by maubp; 03-11-2016 at 12:37 AM. Reason: Adding reply from ENA
Old 03-10-2016, 08:11 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

That makes sense.

Are you also submitting fastq/fasta files that went into your analysis (since they would be generated after some filtering etc using SMRTportal or command line tools)?

Do you know how ENA makes original files available for PacBio? On the page where they have fastq files?
Old 03-10-2016, 08:15 AM   #9
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

I'm quite willing to, but unsure how they'd want that - I could upload the processed FASTQ as another run?
Old 03-10-2016, 08:18 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

Quote:
Originally Posted by maubp
I'm quite willing to, but unsure how they'd want that - I could upload the processed FASTQ as another run?
I wonder if NCBI SRA handles things the other way around: submit "fastq" as the main record and attach the original *.h5 files (which become available via the "Download" tab). As is, the *.h5 files are not immediately useful unless they are going to be re-processed by the person downloading them (not everyone would want to, or have the means to, do that).
Old 03-10-2016, 08:24 AM   #11
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

In this case from two SMRT cells I have one FASTQ file of filtered subreads used in the analysis, but I can easily split it up into one FASTQ file per run based on the read names.
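
A rough sketch of that split in Python, for illustration only. It assumes each read name starts with the movie name followed by a slash (e.g. m140415_143853_42175_c100635972550000001823121909121417_s1_p0/hole/start_end) and that the FASTQ uses strict four-line records with no wrapped sequence lines. The function names and the output naming scheme are made up.

```python
def movie_name(read_name):
    """Return the movie (SMRT cell) part of a PacBio read name."""
    return read_name.split("/", 1)[0]


def iter_records(lines):
    """Yield (movie, record_text) for each four-line FASTQ record."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header = record[0]
            if not header.startswith("@"):
                raise ValueError(f"Bad FASTQ header: {header!r}")
            yield movie_name(header[1:]), "\n".join(record) + "\n"
            record = []
    if record:
        raise ValueError("Truncated FASTQ record at end of input")


def write_split(in_path, out_prefix):
    """Write one <out_prefix><movie>.fastq per movie; return the movie names."""
    handles = {}
    try:
        with open(in_path) as handle:
            for movie, record in iter_records(handle):
                if movie not in handles:
                    # One output file per movie, opened on first use.
                    handles[movie] = open(f"{out_prefix}{movie}.fastq", "w")
                handles[movie].write(record)
    finally:
        for out in handles.values():
            out.close()
    return sorted(handles)
```

Records are streamed straight to the per-movie files, so the whole FASTQ never has to fit in memory. With two SMRT cells this would leave two output files, one per run.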
Old 03-10-2016, 08:29 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

Perhaps this is another question for the ENA DataSubs team.

Having two separate records (one for fastq and the other for the *.h5 files) may be confusing. Having both in one record makes more sense, but it sounds like there is no direct way of doing that?

Edit: Unless ENA SRA is going to convert the *.h5 files and make fastq files from them. Again, they would have to confirm that.

Last edited by GenoMax; 03-10-2016 at 08:55 AM.
Old 03-11-2016, 05:34 AM   #13
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

Reply from the ENA DataSubs team: Please submit the fastq or the native package but not both.

It looks like the raw data for our first SMRT cell has uploaded OK.
Old 03-11-2016, 06:07 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

You can make the SMRTportal (or command line) settings used to generate the filtered fastq files available in the methods/supplemental materials.

Tags
ena, pacbio, sra, upload, webin

All times are GMT -8.