  • Uploading PacBio raw data to ENA SRA

    Surprisingly, I was not able to find anything on Google about the details of uploading raw PacBio sequence reads to the European Nucleotide Archive (ENA), the EMBL-EBI twin of the Sequence Read Archive (SRA).

    http://www.ebi.ac.uk/ena/submit/read...bio_hd5_format just says:

    PacBio format

    PacBio data submissions are supported in the platform specific native format.

    One run consists of *.bax.h5, *.bas.h5 and xml files. Please note that these files must not be tarred.
    In our case I have a zip file with dozens of files under assorted folders. Thankfully https://github.com/PacificBioscience...rvice-provider explains this. Based on their example, I've marked the files I think I need to upload (update - not correct, see later):

    Code:
    /path/to/secondary/storage/2420294/0011
    ├── Analysis_Results
    │   ├── [B]m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.bax.h5 --> ENA[/B]
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.log
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.subreads.fasta
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.subreads.fastq
    │   ├── [B]m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.bax.h5 --> ENA[/B]
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.log
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.subreads.fasta
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.subreads.fastq
    │   ├── [B]m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.bax.h5 --> ENA[/B]
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.log
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.subreads.fasta
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.subreads.fastq
    │   ├── [B]m140415_143853_42175_c100635972550000001823121909121417_s1_p0.bas.h5 --> ENA[/B]
    │   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.sts.csv
    │   └── [B]m140415_143853_42175_c100635972550000001823121909121417_s1_p0.sts.xml --> ENA[/B]
    ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.xfer.xml
    ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.xfer.xml
    ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.xfer.xml
    ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.mcd.h5
    └── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.metadata.xml
    However, when you click through the ENA Webin forms, finally get to the spreadsheet-like view for uploading the files, and pick PacBio, it says:

    PacBio HDF5

    One PacBio HDF5 file is submitted for each run.
    It really only seems to want a single file... which puzzled me. However, the tool tip says:

    Please choose one of the following manifest files present in your drop box. A manifest file ( *.all ) contains all files ( *bas.h5, *.bax.h5 and *.xml ) and their MD5 checksums associated with a single PacBio run. The format of the manifest file must correspond to the output of the md5sum command.

    If your file is not listed below, it was either not found in your drop box or its extension was not recognized.
    So, I think that means you can create a plain text "manifest" using the md5sum command line tool, e.g.:

    Code:
    $ cd /path/to/secondary/storage/2420294/0011/Analysis_Results
    $ md5sum *bas.h5 *.bax.h5 *.xml > manifest.all
    But that would miss the *.metadata.xml file, which sits one directory up and looks useful? (update - yes, they want that XML file in particular - see below). Could anyone who has done this help - or should I email the DataSubs team and report back here? Thanks!
    Last edited by maubp; 03-10-2016, 09:08 AM. Reason: Clarify given later information
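
Since the tooltip quoted above says the manifest must match the output of md5sum, one way to sanity-check a manifest before uploading is to feed it back to md5sum itself. This is only a minimal sketch, assuming GNU coreutils md5sum is installed; the function name check_manifest is hypothetical and not part of any ENA tooling:

```shell
#!/bin/sh
# Hypothetical helper (not an ENA tool): re-check a manifest locally.
# Usage: check_manifest DIR MANIFEST
#   DIR      directory holding the files listed in the manifest
#   MANIFEST manifest file name, relative to DIR
check_manifest() {
    dir=$1
    manifest=$2
    # md5sum -c re-reads every file listed in the manifest and compares its
    # checksum with the recorded value; --quiet suppresses the per-file "OK"
    # lines so only problems are printed. The exit status is non-zero if any
    # file is missing or does not match.
    ( cd "$dir" && md5sum -c --quiet "$manifest" )
}
```

For example, check_manifest /path/to/upload_dir manifest.all (the directory name here is a placeholder) should print nothing and return 0 when every listed file matches.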

  • #2
    You should submit the metadata.xml file because, as I remember, it is difficult (or impossible) to recreate, and that file is needed to import/analyze data in SMRTportal.

    The *.h5 files you submit become available as is under the "Download" tab so people can get at the raw data. At least that is how things work in SRA.
    Last edited by GenoMax; 03-09-2016, 11:26 AM. Reason: Clarification about exact file

    • #3
      When you say *.xml do you mean all of them (at both levels of the directory hierarchy)?

      • #4
        I meant to specifically say metadata.xml (details of the files are described here: https://github.com/PacificBioscience...rvice-provider)

        • #5
          Thanks - ENA are not clear, but I suspect you're right and they want the *.metadata.xml - and perhaps the *.sts.xml files too (summary statistics).

          • #6
            I've emailed the EBI DataSubs team, and will post back once I know the answer.

            • #7
              The DataSubs team replied that for each PacBio SMRT cell run they want three *.bax.h5 files, one *.bas.h5 file, and one *.metadata.xml file.

              i.e. something like this for the PacBio example above (using made-up checksum values):

              Code:
              $ cat run_1_manifest.all
              7b382592c46607ec0348bf969ed8b01f m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.bax.h5
              2b912a574ad5e264f781ca495b0b5908 m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.bax.h5
              6c7c66e4e2aa1e5516f7d7c16b0ef8b2 m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.bax.h5
              3f6067c02aa643eb5d609197defc3baa m140415_143853_42175_c100635972550000001823121909121417_s1_p0.bas.h5
              c12eafa8bf1cc3c1548c1625d9edad7c m140415_143853_42175_c100635972550000001823121909121417_s1_p0.metadata.xml
              I've asked if I can share the full email here.
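
Because the five files sit at two levels of the directory tree shown in the first post, a manifest like the one above can be put together by running md5sum once per directory. The following is a rough sketch under a few assumptions: GNU coreutils md5sum, the directory layout from the first post, and bare file names in the manifest as in the example. The function name make_manifest is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: build a manifest for one SMRT cell.
# Usage: make_manifest CELL_DIR OUT_FILE
#   CELL_DIR holds *.metadata.xml and an Analysis_Results/ subdirectory
#   OUT_FILE is the manifest to write, e.g. run_1_manifest.all
make_manifest() {
    cell_dir=$1
    out=$2
    {
        # The *.bax.h5 and *.bas.h5 files live in Analysis_Results/.
        # Running md5sum from inside the directory keeps the recorded
        # names free of any path prefix.
        ( cd "$cell_dir/Analysis_Results" && md5sum -- *.bax.h5 *.bas.h5 ) &&
        # The *.metadata.xml file lives one level up; *.sts.xml and
        # *.xfer.xml are deliberately left out.
        ( cd "$cell_dir" && md5sum -- *.metadata.xml )
    } > "$out"
}
```

With the example paths from this thread that would be make_manifest /path/to/secondary/storage/2420294/0011 run_1_manifest.all, giving three bax.h5 lines, one bas.h5 line, and one metadata.xml line.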

              Update:

              Jeena at the EBI Data Submissions team kindly allowed me to post her advice - note the screenshot shows the expected MD5-based manifest file on which I based the example above:

              Dear Peter,

              A Pac Bio run normally consists of 5 files. They are 3 bax.h5, 1 bas.h5, and the equally important metadata.xml file. If you use Webin you must create a manifest file as explained here:


              If you want to reference each file separately per run then please use the REST submission service:


              Here is a template for a pac bio run.


              Please let us know if you require more help. My colleague Marc is currently away but will be back in the office tomorrow and will be able to provide further help if needed.

              Kind regards,
              Jeena
              Last edited by maubp; 03-11-2016, 01:37 AM. Reason: Adding reply from ENA

              • #8
                That makes sense.

                Are you also submitting fastq/fasta files that went into your analysis (since they would be generated after some filtering etc using SMRTportal or command line tools)?

                Do you know how ENA makes original files available for PacBio? On the page where they have fastq files?

                • #9
                  I'm quite willing to, but unsure how they'd want that - I could upload the processed FASTQ as another run?

                  • #10
                    Originally posted by maubp:
                    I'm quite willing to, but unsure how they'd want that - I could upload the processed FASTQ as another run?

                    I wonder if NCBI SRA handles things the other way around. Submit "fastq" as the main record and attach the original *.h5 files (which become available via the "Download" tab). As is, the *.h5 files are not immediately useful unless they are going to be re-processed by the person downloading them (not everyone would want to or have the means to do that).

                    • #11
                      In this case, from two SMRT cells, I have one FASTQ file of filtered subreads used in the analysis, but I can easily split it up into one FASTQ file per run based on the read names.
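
Splitting by read name could look something like the sketch below. It assumes each FASTQ record is exactly four lines and that read names follow the usual PacBio pattern of movie name, hole number, and coordinates separated by "/", so the movie name (the m140415_..._s1_p0 part) is everything before the first "/". The function name and the ".filtered_subreads.fastq" output suffix are invented for illustration:

```shell
#!/bin/sh
# Hypothetical helper: split a combined FASTQ into one file per movie (run).
# Usage: split_fastq_by_movie COMBINED_FASTQ OUT_DIR
split_fastq_by_movie() {
    fastq=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    awk -v out_dir="$out_dir" '
        # Lines 1, 5, 9, ... are record headers such as
        # @<movie>/<hole>/<start>_<end>; strip the leading "@" and take
        # the part before the first "/" as the movie name.
        NR % 4 == 1 {
            split(substr($1, 2), parts, "/")
            out = out_dir "/" parts[1] ".filtered_subreads.fastq"
        }
        # Write all four lines of the record to the file chosen above.
        { print > out }
    ' "$fastq"
}
```

Called as split_fastq_by_movie filtered_subreads.fastq per_run (both names are placeholders), this would leave one FASTQ file per SMRT cell in the per_run directory.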

                      • #12
                        Perhaps this is another question for ENA datasub team.

                        Having two separate records (one for fastq and the other for *.h5 files) may be confusing. Having both in one record makes more sense, but it sounds like there is no direct way of doing that?

                        Edit: Unless ENA SRA is going to convert the *.h5 files and make fastq files from them. Again they would have to confirm that.
                        Last edited by GenoMax; 03-10-2016, 09:55 AM.

                        • #13
                          Reply from the ENA DataSubs team: Please submit the fastq or the native package but not both.

                          It looks like the raw data for our first SMRT cell has uploaded OK.

                          • #14
                            You can make the SMRTportal (or command line) settings used to generate the filtered fastq files available in the methods/supplemental materials.
