  • Downloading 'RunInfo Table' from SRA Run Selector

    Hello,

    I would like to download the metadata for a given BioProject from the SRA. I can get exactly what I need by clicking the 'RunInfo Table' download button in the SRA Run Selector web interface (example). It should be relatively straightforward to do the same from the command line using "wget".

    Clicking the 'RunInfo Table' button loads the following address, which is a stable link for downloading the information:



    BUT, I have no idea where that hash information is coming from. Can anyone help there?

    Alternatively, I've tried a series of efetch commands, but none of them gives me a '.tsv' (a '.csv' would also be fine) of the complete BioProject metadata.

    This command provides only the sequencing run information:
    wget -O PRJNA308986.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=PRJNA308986'

    This command provides the full BioProject information I am after, but in XML format, which I haven't been able to parse.

    wget -O PRJNA496337.xml 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=bioproject&term=PRJNA496337'

    Thanks in advance,
    Roli

  • #2
    In general, for downloading NCBI data from the Unix command line, I recommend using Entrez Direct.

    Specifically, to download the runinfo table, you can use the following command:
    Code:
    esearch -db sra -query 'PRJNA308986' | efetch -format runinfo
    This produces a comma-separated table with the following fields (listed here with column numbers and example values from one run):
    Code:
                      Run [  1]: SRR3108728
              ReleaseDate [  2]: 2017-02-16 00:00:00
                 LoadDate [  3]: 2016-01-21 03:15:18
                    spots [  4]: 98100
                    bases [  5]: 49246200
         spots_with_mates [  6]: 98100
                avgLength [  7]: 502
                  size_MB [  8]: 28
             AssemblyName [  9]: 
            download_path [ 10]: https://sra-download.ncbi.nlm.nih.gov/traces/sra37/SRR/003035/SRR3108728
               Experiment [ 11]: SRX1537041
              LibraryName [ 12]: mdbk110
          LibraryStrategy [ 13]: AMPLICON
         LibrarySelection [ 14]: PCR
            LibrarySource [ 15]: METAGENOMIC
            LibraryLayout [ 16]: PAIRED
               InsertSize [ 17]: 0
                InsertDev [ 18]: 0
                 Platform [ 19]: ILLUMINA
                    Model [ 20]: Illumina MiSeq
                 SRAStudy [ 21]: SRP068618
               BioProject [ 22]: PRJNA308986
          Study_Pubmed_id [ 23]: 
                ProjectID [ 24]: 308986
                   Sample [ 25]: SRS1253892
                BioSample [ 26]: SAMN04419133
               SampleType [ 27]: simple
                    TaxID [ 28]: 410658
           ScientificName [ 29]: soil metagenome
               SampleName [ 30]: mdbk110
             g1k_pop_code [ 31]: 
                   source [ 32]: 
       g1k_analysis_group [ 33]: 
               Subject_ID [ 34]: 
                      Sex [ 35]: 
                  Disease [ 36]: 
                    Tumor [ 37]: no
         Affection_Status [ 38]: 
             Analyte_Type [ 39]: 
        Histological_Type [ 40]: 
                Body_Site [ 41]: 
               CenterName [ 42]: UNIVERSITY OF MINNESOTA
               Submission [ 43]: SRA336468
    dbgap_study_accession [ 44]: 
                  Consent [ 45]: public
                  RunHash [ 46]: 4B63AAF2295927A2EAEB798FCF9FC7DA
                 ReadHash [ 47]: FB1226CB8B5FEBC85B053718D4C1BBFA
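A listing like the one above (field name, column number, example value) can be generated from the CSV itself, which is handy for finding the column numbers to pass to `cut`. This is a minimal sketch, assuming no field contains an embedded comma; `runinfo_fields` is a hypothetical helper name, not part of Entrez Direct.

```shell
# runinfo_fields [FILE]
# Print every CSV header name with its 1-based column number and the value
# from the first data row, one field per line. Reads stdin if no FILE given.
# Assumption: fields never contain quoted, embedded commas.
runinfo_fields() {
    awk '
        NR == 1 { n = split($0, name, ","); next }   # header row
        NR == 2 {                                    # first data row
            split($0, value, ",")
            for (i = 1; i <= n; i++)
                printf "%25s [%3d]: %s\n", name[i], i, value[i]
            exit
        }
    ' "$@"
}

# Possible usage (requires Entrez Direct and network access):
#   esearch -db sra -query 'PRJNA308986' | efetch -format runinfo | runinfo_fields
```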
    You can download the same table in XML format by making a small change as follows:
    Code:
    esearch -db sra -query 'PRJNA308986' | efetch -format runinfo -mode xml
    You can then parse this XML with the "xtract" command that comes with the Entrez Direct tools to extract only the columns of interest to you.
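If only a few columns are needed, the CSV output can also be filtered directly by header name, without going through XML. The following is a rough sketch under the assumption that no field contains an embedded comma; `select_columns` is a hypothetical helper name, not an Entrez Direct command.

```shell
# select_columns NAME[,NAME...] [FILE]
# Print the named CSV columns, tab-separated, in the order requested.
# The header row is kept. Reads stdin if no FILE is given.
# Assumption: fields never contain quoted, embedded commas.
select_columns() {
    local cols=$1
    shift
    awk -F',' -v cols="$cols" '
        BEGIN { OFS = "\t"; n = split(cols, want, ",") }
        NR == 1 {
            # map each header name to its column position
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++)
                if (!(want[j] in pos)) {
                    print "unknown column: " want[j] > "/dev/stderr"
                    bad = 1
                }
            if (bad) exit 1
        }
        $0 == "" { next }                            # skip blank lines
        {
            line = $(pos[want[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[want[j]])
            print line
        }
    ' "$@"
}

# Possible usage (requires Entrez Direct and network access):
#   esearch -db sra -query 'PRJNA308986' | efetch -format runinfo |
#       select_columns Run,BioSample,LibraryLayout
```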

      • #4
        Using wget to retrieve SRA RunInfo and AccList

        Here's an example of using `wget` to retrieve the SRA RunInfo and AccList from NCBI Sequence Read Archive.

        Code:
        # wget equivalent of:
        #   esearch -db sra -query "${study_id}" | efetch -format runinfo

        study_id=PRJNA308986
        db=sra

        # base URL of the NCBI E-utilities
        base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

        # esearch for the project; usehistory=y keeps the result on the
        # history server and returns the WebEnv/QueryKey that efetch needs
        data=$(wget -qO- "${base}esearch.fcgi?db=${db}&term=${study_id}&usehistory=y")
        web=$(grep -oPm1 "(?<=<WebEnv>)[^<]+" <<< "${data}")
        key=$(grep -oPm1 "(?<=<QueryKey>)[^<]+" <<< "${data}")

        # efetch SRA RunInfo
        wget -qO "SraRunInfo-${study_id}.csv" "${base}efetch.fcgi?db=${db}&query_key=${key}&WebEnv=${web}&retmode=text&rettype=runinfo"

        # efetch SRA AccList
        wget -qO "SraAccList-${study_id}.txt" "${base}efetch.fcgi?db=${db}&query_key=${key}&WebEnv=${web}&retmode=text&rettype=acclist"
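
The `grep -oP` calls above rely on Perl-compatible regular expressions, a GNU grep extension that BSD/macOS grep does not provide. As a more portable alternative, the same two values could be pulled out with `sed`. This is a minimal sketch; `xml_field` is a hypothetical helper name, and it assumes the opening and closing tag sit on the same line of the esearch response.

```shell
# xml_field TAG
# Print the text content of the first <TAG>...</TAG> element found on stdin.
# Assumption: the element is not split across lines and has no attributes.
xml_field() {
    sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" | head -n 1
}

# Possible replacement for the two grep lines in the script above:
#   web=$(printf '%s\n' "${data}" | xml_field WebEnv)
#   key=$(printf '%s\n' "${data}" | xml_field QueryKey)
```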
