  • Searching, parsing and working with large GEO queries

    Ok, so I'm about to start the largest project I've undertaken, in terms of data handling and analysis. In short, the project is about finding public datasets in the Gene Expression Omnibus (GEO) that pass a set of filters, downloading all the raw data (i.e. FASTQ files) and performing a set of analyses on the data, while maintaining metadata-to-result connectivity. I've done some googling... but I'm at a loss as to how to go about it.

    What I'm looking for is RNA-seq data for human cell lines, but only for a specific set of 1000 cell lines. I want to find the FASTQ files + metadata for these datasets and perform analyses on them. There are several steps; I have ideas for some of them, and no clue how to proceed with others.

    First, the GEO query itself. I can easily search for human RNA-seq data (("expression profiling by high throughput sequencing"[DataSet Type] OR "non coding rna profiling by high throughput sequencing"[DataSet Type]) AND Homo sapiens[Organism]; 3298 GEO series), but I'm not sure how to search for cell lines only. Just adding "cell line" to [Any Field] seems too simple, and might miss some GEO series. There is a field inside the GEO SOFT files called "Sample_characteristics_ch1 = <value>", which can be set to "cell line: <cell line name>". (No, I'm not sure exactly what the "_ch1" part means...) I was thinking of downloading all the SOFT files for the series above and then filtering for sample characteristics that include "cell line". The first step would then be:
    • 1) Get identifiers for all the series in the query, download all the SOFT files and filter them to include cell lines
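
    The SOFT filtering in step 1 could be sketched roughly as follows. This is a minimal illustration, not a full SOFT parser: it assumes each sample block starts with a "^SAMPLE = GSM..." line, that the cell line is annotated as "cell line: <name>" in a Sample_characteristics_ch1 line, and that the SRX accession appears in a "!Sample_relation = SRA: ..." line. The function name and the example accessions are made up for illustration.

```python
import re

def cell_line_samples(soft_text):
    """Return {GSM: {"cell_line": ..., "srx": ...}} for samples annotated with a cell line.

    Assumption: family SOFT layout with "^SAMPLE = GSM..." headers and
    "!Sample_..." attribute lines below each header.
    """
    samples = {}
    current = None
    for raw in soft_text.splitlines():
        line = raw.strip()
        if line.startswith("^"):
            # A new entity starts; only track it if it is a sample.
            current = line.split(" = ", 1)[1] if line.startswith("^SAMPLE = ") else None
        elif current and line.startswith("!Sample_characteristics_ch1 = "):
            key, sep, val = line.split(" = ", 1)[1].partition(":")
            if sep and key.strip().lower() == "cell line":
                samples.setdefault(current, {})["cell_line"] = val.strip()
        elif current and line.startswith("!Sample_relation = SRA:"):
            match = re.search(r"SRX\d+", line)
            if match:
                samples.setdefault(current, {})["srx"] = match.group(0)
    # Keep only samples that actually carry a cell line annotation.
    return {gsm: info for gsm, info in samples.items() if "cell_line" in info}

# Made-up example input (accessions are placeholders).
SOFT_EXAMPLE = """
^SAMPLE = GSM0000001
!Sample_characteristics_ch1 = cell line: HeLa
!Sample_characteristics_ch1 = treatment: none
!Sample_relation = SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX0000001
^SAMPLE = GSM0000002
!Sample_characteristics_ch1 = tissue: liver
!Sample_relation = SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX0000002
"""

print(cell_line_samples(SOFT_EXAMPLE))
```

    Submitters do not all use the same key ("cell line", "cell_line", "cell type", ...), so the exact match on "cell line" here will probably need loosening for real data.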

    The second step would be simple in comparison:
    • 2) Filter the results to only contain cell lines that are included in the list of the specific 1000 cell lines.
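
    Step 2 is mostly a set lookup, but cell line names are spelled inconsistently across submissions ("MCF-7", "MCF7", "mcf 7"). A small sketch, assuming the samples are held in a dict of the form {GSM: {"cell_line": name}}; the normalisation rule and function names are illustrative assumptions, not an established standard.

```python
import re

def normalize(name):
    """Lower-case and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def keep_wanted(samples, wanted_names):
    """Keep only samples whose normalised cell line name is in the wanted list."""
    wanted = {normalize(n) for n in wanted_names}
    return {
        gsm: info
        for gsm, info in samples.items()
        if normalize(info["cell_line"]) in wanted
    }

# Made-up example data.
samples = {
    "GSM1": {"cell_line": "MCF-7"},
    "GSM2": {"cell_line": "HeLa"},
    "GSM3": {"cell_line": "mcf7"},
}
print(keep_wanted(samples, ["MCF7"]))
```

    Normalising this aggressively can merge names that are genuinely different lines, so it is worth eyeballing the matches before downloading anything.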

    Then comes another big question mark for me: how do I go from this list, containing all the info available in the SOFT file, to downloading all the corresponding FASTQ files from SRA? The SRX ID is available in the SOFT file, but I think that fastq-dump requires SRR IDs... So:
    • 3) Find each SRR associated with all the SRXs in each SOFT file from the list above.
    • 4) Read the appropriate metadata to see if the data is paired-end or single-end.
    • 5) Download the data using fastq-dump as appropriate.
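
    Steps 3 to 5 can be sketched from an SRA run-info table. This assumes a CSV with at least the columns Run, Experiment and LibraryLayout (the layout of the run-info export from SRA); how that table is obtained is left open here. The function names and accessions are hypothetical, and the commands are only built, not executed.

```python
import csv
import io
import shlex

def runs_for_experiments(runinfo_csv, wanted_srx):
    """Step 3 and 4: list the SRR runs (with layout) belonging to the wanted SRX IDs."""
    runs = []
    for row in csv.DictReader(io.StringIO(runinfo_csv)):
        if row["Experiment"] in wanted_srx:
            runs.append({
                "srr": row["Run"],
                "srx": row["Experiment"],
                "layout": row["LibraryLayout"].upper(),
            })
    return runs

def fastq_dump_command(run, outdir="fastq"):
    """Step 5: build a fastq-dump call; paired-end runs get --split-files."""
    cmd = ["fastq-dump", "--gzip", "--outdir", outdir]
    if run["layout"] == "PAIRED":
        cmd.append("--split-files")
    cmd.append(run["srr"])
    return cmd

# Made-up run-info table (accessions are placeholders).
RUNINFO = """Run,Experiment,LibraryLayout
SRR0000001,SRX0000001,PAIRED
SRR0000002,SRX0000001,PAIRED
SRR0000003,SRX0000002,SINGLE
SRR0000004,SRX0000009,SINGLE
"""

runs = runs_for_experiments(RUNINFO, {"SRX0000001", "SRX0000002"})
for run in runs:
    print(shlex.join(fastq_dump_command(run)))
```

    Note that one SRX can map to several SRRs, so the run, not the experiment, is the natural unit for downloading.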

    Is this something feasible? Am I going about the problem the right way? Maybe I'm doing it all wrong and there's a simple solution that I'm not seeing. How would you do this, given the project outline? A big problem I foresee (other than not actually having a good idea how to perform all the steps) is how to keep the metadata properly connected to the raw data... It's (of course) quite important to be able to stratify the end results based on the metadata, as that is a big part of the reason why I want to do this project.
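
    One possible way to keep metadata and raw data connected is a single manifest table with one row per SRR run, carrying the SRX, GSM, GSE and cell line along with it; downstream results can then be joined back on the run accession. A minimal sketch under that assumption (the column names, function names and accessions are all hypothetical):

```python
import csv
import io

FIELDS = ["run", "experiment", "sample", "series", "cell_line", "layout"]

def build_manifest(series_id, samples, runs):
    """Join per-sample metadata (keyed by GSM) to per-run records via the SRX."""
    by_srx = {info["srx"]: (gsm, info) for gsm, info in samples.items()}
    rows = []
    for run in runs:
        gsm, info = by_srx[run["srx"]]
        rows.append({
            "run": run["srr"],
            "experiment": run["srx"],
            "sample": gsm,
            "series": series_id,
            "cell_line": info["cell_line"],
            "layout": run["layout"],
        })
    return rows

def manifest_csv(rows):
    """Serialise the manifest as CSV text (write it next to the FASTQ files)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Made-up example data.
samples = {"GSM0000001": {"cell_line": "HeLa", "srx": "SRX0000001"}}
runs = [
    {"srr": "SRR0000001", "srx": "SRX0000001", "layout": "PAIRED"},
    {"srr": "SRR0000002", "srx": "SRX0000001", "layout": "PAIRED"},
]
rows = build_manifest("GSE0000001", samples, runs)
print(manifest_csv(rows))
```

    Naming the FASTQ files and any result files after the run accession makes the join back to this table trivial.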

    Ideas, suggestions, tips? Fully fleshed-out solutions are also acceptable ;-)

  • #2
    I managed to solve it myself, so I'm posting the solution in case anybody else happens upon the same problem. I first used the NCBI Entrez Direct CLI (http://www.ncbi.nlm.nih.gov/books/NBK179288/) to query GEO and find all the available RNA-seq data and its GSE accession numbers. I parsed this list using the GEOquery R package https://bioconductor.org/packages/re.../GEOquery.html and downloaded all the corresponding SOFT files, on which I performed additional filtering, yielding a list of SRX accessions. I converted these to SRR accessions using SRAdb https://bioconductor.org/packages/re...tml/SRAdb.html, and then downloaded the raw FASTQ data using fastq-dump from sra-tools.
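
    For anyone wanting to reproduce the first step, the query and an Entrez Direct pipeline could be assembled roughly like this. The post does not give the exact commands used, so this is a guess at one workable form: it assumes the GEO DataSets database ("gds") and the esearch / efetch / xtract tools from Entrez Direct, and it only builds the command string rather than running it. The function names are hypothetical.

```python
import shlex

DATASET_TYPES = [
    "expression profiling by high throughput sequencing",
    "non coding rna profiling by high throughput sequencing",
]

def geo_rnaseq_query(organism="Homo sapiens"):
    """Build the GEO search term from the first post, with balanced parentheses."""
    type_clause = " OR ".join(f'"{t}"[DataSet Type]' for t in DATASET_TYPES)
    return f"({type_clause}) AND {organism}[Organism]"

def edirect_pipeline(query):
    """Assumed shell pipeline: search GEO DataSets and print record accessions."""
    return " | ".join([
        f"esearch -db gds -query {shlex.quote(query)}",
        "efetch -format docsum",
        "xtract -pattern DocumentSummary -element Accession",
    ])

query = geo_rnaseq_query()
print(query)
print(edirect_pipeline(query))
```

    The accession list may contain more than series records, so keeping only the entries that start with "GSE" is probably needed before handing them to GEOquery.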

    It took a while to parse all the SOFT files with R, but since I had to do the additional filtering on criteria that are only found in the SOFT files rather than in the GEO query itself, that's what worked in my specific case. It'd be much faster if I could filter more in the starting query itself, rather than after downloading all the additional metadata, much of which turned out to be unnecessary.
