Ok, so I'm about to start the largest project I've undertaken in terms of data handling and analysis. In short, the project is about finding public datasets in the Gene Expression Omnibus (GEO) that pass a set of filters, downloading all the raw data (i.e. FASTQ files) and performing a set of analyses on the data, while keeping the metadata connected to the results. I've done some googling, but I'm at a loss as to how to go about it.
What I'm looking for is RNA-seq data for human cell lines, but only for a specific set of 1000 cell lines. I want to find the FASTQ files and metadata for these datasets and perform analyses on them. There are several steps; I have ideas for some of them, but for others I have no clue how to proceed.
First, the GEO query itself. I can easily search for human RNA-seq data (query: ("expression profiling by high throughput sequencing"[DataSet Type] OR "non coding rna profiling by high throughput sequencing"[DataSet Type]) AND Homo sapiens[Organism]; 3298 GEO series), but I'm not sure how to restrict the search to cell lines. Just adding "cell line" to [Any Field] seems too simple, and might miss GEO series. There is a field inside the GEO SOFT files called "Sample_characteristics_ch1 = <value>", which can be set to "cell line: <cell line name>". (No, I'm not sure exactly what the "_ch1" part means...) I was thinking of downloading all the SOFT files for the series above and then filtering on sample characteristics that include "cell line". The first step would then be:
- 1) Get identifiers for all the series in the query, download all the SOFT files and filter them to include cell lines
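To make step 1 concrete, here is a minimal sketch of the filtering I have in mind, assuming the SOFT files are already downloaded and read in as text. The function names are my own, and I'm assuming the characteristic is always written as "cell line: <name>"; in practice submitters probably use variants (e.g. "cell_line" or "cell type") that this would miss.

```python
import re


def parse_soft_samples(soft_text):
    """Return {GSM id: {"characteristics": {...}, "srx": str or None}}.

    Only reads the ^SAMPLE blocks of a family SOFT file; series and
    platform blocks are skipped.
    """
    samples = {}
    current = None
    for line in soft_text.splitlines():
        line = line.strip()
        if line.startswith("^SAMPLE"):
            gsm = line.split("=", 1)[1].strip()
            current = {"characteristics": {}, "srx": None}
            samples[gsm] = current
        elif line.startswith("^"):
            current = None  # some other entity (^SERIES, ^PLATFORM, ...)
        elif current is None:
            continue
        elif line.startswith("!Sample_characteristics_ch1"):
            value = line.split("=", 1)[1].strip()
            key, sep, val = value.partition(":")
            if sep:  # keep only "key: value" style characteristics
                current["characteristics"][key.strip().lower()] = val.strip()
        elif line.startswith("!Sample_relation"):
            # The SRA relation line carries a link ending in the SRX accession.
            match = re.search(r"\b[SED]RX\d+", line)
            if match:
                current["srx"] = match.group(0)
    return samples


def cell_line_samples(samples):
    """Keep only samples that declare a 'cell line' characteristic."""
    return {
        gsm: info
        for gsm, info in samples.items()
        if "cell line" in info["characteristics"]
    }
```

Running parse_soft_samples on each series' SOFT text and then cell_line_samples on the result would give me the GSM IDs, the cell line names and the SRX IDs in one go.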
The second step would be simple in comparison:
- 2) Filter the results to only contain cell lines that are included in the list of the specific 1000 cell lines.
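For step 2 I imagine something like the sketch below. The normalisation (lower-casing and dropping everything that isn't a letter or digit, so that "MCF-7" and "mcf7" match) is purely my assumption about how names differ between submitters; it could also merge cell lines that are genuinely distinct, so the matches would need a manual check.

```python
import re


def normalize(name):
    """Lower-case a cell line name and strip punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def filter_to_targets(sample_cell_lines, target_names):
    """Map each sample to its canonical target name, dropping non-targets.

    sample_cell_lines: {GSM id: cell line name as written in the SOFT file}
    target_names: the list of cell lines I care about
    """
    targets = {normalize(name): name for name in target_names}
    kept = {}
    for gsm, raw_name in sample_cell_lines.items():
        canonical = targets.get(normalize(raw_name))
        if canonical is not None:
            kept[gsm] = canonical
    return kept
```

Storing the canonical name rather than the submitter's spelling should make it easier to group results by cell line later on.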
Then comes another big question mark for me: how do I go from this list, containing all the info available in the SOFT file, to downloading all the corresponding FASTQ files from SRA? The SRX ID is available in the SOFT file, but I think that fastq-dump requires SRR IDs... So:
- 3) Find each SRR associated with all the SRXs in each SOFT file from the list above.
- 4) Read the appropriate metadata to see if the data is paired-end or single-end.
- 5) Download the data using fastq-dump as appropriate.
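For steps 3 to 5, my rough idea is sketched below. It assumes I can get hold of an SRA RunInfo table in CSV form for the experiments (I believe the SRA website can export one, but I haven't checked how well that scales) and that it has the columns "Run", "Experiment" and "LibraryLayout". The function names and the output directory are my own placeholders, and the sketch only builds the fastq-dump command lines without running them.

```python
import csv
import io


def runs_by_experiment(runinfo_csv_text):
    """Group runs (SRR) by experiment (SRX) from a RunInfo CSV.

    Returns {SRX: [{"run": SRR, "paired": bool}, ...]}.
    """
    runs = {}
    for row in csv.DictReader(io.StringIO(runinfo_csv_text)):
        run = row.get("Run")
        # Concatenated RunInfo files can repeat the header or contain blanks.
        if not run or run == "Run":
            continue
        paired = row["LibraryLayout"].strip().upper() == "PAIRED"
        runs.setdefault(row["Experiment"], []).append(
            {"run": run, "paired": paired}
        )
    return runs


def fastq_dump_command(run, paired, outdir="fastq"):
    """Build the fastq-dump argument list for one run."""
    cmd = ["fastq-dump", "--gzip", "--outdir", outdir]
    if paired:
        cmd.append("--split-files")  # write mates to _1 and _2 files
    cmd.append(run)
    return cmd
```

Each command list could then be handed to subprocess.run, or written to a shell script and submitted to a cluster.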
Is this feasible? Am I going about the problem the right way? Maybe I'm doing it all wrong and there's a simple solution that I'm not seeing. How would you do this, given the project outline? A big problem I foresee (other than not actually having a good idea how to perform all the steps) is how to keep the metadata properly connected to the raw data... It's (of course) quite important to be able to stratify the end results based on the metadata, as that is a big part of the reason why I want to do this project.
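The only idea I have so far for the metadata problem is a single manifest table with one row per run, so that every FASTQ file (named after its SRR) can be traced back to its experiment, sample, series and cell line. A minimal sketch, where the field names and the input layout are entirely my own invention:

```python
import csv
import io

MANIFEST_FIELDS = ["run", "experiment", "sample", "series", "cell_line", "layout"]


def build_manifest(samples, runs):
    """Join sample metadata with runs, producing one row per run.

    samples: list of dicts with keys experiment, sample, series, cell_line
    runs: {SRX: [{"run": SRR, "paired": bool}, ...]}
    """
    rows = []
    for sample in samples:
        for run in runs.get(sample["experiment"], []):
            rows.append({
                "run": run["run"],
                "experiment": sample["experiment"],
                "sample": sample["sample"],
                "series": sample["series"],
                "cell_line": sample["cell_line"],
                "layout": "PAIRED" if run["paired"] else "SINGLE",
            })
    return rows


def manifest_to_tsv(rows):
    """Serialise the manifest rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=MANIFEST_FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Any result table keyed by run ID could then be merged with this manifest to stratify by cell line or series.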
Ideas, suggestions, tips? Fully fleshed-out solutions are also acceptable ;-)