TCGA data now seems to live at the Genomic Data Commons (GDC): https://gdc.nci.nih.gov/
The data is "accessible" via a RESTful interface.
It appears to be a mishmash jammed into some kind of NoSQL database(?). Figuring out, for instance, which RNA-seq BAM file IDs and "legacy" barcodes are associated with the TCGA kidney cancer projects (KICH, KIRC, KIRP) is a real pain.
What I want is just a dump of all the data into a flat file (or a few flat files) ... which I can then parse using jq, sed, awk, grep, custom C and Python programs, etc.
So ... has anybody come up with a script to suck all the "meta" data from GDC via their API?
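To make the request concrete, here is a rough sketch of the kind of script I have in mind, not a tested solution. It assumes the API lives at `https://api.gdc.cancer.gov/files`, that it accepts `filters`, `fields`, `from` and `size` query parameters, and that the field names below (`cases.project.project_id`, `experimental_strategy`, `data_format`, `cases.samples.submitter_id`) are the right ones. All of that comes from my reading of the GDC docs and may well be wrong. I am also assuming that the "legacy" barcodes correspond to the `submitter_id` values. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Assumption: these field names match the GDC data model.
FIELDS = [
    "file_id",
    "file_name",
    "cases.submitter_id",
    "cases.samples.submitter_id",
    "cases.project.project_id",
]


def build_params(project_ids, start=0, size=1000):
    """Build query parameters selecting RNA-seq BAM files for the given projects."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": list(project_ids)}},
            {"op": "=", "content": {"field": "experimental_strategy",
                                    "value": "RNA-Seq"}},
            {"op": "=", "content": {"field": "data_format", "value": "BAM"}},
        ],
    }
    return {
        "filters": json.dumps(filters),
        "fields": ",".join(FIELDS),
        "format": "JSON",
        "from": str(start),
        "size": str(size),
    }


def flatten_hit(hit):
    """Turn one nested file record into flat (file, project, case, sample) rows."""
    rows = []
    for case in hit.get("cases", []):
        project = case.get("project", {}).get("project_id", "")
        # Emit one row with an empty sample barcode if no samples are listed.
        for sample in case.get("samples") or [{}]:
            rows.append((
                hit.get("file_id", ""),
                hit.get("file_name", ""),
                project,
                case.get("submitter_id", ""),
                sample.get("submitter_id", ""),
            ))
    return rows


def fetch_all(project_ids, size=1000):
    """Page through the results and yield flat rows (requires network access)."""
    start = 0
    while True:
        query = urllib.parse.urlencode(build_params(project_ids, start, size))
        with urllib.request.urlopen(GDC_FILES_ENDPOINT + "?" + query) as resp:
            data = json.load(resp)["data"]
        hits = data["hits"]
        for hit in hits:
            yield from flatten_hit(hit)
        start += len(hits)
        if not hits or start >= data["pagination"]["total"]:
            break
```

Looping over `fetch_all(["TCGA-KICH", "TCGA-KIRC", "TCGA-KIRP"])` and printing each row joined with tabs would give the sort of flat file I could then feed to awk or grep. What I am really after, though, is something that dumps all of the metadata, not just this one slice.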