Hi everybody,
We work heavily with TCGA datasets, but we are a bit stunned by the lack of documentation (maybe we just haven't found it).
Several highly relevant questions practically prevent us from using TCGA data until they are resolved. I am listing them here in the hope that together we can gather the missing bits and pieces of information.
1) Are the BAM files available through dbGaP (as of today) exactly the ones that were used to generate the MAF (mutation annotation format) files available from the TCGA bulk download site?
2) Where is the exact version of the reference genome documented, both a) for the TCGA BAM file generation and b) for the MAF files?
3) It seems that for any given cancer dataset, TCGA provides SRA files for some patients, BAM files for some patients, and both for some patients. For the BAM files, various aligners have been used even within a single dataset and, as pointed out above, hard evidence of the exact reference genome version is difficult to find.
In other words, to generate a consistent dataset covering all patients, one needs to a) process the SRA files against a known and recent genome version all the way to mutation calls, and b) extract the raw sequence data from the old BAM files, realign it against the same known and recent genome version, and call mutations. The latter needs tools that can extract sequences from TCGA BAM files (any suggestions or experience with these?). And overall, has anybody processed a TCGA dataset all the way from raw data to mutations using a recent genome? If so, could you please share the details?
4) The TCGA bulk download site provides MAF files listing tens to hundreds of mutations per sample. Any attempt to call mutations from the BAM files of the exact same samples with VarScan, UnifiedGenotyper, etc. easily yields thousands of mutations, with no clear flags for filtering them down to tens or hundreds. Does anybody know how TCGA derived these MAF files from the BAM files they provide?
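One possible factor behind the gap in question 4, offered as an assumption and not as TCGA's documented procedure: if the MAF files list only somatic mutations, then a single-sample caller run on a tumor BAM would also report every germline variant the patient carries. A minimal sketch of the simplest tumor-minus-normal subtraction follows. It assumes both samples were called to VCF; the names `Variant`, `parse_vcf_lines` and `somatic_candidates` are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Minimal variant key: position plus reference and alternate allele."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> List[Variant]:
    """Read the CHROM, POS, REF and ALT columns from VCF text lines."""
    variants = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # A VCF record may list several comma-separated alternate alleles.
        for alt in alts.split(","):
            variants.append(Variant(chrom, pos, ref, alt))
    return variants


def somatic_candidates(tumor: List[Variant], normal: List[Variant]) -> List[Variant]:
    """Keep tumor calls that are absent from the matched normal sample.

    Assumption: this plain set difference only illustrates the idea. It ignores
    coverage, base quality and allele frequency, all of which a real somatic
    pipeline would have to consider.
    """
    normal_set = set(normal)
    return [v for v in tumor if v not in normal_set]
```

This would not reproduce a published MAF on its own, but it shows why raw per-sample calls and somatic lists can differ by an order of magnitude or more.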
I realize that the studies were most likely analyzed in separate places, so the data analysis solutions have to be worked out study by study. But that doesn't change the fact that these details are needed in order to really use TCGA data. Unfortunately, the publications from these studies neither explicitly claim nor disclaim that the function parameters, reference genome versions, or the entire data analysis pipeline described in the paper are the ones that generated the data provided on the TCGA site.
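Until proper documentation turns up, one place to look for the reference genome and aligner is the BAM header itself, which can be dumped as text with `samtools view -H`. The sketch below summarizes such a header using the standard SAM tags (`AS` and `UR` on `@SQ` lines, `PN` and `VN` on `@PG` lines). Whether a given TCGA BAM actually fills in these optional tags is an assumption, and `summarize_header` is a made-up helper name.

```python
from typing import Dict, List, Tuple


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split one SAM header line into its record type and its TAG:VALUE pairs."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only, since values such as URIs contain colons.
        key, _, value = field.partition(":")
        tags[key] = value
    return fields[0], tags


def summarize_header(header_text: str) -> dict:
    """Collect reference and aligner hints from SAM header text."""
    assemblies, uris = set(), set()
    contigs: Dict[str, int] = {}
    programs: List[Tuple[str, str]] = []
    for line in header_text.splitlines():
        if not line.startswith("@") or line.startswith("@CO"):
            continue  # @CO lines are free-text comments without tags
        record_type, tags = parse_header_line(line)
        if record_type == "@SQ":
            contigs[tags.get("SN", "?")] = int(tags.get("LN", 0))
            if "AS" in tags:
                assemblies.add(tags["AS"])  # genome assembly identifier
            if "UR" in tags:
                uris.add(tags["UR"])  # location of the reference sequence
        elif record_type == "@PG":
            # Program name (falling back to its ID) and version, if recorded.
            programs.append((tags.get("PN", tags.get("ID", "?")), tags.get("VN", "?")))
    return {
        "assemblies": sorted(assemblies),
        "reference_uris": sorted(uris),
        "contigs": contigs,
        "programs": programs,
    }
```

Even when the `AS` and `UR` tags are missing, the contig names and lengths collected in `contigs` can be compared against the published lengths of candidate genome builds, since these differ between builds.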