Old 04-30-2012, 04:05 AM   #1
Junior Member
Location: Helsinki, Finland

Join Date: Sep 2011
Posts: 3
TCGA data analysis details

Hi everybody,

We are working heavily with TCGA datasets, but we are also a bit stunned by the lack of documentation (maybe we just haven't found it).

There are several highly relevant questions that practically prevent TCGA data usage unless they can be solved. I am listing these questions here in the hope that together we can gather the bits and pieces of information required.

1) Are the BAM files available through dbGaP (as of today) exactly the ones that were used to generate the MAF (Mutation Annotation Format) files available from the TCGA bulk download site?

2) Where is the exact version of the reference genome documented, both a) for TCGA BAM file generation and b) behind the MAF files?

3) It seems that for any given cancer dataset TCGA provides SRA files for some patients, BAM files for some patients and both for some patients. In the case of BAM files, various aligners have been used even within a single dataset and, as pointed out above, finding hard evidence of the exact reference genome version is difficult.

In other words, to generate a consistent dataset covering all patients of such a study, one needs to a) process the SRA files with a known and recent genome version all the way to mutation calls, and b) extract the raw sequence data from the old BAM files, realign it against the same known and recent genome version, and call mutations. The latter needs tools that can extract sequences from TCGA BAM files (any suggestions or experience with these?). And overall, has anybody processed a TCGA dataset all the way from raw data to mutations using a recent genome? If so, could you please share the details?

4) The TCGA bulk download site provides MAF (Mutation Annotation Format) files listing tens to hundreds of mutations per sample. Any attempt to call mutations from the BAM files of the exact same samples with VarScan, UnifiedGenotyper, etc. easily yields thousands of mutations, with no clear flags for filtering the data down to tens or hundreds of mutations. Does anybody know how TCGA has produced these MAF files from the BAM files they provide?
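
To put a number on "tens to hundreds of mutations per sample" for a given MAF, the rows can be tallied per tumor sample. This is only a minimal sketch: it assumes a tab-separated MAF with optional "#" comment lines followed by a header row that contains a Tumor_Sample_Barcode column, and the function name maf_count_per_sample is made up for illustration. It says nothing about how TCGA filtered the calls.

```shell
# maf_count_per_sample FILE
# Count data rows per Tumor_Sample_Barcode in a tab-separated MAF file.
# The column is located by its header name, not by a fixed position.
maf_count_per_sample() {
  awk -F '\t' '
    /^#/ { next }                      # skip comment lines such as "#version 2.3"
    !col {                             # first non-comment line is the header row
      for (i = 1; i <= NF; i++)
        if ($i == "Tumor_Sample_Barcode") col = i
      if (!col) {
        print "Tumor_Sample_Barcode column not found" > "/dev/stderr"
        exit 1
      }
      next
    }
    { n[$col]++ }                      # one data row = one reported mutation
    END { for (s in n) print s "\t" n[s] }
  ' "$1" | sort
}
```

Running the same tally on your own call set (after converting it to a comparable table) gives a quick per-sample view of how far apart the two lists are.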

I realize that the studies are most likely analyzed in separate places, so the data analysis questions have to be solved study by study. But that doesn't change the fact that these details are needed in order to really use TCGA data. Unfortunately, the publications from these studies neither explicitly claim nor disclaim that the function parameters, reference genome versions or the entire data analysis pipeline described in the paper are the ones that generated the data provided on the TCGA site.
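
One place to look for evidence of the reference genome and aligner is the BAM header itself. The sketch below reads SAM header text on stdin and reports the AS (assembly) and UR (reference URI) tags of the @SQ lines plus the programs named in the @PG lines. These tags are optional in the SAM format, so whether a particular TCGA BAM fills them in is an assumption you would have to check. The function name summarize_sam_header is made up for illustration.

```shell
# summarize_sam_header
# Read SAM/BAM header text on stdin; report programs (@PG) and
# reference-related tags (@SQ AS/UR). All of these tags are optional.
summarize_sam_header() {
  awk -F '\t' '
    $1 == "@SQ" {
      nseq++
      for (i = 2; i <= NF; i++) {
        tag = substr($i, 1, 2); val = substr($i, 4)
        if (tag == "AS") asm[val]++    # genome assembly identifier
        if (tag == "UR") uri[val]++    # URI of the reference sequence
      }
    }
    $1 == "@PG" {
      id = ""; pn = ""; vn = ""
      for (i = 2; i <= NF; i++) {
        tag = substr($i, 1, 2); val = substr($i, 4)
        if (tag == "ID") id = val      # program record identifier
        if (tag == "PN") pn = val      # program name
        if (tag == "VN") vn = val      # program version
      }
      print "program\t" (pn != "" ? pn : id) "\t" vn
    }
    END {
      print "sequences\t" (nseq + 0)
      for (a in asm) print "assembly\t" a
      for (u in uri) print "reference_uri\t" u
    }
  '
}
```

With samtools installed, something like "samtools view -H tumor.bam | summarize_sam_header" would feed it the header (tumor.bam is a placeholder name). If no assembly line comes out, comparing the @SQ names and lengths against candidate reference builds is the fallback.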
skilpinen is offline   Reply With Quote
Old 03-05-2013, 11:57 AM   #2
Location: New York

Join Date: Jul 2012
Posts: 12

I seem to be in a similar predicament. I am unable to extract some necessary data from MAF files, such as the forward read counts for the tumor and for the normal sample.
Philcolson is offline   Reply With Quote
Old 03-07-2013, 02:17 PM   #3
Location: USA

Join Date: Mar 2010
Posts: 50

VCF files should soon begin to trickle out for many TCGA cases.

You can find them using the "Find Archives" GUI

or via the protected access data portal.

The DP4 field will provide (high-quality) strand-specific read counts for reads supporting the reference and variant alleles:

##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
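
As a rough sketch of how those counts could be pulled out once a VCF is in hand: the function below assumes DP4 is a per-sample FORMAT key, as in the header line above, and that sample names come from the #CHROM line. The name vcf_dp4 is made up for illustration. A file that carries DP4 in the INFO column instead would simply produce no output here.

```shell
# vcf_dp4 FILE
# Print CHROM, POS, REF, ALT, sample name and the four DP4 counts
# (ref-forward, ref-reverse, alt-forward, alt-reverse), one line per sample.
vcf_dp4() {
  awk -F '\t' -v OFS='\t' '
    /^##/ { next }                               # meta-information lines
    /^#CHROM/ {                                  # column header: samples start at column 10
      for (i = 10; i <= NF; i++) name[i] = $i
      next
    }
    NF >= 10 {
      n = split($9, key, ":")                    # FORMAT column lists the per-sample keys
      idx = 0
      for (k = 1; k <= n; k++) if (key[k] == "DP4") idx = k
      if (!idx) next                             # record has no DP4 key
      for (i = 10; i <= NF; i++) {
        split($i, val, ":")
        if (val[idx] == "" || val[idx] == ".") continue
        split(val[idx], d, ",")
        print $1, $2, $4, $5, name[i], d[1], d[2], d[3], d[4]
      }
    }
  ' "$1"
}
```

The forward tumor and forward normal read counts asked about earlier in the thread would then be the ref-forward and alt-forward columns of the tumor and normal rows.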

Here is the latest TCGA VCF filespec:
m_two is offline   Reply With Quote
Old 03-07-2013, 02:55 PM   #4
Richard Finney
Senior Member
Location: bethesda

Join Date: Feb 2009
Posts: 700

Just want to clarify (if any of this is wrong, please correct me):

The MAF files are (mostly) public. Level 3 MAF files contain verified mutations; Level 2 may contain unvalidated ones (source: )

The VCF files are "protected".

The 2.3 MAF columns are

$ head -1 genome.wustl.edu_BRCA.IlluminaGA_DNASeq.Level_2.3.2.0.somatic.maf
#version 2.3
$ head -2 genome.wustl.edu_BRCA.IlluminaGA_DNASeq.Level_2.3.2.0.somatic.maf | grep -v "#" | tr "\t" "\n" | nl
1 Hugo_Symbol
2 Entrez_Gene_Id
3 Center
4 NCBI_Build
5 Chromosome
6 Start_position
7 End_position
8 Strand
9 Variant_Classification
10 Variant_Type
11 Reference_Allele
12 Tumor_Seq_Allele1
13 Tumor_Seq_Allele2
14 dbSNP_RS
15 dbSNP_Val_Status
16 Tumor_Sample_Barcode
17 Matched_Norm_Sample_Barcode
18 Match_Norm_Seq_Allele1
19 Match_Norm_Seq_Allele2
20 Tumor_Validation_Allele1
21 Tumor_Validation_Allele2
22 Match_Norm_Validation_Allele1
23 Match_Norm_Validation_Allele2
24 Verification_Status
25 Validation_Status
26 Mutation_Status
27 Sequencing_Phase
28 Sequence_Source
29 Validation_Method
30 Score
31 BAM_file
32 Sequencer
33 Tumor_Sample_UUID
34 Matched_Norm_Sample_UUID
35 chromosome_name_WU
36 start_WU
37 stop_WU
38 reference_WU
39 variant_WU
40 type_WU
41 gene_name_WU
42 transcript_name_WU
43 transcript_species_WU
44 transcript_source_WU
45 transcript_version_WU
46 strand_WU
47 transcript_status_WU
48 trv_type_WU
49 c_position_WU
50 amino_acid_change_WU
51 ucsc_cons_WU
52 domain_WU
53 all_domains_WU
54 deletion_substructures_WU
55 annotation_errors_WU
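
Since the extra center-specific columns can differ between files, picking columns by name is safer than by number. Here is a small sketch that prints the named columns of a MAF, header row included. It assumes a tab-separated file with optional "#" comment lines before the header, and the function name maf_select is made up for illustration.

```shell
# maf_select FILE COLUMN...
# Print the named columns of a tab-separated MAF, header row first.
maf_select() {
  file=$1; shift
  cols=$(IFS=,; printf '%s' "$*")      # join the requested names with commas
  awk -F '\t' -v OFS='\t' -v cols="$cols" '
    BEGIN { nwant = split(cols, want, ",") }
    /^#/ { next }                      # skip comment lines such as "#version 2.3"
    !seen_header {                     # map header names to column positions
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (k = 1; k <= nwant; k++)
        if (!(want[k] in pos)) {
          print "no such column: " want[k] > "/dev/stderr"
          exit 1
        }
      seen_header = 1
    }
    {                                  # header row and data rows are printed alike
      line = $(pos[want[1]])
      for (k = 2; k <= nwant; k++) line = line OFS $(pos[want[k]])
      print line
    }
  ' "$file"
}
```

For example, "maf_select file.maf NCBI_Build | tail -n +2 | sort | uniq -c" would list the reference builds recorded in the file and how many rows claim each, which is at least the MAF's own statement about the genome version behind it.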

VCF files will contain the DP4 (fwd/rev) information, which is not in the MAF files.

The easiest way for me to deal with the TCGA data warehouse is to just spider the site and collect the file names into a file to use with wget.

Last edited by Richard Finney; 03-07-2013 at 02:58 PM.
Richard Finney is offline   Reply With Quote
Old 03-07-2013, 03:08 PM   #5
Location: USA

Join Date: Mar 2010
Posts: 50

For TCGA there should only be Level 2 MAF files. Level 3 sequence data would involve significantly mutated genes, domains, coding elements, and mutation hotspots.

The official MAF filespec headers can be found at
Only the first 34 columns are defined in the filespec.

TCGA VCF files should contain the AD or DP4 (fwd/rev) information which is not in most MAF files. The exact contents will depend on software support.
m_two is offline   Reply With Quote

maf, tcga
