  • skilpinen
    Junior Member
    • Sep 2011
    • 3

    TCGA data analysis details

    Hi everybody,

    We work heavily with TCGA datasets, but we are a bit stunned by the lack of documentation (maybe we just haven't found it).

    Several highly relevant questions practically prevent the use of TCGA data unless they can be resolved. I am listing them here in the hope that together we can gather the bits and pieces of information required.

    1) Are the BAM files available through dbGaP (to this day) exactly the ones that were used to generate the MAF (Mutation Annotation Format) files available from the TCGA bulk download site?

    2) Where is the exact version of the reference genome documented, both a) for TCGA BAM file generation and b) for the MAF files?

    3) It seems that for any given cancer dataset, TCGA provides SRA files for some patients, BAM files for others, and both for some. For the BAM files, various aligners have been used even within a single dataset and, as pointed out above, finding hard evidence of the exact reference genome version is difficult.

    In other words, to generate a consistent dataset covering all patients of such a dataset, one needs to a) process the SRA files against a known, recent genome version all the way to mutation calls, and b) extract the raw sequence data from the old BAM files, realign it against a known, recent genome version, and call mutations. The latter requires tools that can extract sequences from TCGA BAM files (any suggestions or experience with these?). More generally, has anybody processed a TCGA dataset all the way from raw data to mutations using a recent genome? If so, could you please share the details?

    4) The TCGA bulk download site provides MAF (Mutation Annotation Format) files listing tens to hundreds of mutations per sample. Any attempt to call mutations from the BAM files of the exact same samples with VarScan, UnifiedGenotyper, etc. easily yields thousands of mutations, with no clear flags for filtering the calls down to tens or hundreds. Does anybody know how TCGA produced these MAF files from the BAM files they provide?
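For the extraction step in question 3, existing tools such as `samtools fastq` (named `samtools bam2fq` in older releases) or Picard's SamToFastq are the usual route. To show what such a conversion has to do, here is a minimal Python sketch that works on SAM text records (for example the output of `samtools view`). The function name and the choice to skip records without stored bases or qualities are my own assumptions, not anything TCGA documents.

```python
# Sketch: turn SAM text records back into FASTQ entries.
# All names below are illustrative; this is not a TCGA or samtools API.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

FLAG_REVERSE = 0x10         # read is stored reverse-complemented
FLAG_FIRST = 0x40           # first read of a pair
FLAG_SECOND = 0x80          # second read of a pair
FLAG_SECONDARY = 0x100      # secondary alignment of a read
FLAG_SUPPLEMENTARY = 0x800  # supplementary (chimeric) alignment


def sam_record_to_fastq(line):
    """Return a 4-line FASTQ entry for one SAM record, or None to skip it."""
    fields = line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]

    # Emit each read once: ignore secondary and supplementary records.
    if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return None
    # Assumption: records without stored bases or qualities are skipped.
    if seq == "*" or qual == "*":
        return None

    # Reverse-strand alignments store the reverse complement of the read,
    # so undo that to recover the sequence as it came off the sequencer.
    if flag & FLAG_REVERSE:
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]

    if flag & FLAG_FIRST:
        name += "/1"
    elif flag & FLAG_SECOND:
        name += "/2"
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual)
```

Mates of a pair still need to be written to separate, identically ordered files before realignment, which the dedicated tools handle for you.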

    I realize that the studies are most likely analyzed at separate places, so the data analysis details have to be worked out study by study. But that doesn't change the fact that these details are needed in order to really use TCGA data. Unfortunately, the publications from these studies neither explicitly claim nor disclaim that the function parameters, reference genome versions, or the entire data analysis pipeline described in the paper are the ones that generated the data provided on the TCGA site.
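One place where reference and aligner details sometimes survive is the BAM header itself. Below is a rough Python sketch that summarizes the `@SQ` and `@PG` records from header text (for example from `samtools view -H`). The tag names (SN, LN, AS, UR, ID, PN, VN, CL) come from the SAM format; the function name and the shape of the returned dictionary are only my assumptions, and whether a given TCGA BAM actually fills in these tags has to be checked file by file.

```python
def summarize_sam_header(header_text):
    """Collect reference-genome and aligner hints from SAM header text."""
    summary = {
        "contigs": {},            # sequence name -> length
        "assemblies": set(),      # AS tags, e.g. an assembly identifier
        "reference_uris": set(),  # UR tags, location of the reference FASTA
        "programs": [],           # one dict per @PG record
    }
    for line in header_text.splitlines():
        parts = line.split("\t")
        record_type = parts[0]
        if record_type not in ("@SQ", "@PG"):
            continue
        # Header fields look like TAG:value; values may contain ':' themselves.
        tags = dict(p.split(":", 1) for p in parts[1:] if ":" in p)
        if record_type == "@SQ":
            if "SN" in tags and "LN" in tags:
                summary["contigs"][tags["SN"]] = int(tags["LN"])
            if "AS" in tags:
                summary["assemblies"].add(tags["AS"])
            if "UR" in tags:
                summary["reference_uris"].add(tags["UR"])
        else:
            summary["programs"].append({
                "id": tags.get("ID"),
                "name": tags.get("PN"),
                "version": tags.get("VN"),
                "command_line": tags.get("CL"),
            })
    return summary
```

Even when the AS and UR tags are missing, the contig names and lengths can be compared against candidate reference builds, since lengths differ between builds.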
  • Philcolson
    Member
    • Jul 2012
    • 12

    #2
    I seem to be in a similar predicament. I am unable to extract some necessary data from the MAF files, such as the forward-strand read counts for the tumor and the normal.


    • m_two
      Member
      • Mar 2010
      • 50

      #3
      VCF files should soon begin to trickle out for many TCGA cases.

      You can find them using the "Find Archives" GUI


      or via the protected access data portal.


      The DP4 field will provide (high-quality) strand-specific read counts for reads supporting the reference and variant alleles:

      ##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
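As a small illustration of reading that field, here is a Python sketch that pulls the four DP4 counts out of one sample column of a VCF record. It assumes DP4 is a per-sample FORMAT key, as in the header line above; the function name and the returned dictionary keys are made up for the example.

```python
def parse_dp4(format_field, sample_field):
    """Return strand-specific ref/alt counts from one VCF sample column.

    format_field is the FORMAT column (e.g. "GT:DP:DP4") and sample_field
    is the matching sample column (e.g. "0/1:29:10,12,3,4").
    Returns None when DP4 is absent or marked as missing.
    """
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    raw = fields.get("DP4")
    if raw is None or "." in raw:
        return None
    # Order follows the header description: ref-fwd, ref-rev, alt-fwd, alt-rev.
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in raw.split(","))
    return {
        "ref_fwd": ref_fwd,
        "ref_rev": ref_rev,
        "alt_fwd": alt_fwd,
        "alt_rev": alt_rev,
    }
```

For a tumor/normal VCF, calling this once per sample column would give the forward and reverse counts for each.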


      Here is the latest TCGA VCF filespec:


      • Richard Finney
        Senior Member
        • Feb 2009
        • 701

        #4
        Just want to clarify (if any of this is wrong, please correct me):

        The MAF files are (mostly) public. Level 3 MAF files contain verified mutations. Level 2 files may contain unvalidated mutations (source: https://wiki.nci.nih.gov/display/TCG...ion-Validation )

        The VCF files are "protected".

        The columns of a version 2.3 MAF file are:


        $ head -1 genome.wustl.edu_BRCA.IlluminaGA_DNASeq.Level_2.3.2.0.somatic.maf
        #version 2.3
        $ head -2 genome.wustl.edu_BRCA.IlluminaGA_DNASeq.Level_2.3.2.0.somatic.maf | grep -v "#" | tr "\t" "\n" | nl
        1 Hugo_Symbol
        2 Entrez_Gene_Id
        3 Center
        4 NCBI_Build
        5 Chromosome
        6 Start_position
        7 End_position
        8 Strand
        9 Variant_Classification
        10 Variant_Type
        11 Reference_Allele
        12 Tumor_Seq_Allele1
        13 Tumor_Seq_Allele2
        14 dbSNP_RS
        15 dbSNP_Val_Status
        16 Tumor_Sample_Barcode
        17 Matched_Norm_Sample_Barcode
        18 Match_Norm_Seq_Allele1
        19 Match_Norm_Seq_Allele2
        20 Tumor_Validation_Allele1
        21 Tumor_Validation_Allele2
        22 Match_Norm_Validation_Allele1
        23 Match_Norm_Validation_Allele2
        24 Verification_Status
        25 Validation_Status
        26 Mutation_Status
        27 Sequencing_Phase
        28 Sequence_Source
        29 Validation_Method
        30 Score
        31 BAM_file
        32 Sequencer
        33 Tumor_Sample_UUID
        34 Matched_Norm_Sample_UUID
        35 chromosome_name_WU
        36 start_WU
        37 stop_WU
        38 reference_WU
        39 variant_WU
        40 type_WU
        41 gene_name_WU
        42 transcript_name_WU
        43 transcript_species_WU
        44 transcript_source_WU
        45 transcript_version_WU
        46 strand_WU
        47 transcript_status_WU
        48 trv_type_WU
        49 c_position_WU
        50 amino_acid_change_WU
        51 ucsc_cons_WU
        52 domain_WU
        53 all_domains_WU
        54 deletion_substructures_WU
        55 annotation_errors_WU


        VCF files will contain the DP4 (fwd/rev) information, which is not in the MAF files.
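If you would rather pull MAF columns by name than by position, something like the following Python sketch should work. It only assumes what the listing above shows: tab-separated columns, a header row, and `#` comment lines such as `#version 2.3`. The function name is hypothetical and the sample row is invented for the example.

```python
import csv
import io


def read_maf(handle, columns=None):
    """Yield one dict per MAF row, skipping '#' comment lines.

    If columns is given, each yielded dict is restricted to those column names.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    # MAF is plain tab-separated text, so quote handling is switched off.
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if columns is None:
            yield row
        else:
            yield {name: row[name] for name in columns}


# Usage with an in-memory example (made-up values, not real TCGA data):
example = io.StringIO(
    "#version 2.3\n"
    "Hugo_Symbol\tChromosome\tStart_position\tTumor_Sample_Barcode\n"
    "GENE1\t1\t100\tSAMPLE-A\n"
)
rows = list(read_maf(example, columns=["Hugo_Symbol", "Start_position"]))
```

Selecting by name also keeps working when a center appends its own extra columns, like the `_WU` ones above.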


        The easiest way for me to deal with the TCGA data warehouse is to just spider the site and collect the file names into a file for use with wget.
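Roughly, the link-collection half of that can be sketched in Python with only the standard library. This assumes the site serves plain HTML directory listings; the class and function names, the suffix filter, and the example.org URL in the comment are placeholders of mine, not real TCGA paths.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def list_links(index_html, base_url, suffixes=(".maf", ".tar.gz")):
    """Return absolute URLs of links in index_html that end with one of suffixes."""
    parser = LinkCollector()
    parser.feed(index_html)
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if href.endswith(suffixes)
    ]


# Example: write the URLs one per line for later download.
# page = ...HTML of a directory listing fetched beforehand...
# with open("urls.txt", "w") as out:
#     out.write("\n".join(list_links(page, "https://example.org/data/")) + "\n")
```

The resulting file can then be handed to `wget -i urls.txt`.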
        Last edited by Richard Finney; 03-07-2013, 02:58 PM.


        • m_two
          Member
          • Mar 2010
          • 50

          #5
          For TCGA there should only be Level 2 MAF files. Level 3 sequence data would involve significantly mutated genes, domains, coding elements, and mutation hotspots.

          The official MAF filespec headers can be found at http://goo.gl/6Mv1T.
          Only the first 34 columns are defined in the filespec.

          TCGA VCF files should contain the AD or DP4 (fwd/rev) information, which is not in most MAF files. The exact contents will depend on software support.
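Since a given file may carry AD, DP4, or neither, code that reads these VCFs probably has to try both. Here is a hedged Python sketch of that fallback. It assumes AD follows the common convention of listing the reference depth first and the alternate allele depths after it, and that DP4 is ordered ref-fwd, ref-rev, alt-fwd, alt-rev; the function name is hypothetical.

```python
def ref_alt_depths(format_field, sample_field):
    """Return (ref_depth, alt_depth) from AD if present, else from DP4.

    Returns None when neither key carries usable values.
    """
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))

    ad = fields.get("AD")
    if ad and "." not in ad:
        counts = [int(x) for x in ad.split(",")]
        # Assumption: first value is the reference allele, the rest are alternates.
        return counts[0], sum(counts[1:])

    dp4 = fields.get("DP4")
    if dp4 and "." not in dp4:
        ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
        # DP4 is strand-specific, so collapse both strands per allele.
        return ref_fwd + ref_rev, alt_fwd + alt_rev

    return None
```

Note that only DP4 keeps the forward/reverse split; AD as assumed here gives per-allele totals only.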
