Hi, I am trying to analyze Illumina paired-end reads of a tumor/normal study with two lanes per sample (i.e. normal.lane1, normal.lane2, tumor.lane1, tumor.lane2) to assess somatic mutations. My intention is to work with a single bam per individual as recommended in the GATK best practice variant detection in tumor-normal studies (http://www.broadinstitute.org/gsa/wi...th_the_GATK_v2).
Therefore, the workflow I am trying to perform is the following:
Then, the bams of the two samples are merged:
On this single bam (which contains reads from the 4 lanes and also 4 lines of @RG readgroup annotations in the header), I performed two more processing steps:
Then I called the variants:
I used the --genotypeFilterExpression argument in the GATK_VariantFiltration step to have results at sample level, so what I've obtained by using a dummy example with a few reads per sample in the resulting vcf file looks like:
Therefore, ignoring the fact that the number of reads supporting the evidence is low in this case, could I interpret that the first SNP is a germline mutation (the same GT is present in both normal and tumor samples of the individual) whereas the second is a somatic mutation (the variant is present only in the normal sample)?
I'm quite unsure how is the best way to interpret these results, also I'm not convinced whether the workflow I've used is the optimal.
Any help/suggestion will be really appreciated!
cheers
David
Therefore, the workflow I am trying to perform is the following:
Code:
For each reads_file (lane): - alignment using Bfast bwaaln / localalign / postprocess - sam to bam using samtools view - add read groups using picard AddOrReplaceReadGroups - remove duplicates using picard RemoveDuplicates
Code:
sample.bam <- picard MergeSamFile (tumor.lane1.bam, tumor.lane2.bam, normal.lane1.bam, normal.lane2.bam)
Code:
realigned.bam <- GATK_realign(individual.bam) recal.bam <- GATK_recalibrate(realigned.bam)
Code:
GATK_call_for_variants(recal.bam) using the GATK UnifiedGenotyper and VariantFiltration
I used the --genotypeFilterExpression argument in the GATK_VariantFiltration step to have results at sample level, so what I've obtained by using a dummy example with a few reads per sample in the resulting vcf file looks like:
Code:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumor chr1 115401136 . G A 40.14 PASS (..) GT:AD:DP:GQ:PL 1/1:0,1:1:3:29,3,0 1/1:0,2:2:3.01:44,3,0 chr2 207625567 . T C 32.79 PASS (..) GT:AD:DP:GQ:PL ./. 1/1:0,2:2:6.01:64,6,0
I'm quite unsure how is the best way to interpret these results, also I'm not convinced whether the workflow I've used is the optimal.
Any help/suggestion will be really appreciated!
cheers
David
Comment