Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Somatic Mutations in dbSNP qqcandy Bioinformatics 14 07-27-2015 03:34 PM
Samtools mpileup_Paired Tumoral / Germline_keep only somatic mutations Sam64 Genomic Resequencing 2 02-29-2012 12:01 PM
Option "calmd"; Reporting indels and Somatic mutations for Whole Exome Seq data: angerusso Bioinformatics 0 01-10-2012 04:32 PM
How to find out SNP and point mutations in RNA-Seq data using Galaxy? sepulveda RNA Sequencing 0 12-29-2011 02:50 PM
Any pipeline to find automatically ORF in consensus sequences? Christopher Sauvage Bioinformatics 6 05-21-2010 06:09 AM

Thread Tools
Old 08-12-2011, 06:12 AM   #1
Location: spain

Join Date: Feb 2011
Posts: 60
Default Pipeline to find somatic mutations

Hi, I am trying to analyze Illumina paired-end reads of a tumor/normal study with two lanes per sample (i.e. normal.lane1, normal.lane2, tumor.lane1, tumor.lane2) to assess somatic mutations. My intention is to work with a single bam per individual as recommended in the GATK best practice variant detection in tumor-normal studies (

Therefore, the workflow I am trying to perform is the following:
    For each reads_file (lane):

- alignment using Bfast bwaaln / localalign / postprocess
- sam to bam using samtools view
- add read groups using picard AddOrReplaceReadGroups
- remove duplicates using picard RemoveDuplicates
Then, the bams of the two samples are merged:

    sample.bam <- picard MergeSamFile (tumor.lane1.bam, tumor.lane2.bam, normal.lane1.bam, normal.lane2.bam)
On this single bam (which contains reads from the 4 lanes and also 4 lines of @RG readgroup annotations in the header), I performed two more processing steps:

    realigned.bam <- GATK_realign(individual.bam)
    recal.bam <- GATK_recalibrate(realigned.bam)
Then I called the variants:

GATK_call_for_variants(recal.bam) using the GATK UnifiedGenotyper and VariantFiltration

I used the --genotypeFilterExpression argument in the GATK_VariantFiltration step to have results at sample level, so what I've obtained by using a dummy example with a few reads per sample in the resulting vcf file looks like:

#CHROM    POS        ID REF  ALT  QUAL FILTER    INFO           FORMAT                    normal                 tumor
chr1    115401136    .    G    A    40.14    PASS      (..)    GT:AD:DP:GQ:PL    1/1:0,1:1:3:29,3,0    1/1:0,2:2:3.01:44,3,0
chr2    207625567    .    T    C    32.79    PASS      (..)    GT:AD:DP:GQ:PL    ./.		     1/1:0,2:2:6.01:64,6,0
Therefore, ignoring the fact that the number of reads supporting the evidence is low in this case, could I interpret that the first SNP is a germline mutation (the same GT is present in both normal and tumor samples of the individual) whereas the second is a somatic mutation (the variant is present only in the normal sample)?

I'm quite unsure how is the best way to interpret these results, also I'm not convinced whether the workflow I've used is the optimal.

Any help/suggestion will be really appreciated!
david.tamborero is offline   Reply With Quote
Old 08-12-2011, 06:36 AM   #2
Senior Member
Location: Boston

Join Date: Feb 2008
Posts: 693

Three suggestions:

a) map your reads to:

b) use two distinct mappers (e.g. bwaaln and bwa-sw), call mutations separately and then take the intersection.

c) (optional) you may consider samtools contrast caller.
lh3 is offline   Reply With Quote
Old 08-16-2011, 05:53 AM   #3
Location: Spain

Join Date: Jul 2010
Posts: 68

I think the question was more referring to how to discover the differences between paired normal and tumor samples from one single patient.

The multi-sample vcf allows the following information:

- It lists the variant positions and changes detected in any of the two samples.
- In the columns after the FORMAT column, normal and tumor in the example, it displays the stats for each of the variants in the two samples. Obviously, there are variants that are recorded in both normal and tumor, and there are also those that are recorded in the normal sample only and not in the tumor and vice versa:
- Variants present in both "normal" and "tumoral" may be interpreted as "germline" variants that are present in all cells of the patient, tumor or not.
- Then, we can obtain a rough selection of "tumor-specific" variants (those that are recorded only in "tumor").
- Variants that appear in "normal" only, may point at deletions of this region in the tumor sample, given sufficient read coverage at the region of interest.

However, the multi-sample vcf does not give the statistical information described in FORMAT for the respective sample that does NOT contain the variant (entry is ./. ).

Maybe it would help to run the Unified Genotyper without distinguishing between the two samples, in order to get the overall genotype stats, but I am not sure about that.

Any opinion on / experience with that?
sdvie is offline   Reply With Quote
Old 01-21-2013, 04:58 AM   #4
LI Jia
Junior Member
Location: Paris

Join Date: Dec 2012
Posts: 5

There are my command lines for calling snps and indels with GATK. It did work well for me and might be useful for you.

bwa aln -t 4 ucsc.hg19 /home/li/prostate_cancer_RNA/ERR031030_1.fastq > ERR031030_1.sai

bwa aln -t 4 ucsc.hg19 /home/li/prostate_cancer_RNA/ERR031030_2.fastq > ERR031030_2.sai

bwa sampe -r "@RG\tID:ERR031030\tLB:ERR031030\tSM:ERR031030\tPL:ILLUMINA" ucsc.hg19 ERR031030_1.sai ERR031030_2.sai /home/li/prostate_cancer_RNA/ERR031030_1.fastq /home/li/prostate_cancer_RNA/ERR031030_2.fastq > ERR031030.sam

java -Xmx8g -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/SortSam.jar SO=coordinate INPUT=ERR031030.sam OUTPUT=ERR031030.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

java -Xmx8g -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/MarkDuplicates.jar INPUT=ERR031030.bam OUTPUT=ERR031030.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -o ERR031030.bam.list -I ERR031030.marked.bam
java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -I ERR031030.marked.bam -R ucsc.hg19.fasta -T IndelRealigner -targetIntervals ERR031030.bam.list -o ERR031030.marked.realigned.bam

java -Xmx8g -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/FixMateInformation.jar INPUT=ERR031030.marked.realigned.bam OUTPUT=ERR031030.marked.realigned.fixed.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R ucsc.hg19.fasta -I ERR031030.marked.realigned.fixed.bam -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -knownSites dbsnp_135.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o ERR031030.recal_data.grp

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T PrintReads -R ucsc.hg19.fasta -I ERR031030.marked.realigned.fixed.bam -BQSR ERR031030.recal_data.grp --out ERR031030.marked.realigned.fixed.recal.bam

java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I ERR031029.marked.realigned.fixed.recal.bam -I ERR031030.marked.realigned.fixed.recal.bam -D dbsnp_135.hg19.vcf -o ERR031030.raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.snpsonly.vcf -selectType SNP

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.indelsonly.vcf -selectType INDEL

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T VariantRecalibrator -R ucsc.hg19.fasta -input ERR031030.snpsonly.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resourcemni,known=false,training=true,truth=false,prior=12.0 1000G_phase1.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.hg19.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ -mode SNP -recalFile ERR031030.snp.recal.vcf -tranchesFile ERR031030.snp.tranches.vcf -rscriptFile ERR031030.plots.R

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T ApplyRecalibration -input ERR031030.snpsonly.vcf -tranchesFile ERR031030.snp.tranches.vcf -recalFile ERR031030.snp.recal.vcf -o ERR031030.snps.filtered.vcf

java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T VariantFiltration --variant ERR031030.snps.filtered.vcf -o --filterName "Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5" --filterExpression "HaplotypeScore > 13.0"

By the way, to obtain a rough selection of "tumor-specific" variants (those that are recorded only in "tumor"), what command line do you use? any suggestion?
LI Jia is offline   Reply With Quote
Old 01-21-2013, 12:47 PM   #5
Location: spain

Join Date: Feb 2011
Posts: 60


I'm far (honestly) from being an expert in this issue, but as far as I know, if you perform the mutation call for the tumor sample and for the normal sample separately, once you have the final set of mutations for each then you simply substract one from the other to get the somatic ones.

Of course, there is the option to use a variant caller which is already intended to deal with normal/tumor sample experiments. I've used Varscan, and seems pretty good for SNPs (not so much for indels), altough as I told you I'm not an expert in the field and for instance I was not able to use some of the filters that the second release of the tool made available. I also heard good things about SomaticSniper, but I did not use this one. At the end of the day, I think that this issue is far from being solved, and what many people do is to combine the results of several methods to obtain a list of highly confident somatic mutations.

Hope it helps,
david.tamborero is offline   Reply With Quote
Old 08-08-2013, 10:46 AM   #6
Senior Member
Location: USA

Join Date: Sep 2012
Posts: 130

Doing variant calls separately and then looking at the different will produce a large number of false positives. You really should be using somatic variant callers like MuTect or SomaticSniper that look at both the tumor and normal samples at the same time.
id0 is offline   Reply With Quote
Old 08-09-2013, 03:05 AM   #7
Senior Member
Location: Hong Kong

Join Date: Mar 2010
Posts: 498

Originally Posted by id0 View Post
Doing variant calls separately and then looking at the different will produce a large number of false positives. You really should be using somatic variant callers like MuTect or SomaticSniper that look at both the tumor and normal samples at the same time.
Do I still need to use two aligners as suggested by lh3?
ymc is offline   Reply With Quote

haplotype, multisample bam, somatic mutations, vcf

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 12:15 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO