![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Somatic Mutations in dbSNP | qqcandy | Bioinformatics | 14 | 07-27-2015 03:34 PM |
Samtools mpileup_Paired Tumoral / Germline_keep only somatic mutations | Sam64 | Genomic Resequencing | 2 | 02-29-2012 12:01 PM |
Option "calmd"; Reporting indels and Somatic mutations for Whole Exome Seq data: | angerusso | Bioinformatics | 0 | 01-10-2012 04:32 PM |
How to find out SNP and point mutations in RNA-Seq data using Galaxy? | sepulveda | RNA Sequencing | 0 | 12-29-2011 02:50 PM |
Any pipeline to find automatically ORF in consensus sequences? | Christopher Sauvage | Bioinformatics | 6 | 05-21-2010 06:09 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: spain Join Date: Feb 2011
Posts: 60
|
![]()
Hi, I am trying to analyze Illumina paired-end reads of a tumor/normal study with two lanes per sample (i.e. normal.lane1, normal.lane2, tumor.lane1, tumor.lane2) to assess somatic mutations. My intention is to work with a single bam per individual as recommended in the GATK best practice variant detection in tumor-normal studies (http://www.broadinstitute.org/gsa/wi...th_the_GATK_v2).
Therefore, the workflow I am trying to perform is the following: Code:
For each reads_file (lane): - alignment using Bfast bwaaln / localalign / postprocess - sam to bam using samtools view - add read groups using picard AddOrReplaceReadGroups - remove duplicates using picard RemoveDuplicates Code:
sample.bam <- picard MergeSamFile (tumor.lane1.bam, tumor.lane2.bam, normal.lane1.bam, normal.lane2.bam) Code:
realigned.bam <- GATK_realign(individual.bam) recal.bam <- GATK_recalibrate(realigned.bam) Code:
GATK_call_for_variants(recal.bam) using the GATK UnifiedGenotyper and VariantFiltration I used the --genotypeFilterExpression argument in the GATK_VariantFiltration step to have results at sample level, so what I've obtained by using a dummy example with a few reads per sample in the resulting vcf file looks like: Code:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumor chr1 115401136 . G A 40.14 PASS (..) GT:AD:DP:GQ:PL 1/1:0,1:1:3:29,3,0 1/1:0,2:2:3.01:44,3,0 chr2 207625567 . T C 32.79 PASS (..) GT:AD:DP:GQ:PL ./. 1/1:0,2:2:6.01:64,6,0 I'm quite unsure how is the best way to interpret these results, also I'm not convinced whether the workflow I've used is the optimal. Any help/suggestion will be really appreciated! cheers David |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Boston Join Date: Feb 2008
Posts: 693
|
![]()
Three suggestions:
a) map your reads to: ftp://ftp-trace.ncbi.nih.gov/1000gen...mbly_sequence/ b) use two distinct mappers (e.g. bwaaln and bwa-sw), call mutations separately and then take the intersection. c) (optional) you may consider samtools contrast caller. |
![]() |
![]() |
![]() |
#3 |
Member
Location: Spain Join Date: Jul 2010
Posts: 68
|
![]()
I think the question was more referring to how to discover the differences between paired normal and tumor samples from one single patient.
The multi-sample vcf allows the following information: - It lists the variant positions and changes detected in any of the two samples. - In the columns after the FORMAT column, normal and tumor in the example, it displays the stats for each of the variants in the two samples. Obviously, there are variants that are recorded in both normal and tumor, and there are also those that are recorded in the normal sample only and not in the tumor and vice versa: - Variants present in both "normal" and "tumoral" may be interpreted as "germline" variants that are present in all cells of the patient, tumor or not. - Then, we can obtain a rough selection of "tumor-specific" variants (those that are recorded only in "tumor"). - Variants that appear in "normal" only, may point at deletions of this region in the tumor sample, given sufficient read coverage at the region of interest. However, the multi-sample vcf does not give the statistical information described in FORMAT for the respective sample that does NOT contain the variant (entry is ./. ). Maybe it would help to run the Unified Genotyper without distinguishing between the two samples, in order to get the overall genotype stats, but I am not sure about that. Any opinion on / experience with that? |
![]() |
![]() |
![]() |
#4 |
Junior Member
Location: Paris Join Date: Dec 2012
Posts: 5
|
![]()
There are my command lines for calling snps and indels with GATK. It did work well for me and might be useful for you.
bwa aln -t 4 ucsc.hg19 /home/li/prostate_cancer_RNA/ERR031030_1.fastq > ERR031030_1.sai bwa aln -t 4 ucsc.hg19 /home/li/prostate_cancer_RNA/ERR031030_2.fastq > ERR031030_2.sai bwa sampe -r "@RG\tID:ERR031030\tLB:ERR031030\tSM:ERR031030\tPL:ILLUMINA" ucsc.hg19 ERR031030_1.sai ERR031030_2.sai /home/li/prostate_cancer_RNA/ERR031030_1.fastq /home/li/prostate_cancer_RNA/ERR031030_2.fastq > ERR031030.sam java -Xmx8g -Djava.io.tmpdir=/tmp -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/SortSam.jar SO=coordinate INPUT=ERR031030.sam OUTPUT=ERR031030.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true java -Xmx8g -Djava.io.tmpdir=/tmp -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/MarkDuplicates.jar INPUT=ERR031030.bam OUTPUT=ERR031030.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -o ERR031030.bam.list -I ERR031030.marked.bam java -Xmx8g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -I ERR031030.marked.bam -R ucsc.hg19.fasta -T IndelRealigner -targetIntervals ERR031030.bam.list -o ERR031030.marked.realigned.bam java -Xmx8g -Djava.io.tmpdir=/tmp -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/FixMateInformation.jar INPUT=ERR031030.marked.realigned.bam OUTPUT=ERR031030.marked.realigned.fixed.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R ucsc.hg19.fasta -I ERR031030.marked.realigned.fixed.bam -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -knownSites dbsnp_135.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o ERR031030.recal_data.grp java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T PrintReads -R ucsc.hg19.fasta -I ERR031030.marked.realigned.fixed.bam -BQSR ERR031030.recal_data.grp --out ERR031030.marked.realigned.fixed.recal.bam java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I ERR031029.marked.realigned.fixed.recal.bam -I ERR031030.marked.realigned.fixed.recal.bam -D dbsnp_135.hg19.vcf -o ERR031030.raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.snpsonly.vcf -selectType SNP java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.indelsonly.vcf -selectType INDEL java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T VariantRecalibrator -R ucsc.hg19.fasta -input ERR031030.snpsonly.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resource ![]() java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T ApplyRecalibration -input ERR031030.snpsonly.vcf -tranchesFile ERR031030.snp.tranches.vcf -recalFile ERR031030.snp.recal.vcf -o ERR031030.snps.filtered.vcf java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T VariantFiltration --variant ERR031030.snps.filtered.vcf -o ERR031030.final.filtered.vcf --filterName "Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5" --filterExpression "HaplotypeScore > 13.0" By the way, to obtain a rough selection of "tumor-specific" variants (those that are recorded only in "tumor"), what command line do you use? any suggestion? |
![]() |
![]() |
![]() |
#5 |
Member
Location: spain Join Date: Feb 2011
Posts: 60
|
![]()
Hi,
I'm far (honestly) from being an expert in this issue, but as far as I know, if you perform the mutation call for the tumor sample and for the normal sample separately, once you have the final set of mutations for each then you simply substract one from the other to get the somatic ones. Of course, there is the option to use a variant caller which is already intended to deal with normal/tumor sample experiments. I've used Varscan, and seems pretty good for SNPs (not so much for indels), altough as I told you I'm not an expert in the field and for instance I was not able to use some of the filters that the second release of the tool made available. I also heard good things about SomaticSniper, but I did not use this one. At the end of the day, I think that this issue is far from being solved, and what many people do is to combine the results of several methods to obtain a list of highly confident somatic mutations. Hope it helps, best David |
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: USA Join Date: Sep 2012
Posts: 130
|
![]()
Doing variant calls separately and then looking at the different will produce a large number of false positives. You really should be using somatic variant callers like MuTect or SomaticSniper that look at both the tumor and normal samples at the same time.
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
Do I still need to use two aligners as suggested by lh3?
|
![]() |
![]() |
![]() |
Tags |
haplotype, multisample bam, somatic mutations, vcf |
Thread Tools | |
|
|