SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
De novo SNP calling in absence of complete reference assembly fcr De novo discovery 15 09-21-2012 03:34 AM
RNA-seq SNP-calling without a complete reference shoegame2001 RNA Sequencing 6 07-04-2012 01:55 AM
GATK snp calling wanguan2000 Bioinformatics 0 11-24-2011 09:23 PM
GATK UnifiedGenotyper SNP calling with Agilent 50Mb target enrichment nullalleles Genomic Resequencing 1 07-18-2011 07:28 AM
What cut off does GATK snp calling use for mapping quality score? foxyg Bioinformatics 0 09-25-2010 12:19 PM

Reply
 
Thread Tools
Old 04-27-2012, 05:15 AM   #1
thedamian
Member
 
Location: Barcelona

Join Date: Feb 2012
Posts: 49
Default GATK SNP calling complete workflow

Hi All,
I wolud like to consult my GATK workflow for a pair end Illumina data.
Generally I'm calling SNPs using following steps:

Code:
bwa aln -t 4 hg19.fa seq1.fastq > 1.sai
bwa aln -t 4 hg19.fa seq2.fastq > 2.sai
bwa sampe -r "@RG\tID:exomeID\tLB:exomeLB\tSM:exomeSM\tPL:illumina\tPU:exomePU" hg19.fa 1.sai 2.sai seq1.fastq seq2.fastq > original.sam

java -Xmx5g -jar FixMateInformation.jar I=original.sam O=fixed.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT
java -Xmx5g -jar SortSam.jar I=fixed.sam SO=coordinate O=first.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
java -Xmx5g -jar MarkDuplicates.jar I=first.bam O=marked.bam METRICS_FILE=metricsFile CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true

java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T RealignerTargetCreator -R hg19.fa -o intervalsList -I marked.bam -known Mills_and_1000G_gold_standard.indels.hg19.vcf
java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T IndelRealigner -R hg19.fa -I marked.bam -targetIntervals intervalsList -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o realigned.bam
java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T CountCovariates -l INFO -R hg19.fa -I realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile recalFile -knownSites dbsnp_135.hg19.vcf
java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T TableRecalibration -R hg19.fa -I realigned.bam -o recalibrated.bam -recalFile recalFile
java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T UnifiedGenotyper -R hg19.fa -I recalibrated.bam -o resultSNPs.vcf -D dbsnp_135.hg19.vcf -metrics UniGenMetrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A DepthOfCoverage -A AlleleBalance -L exomes.bed
The SNP and indels databases I've downloaded from ftp://gsapubftp-anonymous@ftp.broadinstitute.org (bunlde -> 1.5 -> hg19)

The exome intervals I've gained using UCSC Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start

I'm not sure if I'm doing it in a correct way. Is my way compact enough? What about yours workflows? Are your steps look simillar? Should I use more GATK subprograms to obtain more accurate results?

Thanks in advance for suggestions
thedamian is offline   Reply With Quote
Old 04-27-2012, 05:58 AM   #2
Rocketknight
Member
 
Location: Ireland

Join Date: Sep 2011
Posts: 86
Default

The Mills and 1000G gold standard indels are a very stringently curated list of indels. For the purposes of indel realignment in GATK you probably want one or more of the broader sets from the GATK bundle instead, though which ones I'm not completely sure.

Overall though, your workflow looks fine. Bear in mind though, that it will only call SNPs and not indels (but since you're calling your output file resultSNPs I assume you know that already).
Rocketknight is offline   Reply With Quote
Old 05-08-2012, 07:06 AM   #3
thedamian
Member
 
Location: Barcelona

Join Date: Feb 2012
Posts: 49
Default

Is the order of funtions important?
Does this:
Code:
FixMateInformation
SortSam
MarkDuplicates
RealignerTargetCreator
IndelRealigner
mean the same as this:
Code:
SortSam
FixMateInformation
RealignerTargetCreator
IndelRealigner 
MarkDuplicates
thedamian is offline   Reply With Quote
Old 05-29-2012, 08:03 AM   #4
thedamian
Member
 
Location: Barcelona

Join Date: Feb 2012
Posts: 49
Default

Rocketknight indeed I didn't want indels, but now I need it. Do you know which option should I use to obtain a final file with SNPs and Indels?
thedamian is offline   Reply With Quote
Old 05-29-2012, 09:07 AM   #5
Rocketknight
Member
 
Location: Ireland

Join Date: Sep 2011
Posts: 86
Default

Yep, the only line you need to change is UnifiedGenotyper. Add the option -glm BOTH and the output VCF file will have SNPs and indels in it. Note that if you want to do Variant Quality Score Recalibration you'll have to do some things differently if you're recalibrating indels too.
Rocketknight is offline   Reply With Quote
Old 05-30-2012, 01:20 AM   #6
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Some indel positions are off in dbsnp135_b37.vcf. How do u deal with them??
ymc is offline   Reply With Quote
Old 10-07-2013, 12:47 PM   #7
vd4mindia
Member
 
Location: Milan

Join Date: May 2013
Posts: 40
Default

"The exome intervals I've gained using UCSC Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start"
Can you tell me how to get the exome intervals as you have referred
vd4mindia is offline   Reply With Quote
Old 10-07-2013, 02:49 PM   #8
thedamian
Member
 
Location: Barcelona

Join Date: Feb 2012
Posts: 49
Default

Quote:
Originally Posted by vd4mindia View Post
"The exome intervals I've gained using UCSC Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start"
Can you tell me how to get the exome intervals as you have referred
set these paramaters: http://pokazywarka.pl/1ve6r7/ and then click "get output". Adjust your parameters and then click "get BED".
thedamian is offline   Reply With Quote
Old 10-09-2013, 01:47 AM   #9
vd4mindia
Member
 
Location: Milan

Join Date: May 2013
Posts: 40
Default

Hi I am facing some problems with the Local realignment step around the indels, I am using the marked bam file after the PCR duplicate marking step but I am getting the following error
##### ERROR MESSAGE: Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.

I donot see such errors in the forums. Is it that after marknig the PCR duplicates the bam files needs to be sorted and indexed again? or just indexing and running the target realigner step will be sufficient? Am using the GATK version 2.3 am sure its not a serious problem but am not being able to troubleshoot it. I am trying to get a bit stringent in this local indel realignment so for more specificity am actually using the DBSNP137.vcf , 1000G indel vcf and the Mills and 1000 G gold standard vcf. Please guide me about this. I am using the GATK for the first time and trying to develop a pipeline. Below is the command line am using for this step

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -I SRR062634_marked.sorted.bam -T RealignerTargetCreator -known /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -known /scratch/GT/vdas/test_exome/exome/databases/1000G_phase1.indels.hg19.vcf -known /scratch/GT/vdas/test_exome/exome/databases/Mills_and_1000G_gold_standard.indels.hg19.vcf -o SRR062634_marked.sorted.bam.inervals -S LENIENT
vd4mindia is offline   Reply With Quote
Old 10-09-2013, 02:06 AM   #10
thedamian
Member
 
Location: Barcelona

Join Date: Feb 2012
Posts: 49
Default

@vd4mindia: I don't know, maybe try to index it. You shoud have in your directory files SRR062634_marked.sorted.bam and SRR062634_marked.sorted.bai.
If this won't help I'm not able to solve this issue.
thedamian is offline   Reply With Quote
Old 10-09-2013, 02:50 AM   #11
choishingwan
Member
 
Location: Hong Kong

Join Date: Feb 2012
Posts: 21
Default

I thought fixMate is no longer required by the GATK pipeline? And if you want to perform the fixMate then sortSam, you can run FixMate with the SO options in SortSam, then your output will be automatically sorted in one go. It is also a good idea to always include Create_Index=TRUE in the pipeline to make sure you have the index for every steps just to be safe
choishingwan is offline   Reply With Quote
Old 10-09-2013, 02:59 AM   #12
vd4mindia
Member
 
Location: Milan

Join Date: May 2013
Posts: 40
Default

@thedamian,

Yes I fixed it , I indexed it again and then made the call and the local realignment step is running now. But i need some inputs for the base quality recalibration step. In the above thread this step is written as below
java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T CountCovariates -l INFO -R hg19.fa -I realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile recalFile -knownSites dbsnp_135.hg19.vcf

Since during my Realigner Target Creator step I used dbSNP_137.vcf for known sites I should be using the same as well for the Base quality step right?

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -I SRR062634_marked.sorted.bam -T RealignerTargetCreator -known /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -known /scratch/GT/vdas/test_exome/exome/databases/1000G_phase1.indels.hg19.vcf -known /scratch/GT/vdas/test_exome/exome/databases/Mills_and_1000G_gold_standard.indels.hg19.vcf -o SRR062634_marked.sorted.bam.inervals -S LENIENT

As you can see the known sites /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf

So my script for this step should be

For the Base calibrator( i see some place its written count covariates, which should I do) any suggestion but am following base calibrator below is the command

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -nt 8 -T BaseRecalibrator -R /scratch/GT/vdas/test_exome/exome/hg19.fa -knownSites /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -I SRR062634.realigned.bam -o SRR062634.realigned.recal.csv -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -S LENIENT

for table recalibration

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -T TableRecalibration -R /scratch/GT/vdas/test_exome/exome/hg19.fa -I SRR062634.realigned.bam -o SRR062634.realigned.bam.recal.bam -BQSR SRR062634.realigned.recal.csv -S LENIENT

Please let me know if this looks correct or not?
vd4mindia is offline   Reply With Quote
Old 10-09-2013, 03:02 AM   #13
vd4mindia
Member
 
Location: Milan

Join Date: May 2013
Posts: 40
Default

Yes @choishingwan I checked and the fixmate step is retired and am not using it but I have another query in the above line please let me know and share your views
vd4mindia is offline   Reply With Quote
Old 10-09-2013, 03:25 AM   #14
choishingwan
Member
 
Location: Hong Kong

Join Date: Feb 2012
Posts: 21
Default

@vd4mindia: From your code, I guess you are running GATK version before 2.6 right? You are right about using the dbsnp137. From the look of it it seems correct. In case if you want to use the latest gatk, you can refer to the following site: http://gatkforums.broadinstitute.org...se-2-0-retired

Personally, I use the following code if it helps:

Quote:
samtools sort <sample>.bwa.bam <sample>.bwa.sort

#Index
samtools index <sample>.bwa.sort.bam

#mark duplicate
java -Xmx5g -jar MarkDuplicates.jar INPUT=<sample>.bwa.sort.bam OUTPUT=<sample>.bwa.sort.deduped.bam METRICS_FILE=<sample>.duplicates REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE


#Realignment based on known insert sites (Using Java 1.7 from now on as required by GATK)
java -Xmx5g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R Reference.fa -I <sample>.bwa.sort.deduped.arg.bam -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o <sample>.realign.intervals -S LENIENT

java -Xmx5g -jar GenomeAnalysisTK.jar -T IndelRealigner -R Reference.fa -I <sample>.bwa.sort.deduped.arg.bam -targetIntervals <sample>.realign.intervals -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o <sample>.bwa.sort.deduped.arg.realigned.bam -S LENIENT

java -Xmx5g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Reference.fa -l INFO -I <sample>.bwa.sort.deduped.arg.realigned.bam -knownSites 1000G_phase1.indels.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites dbsnp_137.hg19.vcf -o <sample>.recalibration_report.grp -S LENIENT

java -Xmx5g -jar GenomeAnalysisTK.jar -T PrintReads -R Reference.fa -l INFO -I <sample>.bwa.sort.deduped.arg.realigned.bam -BQSR <sample>.recalibration_report.grp -o <sample>.bwa.sort.deduped.arg.realigned.recalibrated.bam -S LENIENT
This is how we do the analysis here in our lab, not necessarily the correct way. We did use the latest GATK though, so some of the code might differ from yours.
choishingwan is offline   Reply With Quote
Old 10-09-2013, 03:31 AM   #15
vd4mindia
Member
 
Location: Milan

Join Date: May 2013
Posts: 40
Default

@thedamain,

I need some information about the exomes.bed file which you are using. I see that most of the pipelines say that they generate it from the UCSC genome browser to restrict the output to exonic sequences, is there any specific cut off that must be used to generate this bed file for the hg19 ref gene or a simple selection will do? or do i have to specify some extra bp addition for both end to catch the splice sites? Please let me know. On what criteria is the bed file selected or it is just a representation bed file of the whole ref genome restricted only to the exonic regions? Please let me know
vd4mindia is offline   Reply With Quote
Old 10-09-2013, 03:40 AM   #16
vd4mindia
Member
 
Location: Milan

Join Date: May 2013
Posts: 40
Default

@choishingwan ,

Thanks for your prompt reply. I would like to tell you that I am using the version 2.3 not the 2.7 one so is it wrong to use the DBSNP_137.vcf file with that? I hope this should not be a problem for the GATK toolkit to work with the latest DBSNP file right as this is just the upated data set also in a step I see you have used the print reads command and it is usually done when you merge all you samples together for and generate a single bam file and then try to call the SNP and INDEL but since am just developing a new pipeline with a test sample from the 1000 G project I have only one sample ( paired-end) where am trying to understand each step of the pipeline and so that once I have my samples I can use that pipeline on them. Please can you clarify the the doubts about the print reads step or can I use the below script when am not merging all the samples together and use Table Recalibrator

ava -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -T TableRecalibration -R /scratch/GT/vdas/test_exome/exome/hg19.fa -I SRR062634.realigned.bam -o SRR062634.realigned.bam.recal.bam -BQSR SRR062634.realigned.recal.csv -S LENIENT
vd4mindia is offline   Reply With Quote
Old 10-09-2013, 04:20 AM   #17
choishingwan
Member
 
Location: Hong Kong

Join Date: Feb 2012
Posts: 21
Default

There's definitely no problem in using the latest dbsnp file with the old gatk pipeline. As for the print read problem, I think that's when the version different take place. In my case, in the latest gatk program, the recalibrate will not produce a new bam file as with 2.3, instead it produce a file to use for printread to print the recalibrated file. As for your question regarding merging the file. From my experience, it is unnecessary as you can input multiple files for unifiedgenotyper using multiple -I. Though I think it is advised to call all samples together ( not sure if that's the case if the capture kit are different / the sequencing platform for each sample is different, hope someone can answer that for me)


Side note, if you are doing exome sequencing, you should get the bed file from the company stating the capture area and you won't need to make one yourself.
choishingwan is offline   Reply With Quote
Old 10-09-2013, 04:34 AM   #18
vd4mindia
Member
 
Location: Milan

Join Date: May 2013
Posts: 40
Default

@choishingwan

I guess then I can use the option of TableRecalibration, I am yet to run it , if it produces a bam file good enough else I will use the print read option which will surely produce a bam file. As for multiple sample with merged bam file, I will be getting my samples say (4 samples are sequenced in one lane) then I guess I might have to use the merged bam file concept or might simply work with individual samples and during the unified genotyper I can use all of them together (what do you think). My sample for project will be 2 Tumor and 2 IPSC from the same 2 tumors. So it is 2 tumors from 2 different patients and their respective IPSC lines and I will try to work it out to find the SNP INDELs SNVs that will confer that the mutations between tumor and IPSC are same so that they have same genetic background but then in that case during the SNP INDEL calling using unified genotyper is it advisable to use all the recaliberated bam files from 4 samples to call together and get the SNP and INDELS right? Any thoughts on this?

Bedfile query:
This is what I was worried about regarding the bed file, as for the test analysis I am doing I can recreate a bed file myself from the genome browser for testing my analysis? Let me know your thoughts on this.
vd4mindia is offline   Reply With Quote
Old 10-09-2013, 06:41 AM   #19
choishingwan
Member
 
Location: Hong Kong

Join Date: Feb 2012
Posts: 21
Default

From my memory, you should be able to get a bam file from the TableRecalibration. From what I understand, you only have to merge samples (as in, really merge samples into one single bam file) when you sequence the same sample on multiple lane. You can refer to the following:
http://gatkforums.broadinstitute.org...he-same-sample

And also the best practice link that I have previously attached. According to my labmates, they said merging the bam file made it difficult to analyse downstream as it will be a huge file (e.g. 4 6GB bam merged will become ~24GB). All in all, I'd really recommend you reading the best practice guide from GATK, they cover most stuff you need to know.

As for the bed file, it is just for removing SNPs in areas that shouldn't be captured by the kit. So most of the time, you should focus only on the capture areas. However, if you don't have that information and couldn't get it from the people, then you can consider using only the exome part from the UCSC. I have never tried that but I guess you can, my recommendation is to always get the capture information from the people and make sure you know what you are working with.
choishingwan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:39 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO