**fongchun** · 09-06-2012, 08:19 PM

Originally posted by Jane M View Post

Not yet...

I think I might have figured out at least part of your question with regard to the file recal_data.grp. If you look at the GATK methods and workflow page under the "Base Quality Score Recalibrator" section, it shows the recal_data.grp being used as part of the -BQSR parameter:

Code:

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

Interesting thing is the documentation for the PrintReads program doesn't include the -BQSR parameter...

**AJERYC** · 09-06-2012, 10:13 PM

Originally posted by Jane M View Post

Thank you AJERYC!

Because of some troubles with my version of dbSNP, I haven't managed to run:

but I am still wondering if I should run the PrintReads step since I only have one bam file and if my recalibrated bam file will be the recal_data.grp file. Any idea?

I'm not sure if we are running the same version of GATK. For Quality score recalibration I use the following instructions

java -Xmx16G -jar gatk/GenomeAnalysisTK.jar \
   -I input.marked.realigned.fixed.bam \
   -R hg19/hg19.fa \
   -T CountCovariates \
   -cov ReadGroupCovariate \
   -cov QualityScoreCovariate \
   -cov CycleCovariate \
   -cov DinucCovariate \
   -recalFile input.recal_data.csv \
   -knownSites:dbsnp,VCF dbsnp135.hg19.vcf

java -Xmx16G -jar gatk/GenomeAnalysisTK.jar \
   -l INFO \
   -R hg19.fa \
   -I input.marked.realigned.fixed.bam \
   -T TableRecalibration \
   --out input.marked.realigned.fixed.recal.bam \
   -recalFile input.recal_data.csv

You can see I get 2 files: one is the bam file and the other is the recal_data file (which you get from the first instruction). Maybe you are missing the second instruction, and that is why you don't get the bam file.

**Jane M** · 09-07-2012, 01:58 AM

Originally posted by AJERYC View Post

I'm not sure if we are running the same version of GATK. For Quality score recalibration I use the following instructions

java -Xmx16G -jar gatk/GenomeAnalysisTK.jar \
   -I input.marked.realigned.fixed.bam \
   -R hg19/hg19.fa \
   -T CountCovariates \
   -cov ReadGroupCovariate \
   -cov QualityScoreCovariate \
   -cov CycleCovariate \
   -cov DinucCovariate \
   -recalFile input.recal_data.csv \
   -knownSites:dbsnp,VCF dbsnp135.hg19.vcf

java -Xmx16G -jar gatk/GenomeAnalysisTK.jar \
   -l INFO \
   -R hg19.fa \
   -I input.marked.realigned.fixed.bam \
   -T TableRecalibration \
   --out input.marked.realigned.fixed.recal.bam \
   -recalFile input.recal_data.csv

You can see I get 2 files: one is the bam file and the other is the recal_data file (which you get from the first instruction). Maybe you are missing the second instruction, and that is why you don't get the bam file.

The point is that we are not using the same version. You probably have a version before v2.0 and I have a version after 2.0. As of version 2.0, CountCovariates and TableRecalibration no longer exist. That's a pity because the process was rather clear: the csv file generated at the CountCovariates step is then used at the TableRecalibration step...

**Jane M** · 09-07-2012, 02:14 AM

Originally posted by fongchun View Post

I think I might have figured out at least part of your question with regard to the file recal_data.grp. If you look at the GATK methods and workflow page under the "Base Quality Score Recalibrator" section, it shows the recal_data.grp being used as part of the -BQSR parameter:

Code:

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

Interesting thing is the documentation for the PrintReads program doesn't include the -BQSR parameter...

Ah, interesting. I had only noticed this information about PrintReads (http://www.broadinstitute.org/gatk/g...ntReads.html):

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T PrintReads \
-o output.bam \
-I input1.bam \
-I input2.bam \
--read_filter MappingQualityZero

I hadn't checked the page you suggested. There it's much clearer:

java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fasta \
-I input.bam \
-BQSR recalibration_report.grp \
-o output.bam

The grp file is used and there is an output bam file.

Thanks fongchun!
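
For readers on GATK 2.x, the posts above suggest that recalibration is now a two-step run: one walker writes the recal_data.grp report, and PrintReads applies it through -BQSR to produce the recalibrated bam. Below is a minimal dry-run sketch of that flow. It assumes the GATK 2.x walker name BaseRecalibrator for the first step, and every file name (reference.fasta, input.bam, dbsnp.vcf, output.recal.bam) is a placeholder rather than a file from this thread. The function only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Dry-run sketch: prints the two GATK 2.x recalibration commands, runs nothing.
# All file names are placeholders (assumptions), not files from this thread.
GATK_JAR="GenomeAnalysisTK.jar"

bqsr_commands() {
  # $1 = reference FASTA, $2 = input BAM, $3 = known-sites VCF, $4 = output BAM
  # Step 1: model the covariates and write the recalibration report (.grp)
  echo "java -jar $GATK_JAR -T BaseRecalibrator -R $1 -I $2 -knownSites $3 -o recal_data.grp"
  # Step 2: apply the report while rewriting the reads into a new BAM
  echo "java -jar $GATK_JAR -T PrintReads -R $1 -I $2 -BQSR recal_data.grp -o $4"
}

bqsr_commands reference.fasta input.bam dbsnp.vcf output.recal.bam
```

So the .grp file plays the role the csv file had before 2.0, and the recalibrated bam only appears after the second command.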

**rahilsethi** · 09-07-2012, 07:01 AM

GATK -dcov option???

I have an additional question following raonyguimaraes's post.
Does anyone know the details of the GATK -dcov option in UnifiedGenotyper? I looked in the GATK manual but could not find much about it other than the following information:
-dcov [50 for 4x, 200 for >30x WGS or Whole exome]
at this link:

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html

Also, if it is not specified, what default value does this option take?

If anyone knows about it, could you please send me a link to the information?

Thanks in advance
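
As a small illustration of the only guidance quoted above (50 for roughly 4x data, 200 for >30x WGS or whole exome), here is a dry-run sketch that picks a -dcov value and prints a UnifiedGenotyper command. The helper names pick_dcov and ug_command, the "low"/"high" labels, and the file names are all hypothetical. It does not answer the question about the default value, which is best checked in the documentation for the installed GATK version.

```shell
#!/bin/sh
# Sketch only: maps a coverage class to the -dcov values quoted from the docs above.
# pick_dcov, ug_command and the "low"/"high" labels are hypothetical helpers.
pick_dcov() {
  case "$1" in
    low)  echo 50 ;;    # ~4x data
    high) echo 200 ;;   # >30x WGS or whole exome
    *)    echo "unknown coverage class: $1" >&2; return 1 ;;
  esac
}

ug_command() {
  # $1 = coverage class, $2 = reference FASTA, $3 = input BAM, $4 = output VCF
  dcov="$(pick_dcov "$1")" || return 1
  # Printed, not executed, so the command can be inspected first
  echo "java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R $2 -I $3 -dcov $dcov -o $4"
}

ug_command high reference.fasta input.bam output.vcf
```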

**Jane M** · 09-11-2012, 08:21 AM

I am wondering if the variant quality score recalibration step, after variant calling, is still in use. If I remember correctly, I read somewhere that it is no longer performed. In addition, the publications I have read recently do not mention this step. Do you know why it has been abandoned?
Or what was the point in the first place of recalibrating variant quality after variant calling, given that base quality score recalibration was already done before variant calling?

**Jane M** · 09-14-2012, 07:05 AM

Originally posted by Jane M View Post

I am wondering if the variant quality score recalibration step, after variant calling, is still in use. If I remember correctly, I read somewhere that it is no longer performed. In addition, the publications I have read recently do not mention this step. Do you know why it has been abandoned?

Any suggestion?

**sdvie** · 09-16-2012, 11:39 PM

Originally posted by rahilsethi View Post

I have an additional question following raonyguimaraes's post.
Does anyone know the details of the GATK -dcov option in UnifiedGenotyper? I looked in the GATK manual but could not find much about it other than the following information:
-dcov [50 for 4x, 200 for >30x WGS or Whole exome]
at this link:

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html

Also, if it is not specified, what default value does this option take?

If anyone knows about it, could you please send me a link to the information?

Thanks in advance

We had some discussion on this in the GATK forum here. Maybe that is of interest to you.

cheers,
Sophia

**Jane M** · 09-17-2012, 01:19 PM

Concerning the sam to bam conversion and PCR duplicate removal steps, is there any reason to prefer Picard to samtools?
I tried SortSam from Picard and it seems to take much longer than samtools view + samtools sort.
I think I will use samtools, but I would like to know if there are advantages to using Picard.
Thank you
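
To make the comparison concrete, here is a dry-run sketch that prints the two routes side by side. It assumes samtools 1.x option syntax (older 0.1.x releases take an output prefix for sort instead of -o) and Picard's per-tool jars; the function names and the "sample" prefix are placeholders. One difference that is visible even at this level: samtools rmdup deletes duplicate reads, while Picard MarkDuplicates by default only flags them and writes a metrics file.

```shell
#!/bin/sh
# Dry-run sketch: prints both routes for SAM -> sorted BAM -> duplicate handling.
# Assumes samtools 1.x option syntax and Picard's per-tool jars; names are placeholders.
samtools_route() {
  # $1 = sample prefix
  echo "samtools view -b $1.sam | samtools sort -o $1.sorted.bam -"
  # rmdup removes duplicate reads from the output
  echo "samtools rmdup $1.sorted.bam $1.rmdup.bam"
}

picard_route() {
  # $1 = sample prefix
  echo "java -jar SortSam.jar INPUT=$1.sam OUTPUT=$1.sorted.bam SORT_ORDER=coordinate"
  # MarkDuplicates flags duplicates by default instead of deleting them
  echo "java -jar MarkDuplicates.jar INPUT=$1.sorted.bam OUTPUT=$1.marked.bam METRICS_FILE=$1.dup_metrics.txt"
}

samtools_route sample
picard_route sample
```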

**ddaneels** · 09-20-2012, 04:26 AM

I get the following error when using GATK to perform local realignment around indels.

Does anyone have an idea what went wrong?

Code:

E:\EXOME DATA ANALYSIS\1 Unzipped fastq>java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fa -o Ot2363.bam.list -I Ot2363.marked.bam
INFO  13:45:07,701 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:45:07,710 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-9-gb90951c, Compiled 2012/09/19 21:18:53
INFO  13:45:07,710 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  13:45:07,710 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  13:45:07,712 HelpFormatter - Program Args: -T RealignerTargetCreator -R hg19.fa -o Ot2363.bam.list -I Ot2363.marked.bam
INFO  13:45:07,712 HelpFormatter - Date/Time: 2012/09/20 13:45:07
INFO  13:45:07,712 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:45:07,713 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:45:07,720 GenomeAnalysisEngine - Strictness is SILENT
INFO  13:45:07,723 ReferenceDataSource - Index file E:\EXOME DATA ANALYSIS\1 Unzipped fastq\hg19.fa.fai does not exist. Trying to create it now.
PROGRESS UPDATE: file is 15 percent complete
PROGRESS UPDATE: file is 28 percent complete
PROGRESS UPDATE: file is 39 percent complete
PROGRESS UPDATE: file is 54 percent complete
PROGRESS UPDATE: file is 67 percent complete
PROGRESS UPDATE: file is 77 percent complete
PROGRESS UPDATE: file is 89 percent complete
PROGRESS UPDATE: file is 99 percent complete
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.1-9-gb90951c):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Couldn't write file E:\EXOME DATA ANALYSIS\1 Unzipped fastq\hg19.fa.fai because exception The process cannot access the file because another process has locked a portion of the file
##### ERROR ------------------------------------------------------------------------------------------

**AJERYC** · 09-20-2012, 06:08 AM

Originally posted by ddaneels View Post

I get the following error when using GATK to perform local realignment around indels.

Does anyone have an idea what went wrong?

I think the error is here:
Couldn't write file E:\EXOME DATA ANALYSIS\1 Unzipped fastq\hg19.fa.fai
Check the write permissions on that directory, the free hard disk space, and whether another process is holding the file...
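
The log shows GATK trying to create hg19.fa.fai itself and failing while writing it. One option that may sidestep this is to build the index ahead of time, so that GATK only has to read it. The sketch below assumes a POSIX shell (the session above is a Windows prompt, so it would need something like Cygwin) and samtools on the PATH. check_ref_dir is a hypothetical helper, and the samtools command is only printed, not run.

```shell
#!/bin/sh
# Sketch: check the reference directory before letting any tool write an index there.
# check_ref_dir is a hypothetical helper; the samtools command is only printed.
check_ref_dir() {
  # $1 = path to the reference FASTA
  dir="$(dirname "$1")"
  if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
    echo "not-writable"
  elif [ -e "$1.fai" ]; then
    echo "index-exists"
  else
    echo "needs-index"
  fi
}

ref="hg19.fa"
status="$(check_ref_dir "$ref")"
echo "$ref: $status"
if [ "$status" = "needs-index" ]; then
  # Building the .fai up front means GATK does not have to create it
  echo "samtools faidx \"$ref\""
fi
```

If the status is "index-exists" after a failed run, the leftover .fai may be incomplete, and deleting it before rebuilding is probably the safer choice.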

**wwhlazio** · 10-19-2012, 05:19 AM

Have you got any answer to this issue?

Thanks!

Wen

Originally posted by blackgore View Post

In following the workflow mentioned above, I've come up against an error, and I'm wondering if I'm alone in this. Has anyone experienced difficulty using the CountCovariates tool, specifically with errors about accessing information from the input BAM file? I've tried this with several samples, but keep getting the same error, "Bad input: Could not find any usable data in the input BAM file(s)"

(for those interested, the BAM files in question are not empty, and work just fine with samtools view).

Code:

java -Xmx16g -jar /$Software/GenomeAnalysisTK-1.3-17-gc62082b/GenomeAnalysisTK.jar -T CountCovariates -R /$Genomes/Broad/Human/b37/human_g1k_v37.fasta -I $Projects/data/SampleA_bowtie.gatk.realign.bam -nt 8 -l INFO -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -log RECAL.log -recalFile RECAL.csv --knownSites $Genomes/Broad/Human/b37/dbsnp_132.b37.vcf

INFO  14:01:25,870 HelpFormatter - ---------------------------------------------------------------------------------
INFO  14:01:25,875 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.3-17-gc62082b, Compiled 2011/11/18 15:24:46
INFO  14:01:25,875 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  14:01:25,876 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO  14:01:25,876 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO  14:01:25,877 HelpFormatter - Program Args: -T CountCovariates -R /$Genomes/Broad/Human/b37/human_g1k_v37.fasta -I $Projects/data/SampleA_bowtie.gatk.realign.bam -nt 8 -l INFO -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -log RECAL.log -recalFile RECAL.csv --knownSites $Genomes/Broad/Human/b37/dbsnp_132.b37.vcf
INFO  14:01:25,878 HelpFormatter - Date/Time: 2011/11/24 14:01:25
INFO  14:01:25,878 HelpFormatter - ---------------------------------------------------------------------------------
INFO  14:01:25,878 HelpFormatter - ---------------------------------------------------------------------------------
INFO  14:01:26,052 RodBindingArgumentTypeDescriptor - Dynamically determined type of $Genomes/Broad/Human/b37/dbsnp_132.b37.vcf to be VCF
INFO  14:01:26,064 GenomeAnalysisEngine - Strictness is SILENT
INFO  14:01:26,815 RMDTrackBuilder - Loading Tribble index from disk for file $Genomes/Broad/Human/b37/dbsnp_132.b37.vcf
INFO  14:01:30,532 MicroScheduler - Running the GATK in parallel mode with 8 concurrent threads
INFO  14:01:32,326 CountCovariatesWalker - The covariates being used here:
INFO  14:01:32,327 CountCovariatesWalker -      ReadGroupCovariate
INFO  14:01:32,327 CountCovariatesWalker -      QualityScoreCovariate
INFO  14:01:32,327 CountCovariatesWalker -      CycleCovariate
INFO  14:01:32,328 CountCovariatesWalker -      DinucCovariate
INFO  14:01:41,189 CountCovariatesWalker - Writing raw recalibration data...
INFO  14:01:44,145 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,146 HttpMethodDirector - Retrying request
INFO  14:01:44,149 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,149 HttpMethodDirector - Retrying request
INFO  14:01:44,152 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,153 HttpMethodDirector - Retrying request
INFO  14:01:44,155 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,155 HttpMethodDirector - Retrying request
INFO  14:01:44,158 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,158 HttpMethodDirector - Retrying request
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.3-17-gc62082b):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Bad input: Could not find any usable data in the input BAM file(s).
##### ERROR ------------------------------------------------------------------------------------------

**bunburillo** · 10-23-2012, 04:11 AM

Hi you all and congratulations for this useful thread.

I am trying to reproduce the pipeline posted at the beginning of the thread as an alternative way to do SNP analysis.
I am not really experienced in NGS, but I have to deal with the results of exome analyses coming from a MiSeq sequencer, and I would like to improve (compare) the results obtained through the MiSeq machine (BWA, CASAVA...).
Following the tutorial posted by ulz_peter (thanks again), I performed the initial reference genome indexing (hg19) with the latest version of bwa (0.6.2) and obtained 5 different files as a result:

hg19.amb
hg19.ann
hg19.bwt
hg19.pac
hg19.sa

According to other threads (http://seqanswers.com/forums/showthread.php?t=20705) it seems that the expected number of resulting files is 8. Can I continue with these five files, or would it be better to work with an earlier version of bwa, just to be able to reproduce the pipeline described here?
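
As far as I know, bwa releases from 0.6 onward write five index files, while older releases also wrote separate reverse-index files, which would account for the count of 8 mentioned in that thread. That is an assumption worth confirming against the bwa release notes. A quick way to confirm that an index is at least complete for the installed version is to check that each of the five files listed above exists and is non-empty. A minimal sketch, with missing_bwa_index as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: report which of the five index files listed above are missing for a prefix.
# missing_bwa_index is a hypothetical helper name.
missing_bwa_index() {
  # $1 = index prefix, e.g. "hg19"
  for ext in amb ann bwt pac sa; do
    # -s is true only for files that exist and are not empty
    [ -s "$1.$ext" ] || echo "$1.$ext"
  done
}

missing="$(missing_bwa_index hg19)"
if [ -z "$missing" ]; then
  echo "all five index files present"
else
  echo "missing or empty:"
  echo "$missing"
fi
```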

On the other hand, thinking about the next step in the pipeline, the BWA alignment options suggested in the tutorial say:

"the -I option tells BWA to use Illumina1.3+ qualities"

but if I have understood correctly, MiSeq fastq results are in Sanger format (Illumina 1.8+), so should I use the -I option or not?

I think I am asking very basic things, but basic knowledge is crucial for understanding complexity. So I'll be grateful if anyone could help me. I promise to keep asking when I have a doubt.

Thanks in advance
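
On the -I question, one way to decide is to look at the quality lines of the FASTQ file itself. Characters from '!' to ':' (ASCII 33 to 58) cannot occur in Phred+64 (Illumina 1.3+) data, so finding any of them means the file uses the Sanger / Illumina 1.8+ offset of 33, and -I should then be left out. The sketch below assumes a plain four-line-per-record FASTQ; guess_fastq_offset and bwa_aln_command are hypothetical helper names, the read in the example is made up, and the bwa command is only printed.

```shell
#!/bin/sh
# Sketch: guess the FASTQ quality offset from the quality lines (every 4th line).
# Characters from '!' to ':' (ASCII 33-58) cannot occur in Phred+64 data.
guess_fastq_offset() {
  # Reads FASTQ text on stdin
  if awk 'NR % 4 == 0' | LC_ALL=C grep -q '[!-:]'; then
    echo "phred33"
  else
    # Only high characters seen: could be Phred+64 or very high Phred+33
    echo "ambiguous"
  fi
}

bwa_aln_command() {
  # $1 = "phred33" or "phred64", $2 = index prefix, $3 = FASTQ file
  case "$1" in
    phred33) echo "bwa aln $2 $3 > ${3%.fastq}.sai" ;;
    phred64) echo "bwa aln -I $2 $3 > ${3%.fastq}.sai" ;;  # -I: Illumina 1.3+ input
    *)       echo "cannot choose options for offset: $1" >&2; return 1 ;;
  esac
}

# Example: a single made-up read with Sanger-style qualities
offset="$(printf '@r1\nACGT\n+\nII#5\n' | guess_fastq_offset)"
echo "detected: $offset"
bwa_aln_command "$offset" hg19 reads.fastq
```

An "ambiguous" result means the file contains only high-valued quality characters, in which case the sequencer or pipeline documentation is the more reliable source.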

**sirmark** · 03-15-2013, 01:43 AM

I think it's important to add to the manual and to the wiki that the vcf file and hg19.fasta are in the GATK bundle, which can be accessed with an ftp client:
GATK bundle ftp
http://gatkforums.broadinstitute.org...lic-ftp-server

I think that's an important step to add to the wiki:

SEQanswers

http://seqanswers.com/wiki/How-to/exome_analysis

**carolW** · 04-18-2013, 01:02 AM

bwa index file of hg19

Hi,
As building the bwa index for hg19 takes time, is it possible to download a prebuilt version from somewhere?

Thanks,

Carol,

Originally posted by ulz_peter View Post

Hi Folks,

As I was writing a short guide of Exome analysis in our Institute, I thought it might be of some use to others especially for newbies, who need some kind of starting point to get to analysis of exome data (pretty much like the RNA-seq manual I once read in an older thread...). Instead of explaining everything in 100 new threads one could then point to that manual...

It is the way we do exome analysis at our Institute, but I would be happy if people help improve the manual, add their knowledge and expand it, like a common knowledge base for exome-level analysis.

I attached the pdf version and a .doc version within a zip folder, as the filesize was too large for uploading the doc file alone.

The most updated version can be found in the SeqWiki (http://seqanswers.com/wiki/How-to/exome_analysis)
(just to make it clear, it is not regularly updated and its only goal is to get people started on the use of tools often used in exome sequencing)

Any comments highly appreciated!

P.S. I added a (very) short visualization chapter
