Seqanswers Leaderboard Ad

**ulz_peter** · 01-18-2012, 05:30 AM

Hi Mali,
The 1000Genomes people tend to put many samples in a single BAM file separated through their Readgroup tags. GATK states VariantRecalibration needs a lot of SNPs to work smoothly, so since we analyse very few Exomes I had to switch to VariantFiltrator.
I'd do the analysis separately for each sample and look afterwards to combine the data.
In case you're searching for a disease causing gene in a set of patients I'd recommend to have a look at IBD2 (http://compbio.charite.de/contao/ind...cles/xibd.html). It could extract regions with same haplotypes for different patients. THat decreases the amount of SNPs drastically...

You're very welcome,
Peter

**Mali Salmon** · 01-18-2012, 05:39 AM

Wow, thanks a lot Peter for the helpful information
Mali

**cllorens** · 02-03-2012, 09:48 AM

Hi Peter and everybody

First of all. I want to thank you for this excellent pipeline and all explanations. I downloaded the pdf you gave in the first post and was able to reproduce the steps therein until arriving to the section 2.6 "Quality score recalibration" where i
am now stopped and would really appreciate if you or anybody else could clarify me some questions.

In this section 2.6. you suggest to run the Step Count covariates of GATK using the option --DBSNP with the lastest SNP release version of UCSC (in your example dbsnp132.txt i took snp135 from UCSC). About this i am wondered about how to resolve the following troubles I met in order to keep going the protocol:

1) In the version (i think the last one) i have of GATK the option -DBSNP does not exist. I saw the options and check threads and i think that the option DBSNP is now called "knownSites". I guess so could you confirm me that this is correct?

2) "KnownSites" let me to run Count covariates but it gives me the error above, asking me to use a SNP database in the VCF format of the 1000 genomes project.

------------------------------------------------------------------------------------------------------------
~/assembling/tools/gatk/GenomeAnalysisTK-1.4-14-g2e47336> java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R hg19ss.fasta -knownSites:snp135.txt -I illumina.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile SNP.recal_data.csv
INFO 17:45:22,015 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:45:22,016 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.4-14-g2e47336, Compiled 2012/01/11 11:42:24
INFO 17:45:22,016 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:45:22,016 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 17:45:22,016 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 17:45:22,017 HelpFormatter - Program Args: -l INFO -R hg19ss.fasta -knownSites:snp135.txt -I illumina.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile SNP.recal_data.csv
INFO 17:45:22,017 HelpFormatter - Date/Time: 2012/02/03 17:45:22
INFO 17:45:22,017 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:45:22,017 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:45:22,027 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:45:22,059 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 17:45:22,071 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 17:45:23,776 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.4-14-g2e47336):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
##### ERROR ------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------

3) My question (but only if my question to step 1 is that i was doing the correct think) here is about if UCSC does provide any tool to convert plain files to VCF or if is there any conversor available to convert UCSC SNPs DBs into VCF?

4) If there is not a conversor from UCSC to VCF i think I could get the dbsnp from the NCBI FTP but now the question is if i do that could I use this NCBI dbSNP over exome data mapped over the hg19 of the UCSC?? or i should repeat the whole protocol
from the beginning using this time the human genome downloaded from NCBI as a reference?

thank you in advance
Carlos

**Jon_Keats** · 02-03-2012, 11:51 AM

Hi Carlos,

1)The dbSNP to knownsites change is correct as you figured out the thread is using an older version than the current GATK 1.4.x

2)As you notice the use of properly formated VCF files is pretty uniform now in the GATK so I would move to using those files.

3)You should not have an issue using a VCF from NCBI though you may need to modify to match by adding chr to each chromosome postion. Anyone that follows my threads would see I'd strongly recommend not using UCSC based references or annotations, remember GATK is largely related to 1000 genomes which uses Sanger accepted references (ie. GRCh37).

4) Most of the GATK steps will work better if you use the provided files on the GSA ftp site as they are all formatted correctly though they expect reference files from NCBI,ensembl,1000 genomes. However, last I check they were dbSNP132 versions.

5) If you are interested we have posted a pipeline that implements most of the suggestions on this thread less a couple of changes to match the current GATK best practice guidelines for exomes. (http://seqanswers.com/forums/showthr...?t=4589&page=4) or (www.keatslab.org)

**cllorens** · 02-03-2012, 02:04 PM

Hi Jon
Thank you for your prompt and instructive answer. I going to check the posts and links you indicate here.

**cllorens** · 02-08-2012, 10:08 AM

The GS ftp site of GATK is just exactly what i needed. thank you Jon.

**sehrrot** · 02-27-2012, 07:38 PM

thanks for the great manual.

I have a short question:
I'm analysing Nimblegen exome and not sure which bed file I have to use for SNP calling. The bed file from nimblegen is not proper format compared to the bed file that Broad inst. provides. Otherwise, can I just use the bed file that the manual recommends (10bp interval target bed file from ucsc)?

**Orr Shomroni** · 02-27-2012, 10:55 PM

sehrrot, you can also download it from the original Nimblegen website:

http://www.nimblegen.com/products/seqcap/ez/v2/index.html

Just open "Design and annotation files", and version 2 of the BED file should be there for download. Make sure also that the Nimblegen kit is the same, because they recently started selling version 3

**sehrrot** · 02-27-2012, 11:16 PM

Thanks Orr

I've tried that with V2 bed file but got an error message

File associated with name ../../apps/annotations/Design_Annotation_files/Target_Regions/SeqCap_EZ_Exome_v2.bed is malformed: Couldn't parse line 'track name=target_region description="Target Regions

I cannot exactly understand why the format of bed files is different - ucsc bed file and nimblegen one.

**Orr Shomroni** · 02-28-2012, 12:08 AM

What software do you use to integrate the BED file into the analysis? I'm using samtools, and it works fine.

Did you get a BED file for the Nimblegen kit from the UCSC website? Can you give me a link to it? I don't know how the BED file from UCSC looks like, but the Nimblegen one starts with this line:

track name=target_region description="Target Regions"

Also, it has 2 sections: section 1 indicates positions of enrichment regions, and section 2 indicates complete probe regions (so enrichments regions + a few bases before and after them). So it could be that your software cannot deal with the header in the second section.

**cllorens** · 02-28-2012, 04:42 AM

Sehrrot
Try to see if your bed file has header and if so remove it and try to run the command again. I had the same bug report and doing so it worked succesfully.
Carlos

**ddaneels** · 04-12-2012, 09:38 PM

Hello,

Really helpful manual! Nice work!

I saw that you are working with Illumina data. I was wondering if you clipped of the 3' adaptor sequence? If so, with which tool?

**ulz_peter** · 04-12-2012, 10:03 PM

thanks a lot. Up to now, there was no need to remove adaptor sequences from the exome-sequencing data we got. However, for other experiments I was using the cutadapt tool (http://code.google.com/p/cutadapt/). Many people use the FastX-Toolkit, which somehow didn't work for my data (from the MiSeq)...

**ddaneels** · 04-12-2012, 10:50 PM

Thanks for the fast reply!

I do have another question: I downloaded the newest UCSC release of the human genome. 7When I unpack this .tar.gz file there are a lot of different files called f.e. chr17_gl000 ...

Should I also include these in the single genome reference file?

**ulz_peter** · 04-12-2012, 11:09 PM

That's a good question. Those files correspnd to contigs that have not been reliably placed on a chromosome due to problematic sequences. I personally just use chr1-22 X, Y and M. There should be some chrUn_... contigs as well, they have not been associated with a chromosome so far. It depends a bit what you want to do...
However, I am not sure if GATK is very happy with those chromosome files. I never tried it out.

Another thing to consider would be the alternative haplotypes for thr MHC and the other two regions. I am not sure how much they deviate from the original reference sequence, or if there is a good reason for including them.

Lots of things to consider, but for getting started I'd recommend to use only the chr1-22, X, Y and M(T) chromosomes (in this order)

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 26 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News