Multi-sample VCF generated in GATK is not working with novoutil IUPAC

sdarko

Member

Join Date: Apr 2009
Posts: 51

Multi-sample VCF generated in GATK is not working with novoutil IUPAC

07-19-2012, 04:55 AM

Hi,

We are trying to find SNPs in a few loci of interest in rhesus macaque. To that end, we have sequenced exome sequences and I've gone through the process of calling SNPs using GATK and I've come up with a multi-sample VCF file.

Now we want to actually create alternative fasta references for each macaque at our loci of interest. I thought that Novocraft utility "novoutil iupac" would do exactly what I want it to do, which is (from the most recent version):

Code:

Novoutil iupac
      1. Add option ( -g ) to substitute genotype for the reference base. Without
         the -g option ambiguous bases are formed by combining the reference base
         and the genotype.
      2. Add option -i to substitute homozygous indel calls into the new reference.
         This is useful for reference guided assembly.
      3. Add option -q 99 that sets a lower quality limit (default 30) on usable VCF
         data lines.
      4. If the REF column of VCF file shows an N then IUPAC code is formed from the
         ALT alleles.

However, when I run novoutil iupac with my VCF file and the reference genome, I get the following error:

Code:

# novoutil iupac (V2.08.02) - Build Jun 28 2012 @ 16:23:10
# (C) 2008,2009,2010,2011 NovoCraft Technologies Sdn Bhd.
Loading VCF file...
novoutil: src/vcfsnps.cc:152: bool vcffile::hetero(tabbed&): Assertion `*gt == '|' || *gt == '/'' failed.

So it clearly doesn't like something about my VCF file, which looks something like this:

Code:

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[sorted_realigned_RG_20120712_unmasked_lane_1_index_3.bam, sorted_realigned_RG_20120712_unmasked_lane_1_index_8.bam, sorted_realigned_RG_20120712_unmasked_lane_2_index_1.bam, sorted_realigned_RG_20120712_unmasked_lane_2_index_2.bam, sorted_realigned_RG_20120712_unmasked_lane_5_index_6.bam, sorted_realigned_RG_20120712_unmasked_lane_5_index_9.bam, sorted_realigned_RG_20120712_unmasked_lane_6_index_4.bam, sorted_realigned_RG_20120712_unmasked_lane_6_index_5.bam, sorted_realigned_RG_20120712_unmasked_lane_3_index_7.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/vrc1_data/douek_lab/reference_sets/novoalign/alt_macaque/rhesus_TR_IG_unmasked.fa nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=-1 defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_level=INFO log_to_file=null help=false genotype_likelihoods_model=BOTH p_nonref_model=EXACT heterozygosity=0.0010 pcr_error_rate=1.0E-4 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=30.0 noSLOD=false annotateNDA=false alleles=(RodBinding name= source=UNBOUND) min_base_quality_score=17 max_deletion_fraction=0.05 max_alternate_alleles=3 min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 noBandedIndel=false indelDebug=false ignoreSNPAlleles=false dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false"
##contig=<ID=IgH1,length=610659>
##contig=<ID=IgH2,length=260169>
##contig=<ID=IgK1,length=977354>
##contig=<ID=IgK2,length=4152696>
##contig=<ID=IgL1,length=20629>
##contig=<ID=IgL2,length=7376870>
##contig=<ID=IgL3,length=2230>
##contig=<ID=IgL4,length=7654437>
##contig=<ID=IgL5,length=8461>
##contig=<ID=IgL6,length=8824>
##contig=<ID=IgL7,length=1082>
##contig=<ID=IgL8,length=4740>
##contig=<ID=IgL9,length=13131495>
##contig=<ID=TRA1,length=23168555>
##contig=<ID=TRB1,length=19618912>
##contig=<ID=TRD1,length=1136629>
##contig=<ID=TRG1,length=17450015>
##reference=file:///vrc1_data/douek_lab/reference_sets/novoalign/alt_macaque/rhesus_TR_IG_unmasked.fa
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	monkey1_3	monkey1_8	monkey2_1	monkey2_2	monkey3_7	monkey5_6	monkey5_9	monkey6_4	monkey6_5
TRG1	2855	.	G	T	75.71	PASS	AC=6;AF=1.00;AN=6;DP=3;Dels=0.00;FS=0.000;HRun=3;HaplotypeScore=0.0000;MQ=41.02;MQ0=0;QD=25.24;SB=-0.02	GT:AD:DP:GQ:PL	./.	1/1:0,1:1:3.01:34,3,0	./.	1/1:0,1:1:3.01:34,3,0	./.	./.	./.	./.	1/1:0,1:1:3.01:42,3,0
TRG1	3864	.	A	G	42.75	LowQual	AC=4;AF=0.50;AN=8;BaseQRankSum=0.000;DP=4;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=150.00;MQ0=0;MQRankSum=-1.026;QD=21.37;ReadPosRankSum=1.026;SB=2.99	GT:AD:DP:GQ:PL	./.	1/1:0,1:1:3.01:42,3,0	./.	1/1:0,1:1:3.01:42,3,0	0/0:1,0:1:3.01:0,3,41	./.	0/0:1,0:1:3.01:0,3,42	./.	./.
TRG1	4614	.	T	C	110.43	PASS	AC=7;AF=0.70;AN=10;BaseQRankSum=0.000;DP=6;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.0000;MQ=101.00;MQ0=0;MQRankSum=1.380;QD=22.09;ReadPosRankSum=0.000;SB=-30.01	GT:AD:DP:GQ:PL	./.	0/1:1,1:2:32.56:35,0,33	1/1:0,1:1:3.01:41,3,0	./.	./.	0/0:1,0:1:3.01:0,3,34	1/1:0,1:1:3.01:37,3,0	./.	1/1:0,1:1:3.01:42,3,0
TRG1	4628	.	C	T	34.82	LowQual	AC=3;AF=0.75;AN=4;BaseQRankSum=-0.736;DP=3;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.0000;MQ=51.96;MQ0=0;MQRankSum=0.736;QD=11.61;ReadPosRankSum=0.736;SB=-29.60	GT:AD:DP:GQ:PL	./.	0/1:1,1:2:35.21:35,0,35	./.	./.	./.	1/1:0,1:1:3.01:34,3,0	./.	./.	./.
TRG1	5257	.	T	A,G	75.47	PASS	AC=3,2;AF=0.38,0.25;AN=8;BaseQRankSum=0.622;DP=114;Dels=0.00;FS=0.000;HaplotypeScore=0.1835;MQ=13.79;MQ0=4;MQRankSum=-2.795;QD=1.54;ReadPosRankSum=-0.487;SB=-24.83	GT:AD:DP:GQ:PL	./.	1/1:2,4,14:21:3.01:34,3,0,34,3,34	./.	0/0:2,2,10:14:3.01:0,3,34,3,34,34	0/1:1,5,14:20:25.32:73,0,25,76,31,107	2/2:2,3,3:8:3.01:42,42,42,3,3,0	./.	./.	./.
TRG1	5268	.	C	T	61.81	PASS	AC=3;AF=0.30;AN=10;BaseQRankSum=-0.431;DP=99;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.5441;MQ=14.90;MQ0=2;MQRankSum=2.302;QD=1.63;ReadPosRankSum=0.825;SB=-24.38	GT:AD:DP:GQ:PL	0/0:11,1:12:3:0,3,25	1/1:16,1:18:3.01:34,3,0	./.	0/0:9,0:10:3.01:0,3,34	0/1:17,3:20:25.32:69,0,25	0/0:8,0:8:3.01:0,3,42	./.	./.	./.
TRG1	5275	.	G	C	62.34	PASS	AC=3;AF=0.30;AN=10;BaseQRankSum=-0.945;DP=56;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.0882;MQ=19.71;MQ0=1;MQRankSum=2.439;QD=2.97;ReadPosRankSum=0.154;SB=-24.38	GT:AD:DP:GQ:PL	0/0:6,1:8:3:0,3,25	1/1:9,1:11:3.01:34,3,0	./.	0/0:9,0:9:3.01:0,3,34	0/1:8,2:10:25.32:70,0,25	0/0:5,0:5:3.01:0,3,41	./.	./.	./.

Does anyone have any experience with this sort of issue?

Thanks,
Sam

Tags: None

Previous template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Multi-sample VCF generated in GATK is not working with novoutil IUPAC

Latest Articles

ad_right_rmr

News