I am looking for some advice on hard filtering based on allele balance (either over the entire cohort or by sample).
Sorry if this post is long, but I wanted to give a little background. I have 94 exomes that have been put through the GATK pipeline following the most current best practices. I ran VQSR, kept only PASS variants, and set a minimum depth of 20 for each genotype. After doing association analysis in PLINK, it is obvious that I have false heterozygous genotypes, and they are giving me significant associations that shouldn't be there. Sanger sequencing of these significant SNPs gives mostly ref/ref where GATK is calling ref/alt. When I look at the VCF at many of these sites, the calls just look "wrong"; I don't know how to explain it any other way. Depth may be high, but there are cases where the ref/alt read counts are something like 150/5 and the genotype is still called a het. Sometimes there are no alt-supporting reads at all and it is still called a het. I realize that GATK HaplotypeCaller outputs the most likely genotype based on a model and does not just look at raw counts.
It was suggested to me that I consider filtering by allele balance or BAF. In GATK I can annotate my VCF with the allele balance across all samples, and I can also annotate the allele balance on a per-sample basis. Now that I have done that, I am unsure what threshold to set (if any). I want to be able to filter out these "bad genotypes" before running association analysis. At this point I would rather have no significant associations than associations driven by genotyping errors.
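To make the question concrete, here is a rough Python sketch of the kind of per-sample filter I have in mind. It assumes each sample column carries GT and AD in its FORMAT string, and it uses the sum of AD as the depth. The function names and the 0.25/0.75 allele-balance cutoffs are placeholders I made up for illustration, not recommended values; choosing those numbers is exactly what I am asking about. Only ref/alt hets are checked; alt/alt hets such as 1/2 are passed through untouched.

```python
def parse_sample(format_field, sample_field):
    """Map FORMAT keys (e.g. 'GT:AD:DP') onto one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def alt_fraction(ad):
    """Fraction of reads not supporting the ref allele; None if there are no reads."""
    total = sum(ad)
    if total == 0:
        return None
    return (total - ad[0]) / total


def keep_genotype(format_field, sample_field,
                  min_ab=0.25, max_ab=0.75, min_depth=20):
    """Return True if the genotype should be kept.

    min_ab and max_ab are placeholder thresholds (assumptions, not advice).
    min_depth=20 mirrors the depth cutoff described above, applied to sum(AD).
    """
    fields = parse_sample(format_field, sample_field)
    alleles = fields.get("GT", "./.").replace("|", "/").split("/")
    ad_raw = fields.get("AD", ".").split(",")

    # Drop missing genotypes or missing allele depths.
    if "." in alleles or "." in ad_raw:
        return False

    ad = [int(x) for x in ad_raw]
    if sum(ad) < min_depth:
        return False

    # Apply the allele-balance check only to ref/alt heterozygotes.
    is_ref_alt_het = len(alleles) == 2 and "0" in alleles and alleles[0] != alleles[1]
    if is_ref_alt_het:
        ab = alt_fraction(ad)
        return min_ab <= ab <= max_ab

    return True
```

With these placeholder cutoffs, a het with read counts 150/5 (`keep_genotype("GT:AD", "0/1:150,5")`) would be dropped, while a het at 22/18 would be kept. Genotypes that fail could then be set to missing before export to PLINK.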
I was just hoping for some community feedback on this. Thanks for your help!