Dear all,
My lab performed whole exome sequencing of 15 human samples, and I performed the variant calling. However, even after stringent filtering, I ended up with about 75.800 variants (present in at least 1 of the 15 samples; unique variants per sample: 35.000-42.000) and a Ti/Tv ratio of 2.6. Compared to other studies, this number seems too high. For example, O'Rawe et al. http://genomemedicine.com/content/5/3/28 performed per-sample variant calling (not multi-sample variant calling as I did) and checked the concordance between 5 variant calling pipelines. Considering only GATK and samtools, they found 26.323 variants in the call intersection (averaged over all samples, I think). I would like your opinion on why I get such a high number of variants with a low Ti/Tv ratio, and suggestions for what I can do about it. I realize that people often apply additional filters to extract a subset of interesting variants (e.g. only variants with AF < 0.05), but I think some of my variants must be false positives, and I want to eliminate those before I do any downstream analysis. Here is the relevant information. Thank you very much in advance.
- Samples: 4 samples are affected by familial PD, 6 samples are healthy (or as yet undiagnosed) family members, 5 samples are unrelated healthy people
- Exome Enrichment Kit: Illumina TruSeq (~62 Mb target region)
- Sequenced on Illumina MiSeq in multiple runs; for each run and each sample, an alignment was created with BWA, and the per-run alignments were then merged per sample with Picard
- Coverage across the 15 samples (not uniform between samples): median coverage 20X - 40X, about 40-60% of the target region (TruSeq) covered by at least 30X
- 1st processing: BWA + GATK Best Practices (MD, IR, BQSR, multi-sample variant calling with UnifiedGenotyper only in target region + 20 bp offset, VQSR) --> ~ 124.000 variants
  then manual filter based on alternative allele count: only accepted variants where at least one sample had alt allele frequency >= 0.35 and number of reads supporting the alt allele >= 10 --> 98.000 variants remaining
- 2nd processing: BWA + GATK MD, IR, BQSR + variant calling with samtools / bcftools only in target region + bcftools varfilter total min cov 60, max cov 1500 --> 109.000 variants
  then manual filter based on alternative allele count as for 1st processing --> 81.000 variants remaining
- Intersection of the 98.000 GATK variants and the 81.000 samtools variants --> ~ 75.800 variants that are present in at least one sample, unique variants per sample: 35.000-42.000
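
To make the manual filter and the intersection step above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the script that was actually run: it assumes each call set is available as a mapping from a variant key (chrom, pos, ref, alt) to per-sample (ref_reads, alt_reads) pairs, for example parsed from the AD field of a VCF, and it takes the alt allele frequency as alt / (ref + alt). All function names, variable names, and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple

# Thresholds as stated in the post.
MIN_ALT_FRACTION = 0.35
MIN_ALT_READS = 10

Variant = Tuple[str, int, str, str]      # (chrom, pos, ref, alt)
Depths = Dict[str, Tuple[int, int]]      # sample -> (ref_reads, alt_reads)


def sample_passes(ref_reads: int, alt_reads: int) -> bool:
    """True if one sample supports the alt allele strongly enough."""
    total = ref_reads + alt_reads
    if total == 0:
        return False  # no informative reads, so no support
    return (alt_reads >= MIN_ALT_READS
            and alt_reads / total >= MIN_ALT_FRACTION)


def variant_passes(depths: Depths) -> bool:
    """Keep a variant if at least one sample passes both thresholds."""
    return any(sample_passes(ref, alt) for ref, alt in depths.values())


def filter_calls(calls: Dict[Variant, Depths]) -> Set[Variant]:
    """Return the keys of all variants that survive the manual filter."""
    return {variant for variant, depths in calls.items()
            if variant_passes(depths)}


# Toy call sets (hypothetical numbers, two samples only).
gatk_calls = {
    ("chr1", 1000, "A", "G"): {"S1": (12, 11), "S2": (30, 0)},
    ("chr1", 2000, "C", "T"): {"S1": (40, 5), "S2": (25, 4)},
    ("chr2", 3000, "G", "A"): {"S1": (5, 15), "S2": (20, 0)},
}
samtools_calls = {
    ("chr1", 1000, "A", "G"): {"S1": (13, 10), "S2": (28, 0)},
    ("chr2", 3000, "G", "A"): {"S1": (30, 10), "S2": (20, 0)},
    ("chr3", 4000, "T", "C"): {"S1": (10, 10)},
}

# Intersection of the two filtered call sets, matched on exact
# chrom/pos/ref/alt. Real indel calls may need normalization first.
shared = filter_calls(gatk_calls) & filter_calls(samtools_calls)
print(sorted(shared))
```

Note that matching on the exact (chrom, pos, ref, alt) key is itself an assumption: the two callers can represent the same indel differently, so the intersection of real call sets depends on how variants are normalized before comparison.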