
**jkerouac** · 01-17-2013, 10:13 AM

Yes, it makes a lot of sense to do this. For example, I have a modestly sized exome sequencing project of an extreme phenotype (50 and 50 samples). We filtered for highly deleterious variants, then compared what is in one group versus the other. What we found were a fair number of false positives that fit a scenario where coverage for the region was low in general, but a few samples reached a depth (6-10 reads) at which the variant was called. So there wasn't really a variant present in one group and absent from the other, just stochastic calling of low-coverage variants that produced a false positive. I think the slight unevenness of coverage from sample to sample in low-coverage areas is a big problem.

I don't have an answer right now. I just started working on this problem today (hence my finding your question), but I will repost if I work out a solution. And if anyone else knows how to filter VCF files so that only variants that were at least "callable" in all samples, or in a defined proportion of them, are selected, I would much appreciate hearing it.
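As a starting point for the "callable in a defined proportion of samples" idea, here is a minimal sketch rather than a tested solution. It assumes a jointly called multisample VCF whose FORMAT column carries a per-sample DP field; the function names and the thresholds (`min_depth=8`, `min_fraction=0.9`) are placeholders invented for illustration, not recommended values.

```python
def sample_depths(vcf_line):
    """Return per-sample DP values (None when missing) for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    samples = fields[9:]
    if "DP" not in format_keys:
        return [None] * len(samples)
    dp_index = format_keys.index("DP")
    depths = []
    for sample in samples:
        values = sample.split(":")
        # Trailing FORMAT fields may be dropped, and "." marks a missing value.
        raw = values[dp_index] if dp_index < len(values) else "."
        depths.append(int(raw) if raw.isdigit() else None)
    return depths


def is_callable_site(vcf_line, min_depth=8, min_fraction=0.9):
    """True when enough samples reach min_depth at this site."""
    depths = sample_depths(vcf_line)
    if not depths:
        return False
    covered = sum(1 for d in depths if d is not None and d >= min_depth)
    return covered / len(depths) >= min_fraction


def filter_vcf(lines, min_depth=8, min_fraction=0.9):
    """Yield header lines unchanged and only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or is_callable_site(line, min_depth, min_fraction):
            yield line
```

Setting `min_fraction=1.0` requires every sample to reach the depth threshold. For a case/control comparison it may be more appropriate to apply the fraction within each group separately, which this sketch does not do.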

**DavyK** · 01-17-2013, 10:25 AM

Yes, filtering is clearly important, although you should always filter before comparing the two sample groups. Otherwise you get into the issue of adjusting your filters based on what you see in the case vs control samples.

In any case, on a more thorough reading of the GATK documentation website, filtering on read depth is no longer recommended. Instead, they suggest a number of filters that might (emphasis on might) help to rule out FPs.

For SNPs:

QD < 2.0
MQ < 40.0
FS > 60.0
HaplotypeScore > 13.0
MQRankSum < -12.5
ReadPosRankSum < -8.0

I added another filter, though, from the SEQanswers exome sequencing analysis wiki:

MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > .01)

However, your project sounds like it's adequately powered for you to run the variant quality score recalibration (VQSR) tool from the GATK. Whole-exome data from more than 30 samples is stated as being the minimum, and VQSR is shown to be better than hard filtering.
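To make the logic of the hard filters listed above concrete, here is a rough Python sketch that checks one SNP's INFO annotations against those thresholds. It is not how GATK applies them: the function name and the dict-based input are made up for illustration, and treating a missing annotation as "not failing" is an assumption of this sketch.

```python
def failed_snp_filters(info):
    """Return the names of the hard filters tripped by a SNP's INFO annotations.

    `info` is a dict of annotation name -> numeric value, e.g. parsed from the
    INFO column. Annotations absent from the dict are skipped (an assumption).
    """
    checks = {
        "QD": lambda v: v < 2.0,
        "MQ": lambda v: v < 40.0,
        "FS": lambda v: v > 60.0,
        "HaplotypeScore": lambda v: v > 13.0,
        "MQRankSum": lambda v: v < -12.5,
        "ReadPosRankSum": lambda v: v < -8.0,
    }
    failed = [name for name, fails in checks.items()
              if name in info and fails(info[name])]

    # MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > .01), guarded against DP == 0.
    mq0, dp = info.get("MQ0"), info.get("DP")
    if mq0 is not None and dp and mq0 >= 4 and mq0 / (1.0 * dp) > 0.01:
        failed.append("MQ0")
    return failed
```

A variant with an empty result passes every filter; anything else can be flagged with the returned names rather than dropped outright, so the calls can still be inspected later.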

**jkerouac** · 01-17-2013, 10:39 AM

Thanks for the reply, that is helpful.

Yes, we used the VQSR tool, and by manual inspection of hundreds of calls it did a nice job. But it doesn't get around the false positive problem I described (which I thought you were also describing): low-coverage areas that vary in their ability to be called from one sample to the next.


Filtering multisample vcf files on DP
