SEQanswers

Old 03-12-2013, 11:43 AM   #1
slp
Junior Member
 
Location: NY, USA

Join Date: Dec 2010
Posts: 9
No "0/0" (homozygous ref) genotypes in VCF file

Hi,
I received a VCF file from a collaborator that doesn't contain any "0/0" (homozygous reference) genotypes, which is hard to believe. Instead there are a lot of "./." genotypes. He says that they don't use the VCF format in their pipeline but use sam2vcf.pl to convert to it.

The entries in the VCF file look like this:
chr10 94025 . T C 123.00 PASS AC=1; AN=2; DP=55; GT: DP:GQ . . . . . . . . . . 0/1:55:123 . . . . . . . . .


Does anybody have an idea why this is so?

Thanks
S
Old 03-12-2013, 11:50 AM   #2
vivek_
PhD Student
 
Location: Denmark

Join Date: Jul 2012
Posts: 164

Unless you force the genotyper to genotype all sites and emit them in the output, you will not get sites where all samples are homozygous reference. In other words, you will only have loci where at least one sample has a non-reference allele.
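
To make that rule concrete, here is a minimal Python sketch of the "emit only variant sites" behaviour. The function names `classify_gt` and `is_variant_site` are made up for illustration and are not taken from any caller; real genotypers apply quality models on top of this.

```python
def classify_gt(gt: str) -> str:
    """Classify a VCF GT string such as '0/0', '0/1', '1|1' or './.'."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return "missing"          # no call for at least one allele
    if all(a == "0" for a in alleles):
        return "hom_ref"          # every allele is the reference allele
    if len(set(alleles)) == 1:
        return "hom_alt"          # same non-reference allele on all copies
    return "het"


def is_variant_site(gts):
    """True if at least one sample carries a non-reference allele."""
    return any(classify_gt(gt) in ("het", "hom_alt") for gt in gts)


print(is_variant_site(["0/0", "0/1", "./."]))  # True
print(is_variant_site(["0/0", "0/0", "./."]))  # False
```

Under this sketch, a site where every called sample is "0/0" fails the check, which is why it would be absent from a variants-only file.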

Last edited by vivek_; 03-12-2013 at 12:27 PM.
Old 03-12-2013, 12:16 PM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

Quote:
Originally Posted by slp View Post
Hi,
I have got a vcf file from our collaborator which doesn't have any "0/0" or homozygous reference genotypes in it (which is hard to believe). Instead there are a lot of "./." genotypes. He says that they don't use vcf format in their pipeline but use sam2vcf.pl to convert to vcf format.

The entries in VCF file are like:
chr10 94025 . T C 123.00 PASS AC=1; AN=2; DP=55; GT: DP:GQ . . . . . . . . . . 0/1:55:123 . . . . . . . . .


Does anybody have an idea why is it so?

Thanks
S
That's weird. I would guess that the dots are meant to represent 0/0 genotypes, but it's nice to have the quality scores of the 0/0 calls.

Because yeah, there should be some loci where some, but not all, of the samples are homozygous reference.
Old 03-13-2013, 08:58 AM   #4
oiiio
Senior Member
 
Location: USA

Join Date: Jan 2011
Posts: 105

I thought it was standard that "./." meant no confident genotype could be called, and that "0/0" meant homozygous ref was called. The best thing to do is just clarify this with your collaborators.
Old 03-13-2013, 11:10 AM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

Quote:
Originally Posted by oiiio View Post
I thought it was standard that "./." meant there was no confident genotype to be called, and "0/0" in that case that homozygous ref was called. The best thing to do is just clarify this with your collaborators.
I don't think that's standard. I don't see anything about that usage in the VCF standard. I use samtools mpileup to make VCFs, and it has never done that. It might say "0/0" with a quality score of 3, but it never leaves the call empty.

And even if that were standard, there should likely be at least one locus where one sample has a possible SNP and at least one other sample is clearly homozygous.

Not knowing the quality of the homozygous reference calls is going to make it very difficult to judge the quality of the deviations from that.
Old 03-13-2013, 11:18 AM   #6
oiiio
Senior Member
 
Location: USA

Join Date: Jan 2011
Posts: 105

You are right. To clarify, this is actually just GATK's behavior when told to emit all sites.
Old 03-13-2013, 05:36 PM   #7
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124

AFAIK, "./." means the call at that position is missing, and it could be missing for a variety of reasons. A call may have been made but failed some threshold and been filtered out. It could also be missing because multiple VCFs were merged together, and samples that don't have a call at a position are listed as "./.".

By default, when someone runs a variant caller on an individual sample, usually only the variant calls are emitted, so there are no "0/0" reference calls in the VCF file. That is fine if you want to know the variants of a single sample. It becomes a problem when you want to compare SNP calls across samples, because you can't assume that an absent call in the VCF means it was a reference call (it could also have been a position where the caller couldn't make an accurate call).

One option is to force the caller to emit all calls, even reference calls. This will generate very large files. Another option is to call SNPs simultaneously on the samples and output a multi-sample VCF (which can be done with samtools mpileup or GATK). With this approach, if one of the samples has a variant, then the calls for all the other samples will be emitted too, even if they are reference or low quality. I prefer this way, because the VCF file is still small, and it is then an easy task to find positions that have different calls across the samples.
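
As a rough illustration of that last step, here is a Python sketch that scans a multi-sample VCF for positions where called samples disagree, while treating "./." as unknown rather than as reference. The function names and the two example records are invented for this post; it only relies on GT being the first colon-separated subfield of each sample column.

```python
def genotypes_by_site(vcf_text):
    """Yield (chrom, pos, {sample: GT}) for each record in VCF text."""
    samples = []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        # GT is the first colon-separated subfield of each sample column
        gts = {s: col.split(":")[0] for s, col in zip(samples, fields[9:])}
        yield fields[0], int(fields[1]), gts


def normalize(gt):
    """Ignore phasing and allele order so '1|0' and '0/1' compare equal."""
    return "/".join(sorted(gt.replace("|", "/").split("/")))


def discordant_sites(vcf_text):
    """Sites where at least two *called* samples have different genotypes."""
    out = []
    for chrom, pos, gts in genotypes_by_site(vcf_text):
        # Missing calls ('.' or './.') are skipped, not counted as reference
        called = {normalize(gt) for gt in gts.values() if "." not in gt}
        if len(called) > 1:
            out.append((chrom, pos))
    return out


# Made-up records, for illustration only
EXAMPLE = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "chr10\t94025\t.\tT\tC\t123\tPASS\tDP=55\tGT:DP:GQ\t0/1:55:123\t0/0:40:99\t./.",
    "chr10\t95000\t.\tG\tA\t50\tPASS\tDP=20\tGT:DP:GQ\t0/1:20:50\t./.\t./.",
])

print(discordant_sites(EXAMPLE))
```

In the example, only the first record has two called samples that differ. The second is left out because nothing is known about S2 and S3 there.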

Hope that is of some help. We've wrestled with how best to handle this quite a bit, so definitely interesting to hear how others manage it.

Justin
Old 03-13-2013, 05:52 PM   #8
slp
Junior Member
 
Location: NY, USA

Join Date: Dec 2010
Posts: 9

Quote:
Originally Posted by BAMseek View Post
Another option is to call SNPs simultaneously on the samples and output a multi-sample VCF (which can be done with samtools mpileup or GATK). For this solution, if one of the samples has a variant, then the calls for all the other samples will be emitted too, even if it's reference or low quality. I prefer this way, because the VCF file is still small, and it is then an easy task to find positions that have different calls across the samples.

Justin

Precisely, that's what I've always seen. Multi-sample VCF files have a variant in at least one of the samples, and the rest of the samples are either "0/0" or "./.". I asked the collaborator, and they say that in their files "./." corresponds to homozygous reference.

But I noticed this weird genotyping in the mouse multi-strain/sample InDels VCF file (http://www.sanger.ac.uk/resources/mouse/genomes/) too. For all the positions that have an InDel called in one of the mouse strains, no strain had a 0/0 genotype (everything else was "./."). There were no 0/0 calls in the C57BL/6NJ strain either, which is considered to be the reference strain.

Clearly this is not standard VCF 4.1 usage. I've posted a question to mousegenomes@sanger.ac.uk and am waiting to hear from them.
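
For anyone who wants to run the same check on their own file, a quick per-sample tally of GT strings is enough to see whether a given sample ever gets a "0/0". This is only a sketch: the function name, the strain names and the two records below are made up, not taken from the Sanger file.

```python
from collections import Counter


def gt_counts_per_sample(vcf_text):
    """Return {sample: Counter of GT strings} for a multi-sample VCF."""
    samples = []
    counts = {}
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            counts = {s: Counter() for s in samples}
            continue
        for s, col in zip(samples, fields[9:]):
            counts[s][col.split(":")[0]] += 1  # GT is the first subfield
    return counts


# Made-up records with hypothetical strain names
EXAMPLE = "\n".join([
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tstrainA\tstrainB",
    "chr1\t100\t.\tA\tAT\t60\tPASS\t.\tGT\t./.\t1/1",
    "chr1\t200\t.\tG\tGA\t60\tPASS\t.\tGT\t1/1\t./.",
])

for sample, counter in gt_counts_per_sample(EXAMPLE).items():
    print(sample, dict(counter))
```

A sample whose tally has no "0/0" entry at all, as in this toy input, shows the pattern described above.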
Old 03-13-2013, 06:05 PM   #9
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124

Quote:
Originally Posted by slp View Post
I asked the collaborator and they say that in their files "./." corresponds to homozygous reference.
This is a guess, but it may be that SNPs were called on the samples individually, and then those VCFs were merged into one multi-sample VCF. Provided there was sufficient evidence at a position for the caller to make a confident call, those ./. calls would be homozygous reference calls. However, it could also be that the call is absent because the caller couldn't make a call at that position. The way I've dealt with this situation before is to go back to the BAM file to figure out the depth: if the depth is above a certain number, I treat it as a ref call, otherwise as a no-call. That's the best I could think to do, but it's less than optimal.
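
A rough Python sketch of that depth check, assuming the per-position depths have already been exported as tab-separated chrom/pos/depth lines (the layout `samtools depth` writes for a single BAM). The function names and the `min_depth=10` cutoff are placeholders I picked for illustration, not recommended values.

```python
def load_depths(depth_text):
    """Parse tab-separated (chrom, pos, depth) lines into a lookup table."""
    depths = {}
    for line in depth_text.splitlines():
        if not line.strip():
            continue
        chrom, pos, depth = line.split("\t")[:3]
        depths[(chrom, int(pos))] = int(depth)
    return depths


def resolve_missing(gt, chrom, pos, depths, min_depth=10):
    """Turn a missing call into '0/0' only when coverage is high enough."""
    if "." not in gt:
        return gt  # already called; leave untouched
    # Positions absent from the depth table are treated as depth 0
    if depths.get((chrom, pos), 0) >= min_depth:
        return "0/0"
    return "./."


# Made-up depths for two positions
depths = load_depths("chr10\t94025\t55\nchr10\t95000\t3\n")
print(resolve_missing("./.", "chr10", 94025, depths))  # 0/0
print(resolve_missing("./.", "chr10", 95000, depths))  # ./.
```

Good depth alone doesn't prove a site is reference, so this is still a heuristic and the rescued calls carry no genotype quality.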