  • slp
    Junior Member
    • Dec 2010
    • 9

    No "0/0" (homozygous ref) genotypes in VCF file

    Hi,
    I have got a vcf file from our collaborator which doesn't have any "0/0" or homozygous reference genotypes in it (which is hard to believe). Instead there are a lot of "./." genotypes. He says that they don't use vcf format in their pipeline but use sam2vcf.pl to convert to vcf format.

    The entries in VCF file are like:
    chr10 94025 . T C 123.00 PASS AC=1;AN=2;DP=55 GT:DP:GQ . . . . . . . . . . 0/1:55:123 . . . . . . . . .


    Does anybody have an idea why this is so?

    Thanks
    S
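One way to confirm the observation in the post above is to tally the genotype classes on each record. The following is a minimal sketch, not part of the original thread: the helper names `classify` and `tally_record` are hypothetical, the example record is a shortened, tab-delimited version of the one quoted in the post, and a genotype with any missing allele is assumed to count as missing.

```python
import re
from collections import Counter


def classify(genotype):
    """Label one GT string as 'hom_ref', 'missing', or 'variant'."""
    alleles = re.split(r"[/|]", genotype)  # accept unphased (/) and phased (|)
    if "." in alleles:
        return "missing"  # assumption: partially missing calls count as missing
    if all(allele == "0" for allele in alleles):
        return "hom_ref"
    return "variant"


def tally_record(vcf_line):
    """Count genotype classes across the sample columns of one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # column 9 is FORMAT
    counts = Counter()
    for column in fields[9:]:
        parts = column.split(":")
        # A bare "." sample column carries no subfields at all.
        genotype = parts[gt_index] if gt_index < len(parts) else "."
        counts[classify(genotype)] += 1
    return counts


# Shortened example record: five samples, one heterozygous call.
record = "\t".join([
    "chr10", "94025", ".", "T", "C", "123.00", "PASS",
    "AC=1;AN=2;DP=55", "GT:DP:GQ",
    ".", ".", "0/1:55:123", ".", ".",
])
counts = tally_record(record)
print(counts)
```

Running the same tally over every record of a file and summing the counters would show whether "hom_ref" ever appears at all.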
  • vivek_
    PhD Student
    • Jul 2012
    • 164

    #2
    Unless you force the genotyper to genotype all sites and emit them in the output, you will not get sites where all samples are homozygous reference. In other words, you will only have loci where at least one sample has a non-reference allele.
    Last edited by vivek_; 03-12-2013, 12:27 PM.


    • swbarnes2
      Senior Member
      • May 2008
      • 910

      #3
      Originally posted by slp
      Hi,
      I have got a vcf file from our collaborator which doesn't have any "0/0" or homozygous reference genotypes in it (which is hard to believe). Instead there are a lot of "./." genotypes. He says that they don't use vcf format in their pipeline but use sam2vcf.pl to convert to vcf format.

      The entries in VCF file are like:
      chr10 94025 . T C 123.00 PASS AC=1;AN=2;DP=55 GT:DP:GQ . . . . . . . . . . 0/1:55:123 . . . . . . . . .


      Does anybody have an idea why this is so?

      Thanks
      S
      That's weird. I would guess that the dots are meant to represent 0/0 genotypes, but it would be nice to have the quality scores of the 0/0 calls.

      Because yeah, there should be some loci where some, but not all of the samples are homozygous reference.


      • oiiio
        Senior Member
        • Jan 2011
        • 105

        #4
        I thought it was standard that "./." meant there was no confident genotype to be called, and that "0/0" meant homozygous ref was called. The best thing to do is just clarify this with your collaborators.


        • swbarnes2
          Senior Member
          • May 2008
          • 910

          #5
          Originally posted by oiiio
          I thought it was standard that "./." meant there was no confident genotype to be called, and that "0/0" meant homozygous ref was called. The best thing to do is just clarify this with your collaborators.
          I don't think that's standard. I don't see anything about that usage in the vcf standard. I use samtools mpileup to make vcfs, and it's never done that. It might say "0/0" with a quality score of 3, but it never says nothing.

          And even if that were standard, likely there should be one locus where one sample has a possible SNP, and at least one other sample is clearly homozygous.

          Not knowing the quality of the homozygous reference calls is going to make it very difficult to judge the quality of the deviations from that.


          • oiiio
            Senior Member
            • Jan 2011
            • 105

            #6
            You are right. To clarify, this is actually just GATK's behavior when told to emit all sites.


            • BAMseek
              Senior Member
              • Apr 2011
              • 124

              #7
              AFAIK, "./." means the call at that position is missing. It could be missing for a variety of reasons. It might be missing because a call was made but didn't meet some threshold and was filtered out. It could also be missing because multiple VCFs were merged together, and samples that don't have a call at a position are listed as "./.".

              Usually by default, when someone runs a variant caller on an individual sample, only the variant calls are emitted, so there are no "0/0" reference calls in the VCF file. That is fine if you want to know the variants of a single sample. It becomes a problem when you want to compare SNP calls across samples, because you can't assume that an absent call in the VCF means it was a reference call (it could also have been a position where the caller couldn't make an accurate call).

              One option is to force the caller to emit all calls, even reference calls. This will generate very large files. Another option is to call SNPs simultaneously on the samples and output a multi-sample VCF (which can be done with samtools mpileup or GATK). For this solution, if one of the samples has a variant, then the calls for all the other samples will be emitted too, even if they are reference or low quality. I prefer this way, because the VCF file is still small, and it is then an easy task to find positions that have different calls across the samples.

              Hope that is of some help. We've wrestled with how best to handle this quite a bit, so definitely interesting to hear how others manage it.

              Justin
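The last step described in the post above, finding positions whose calls differ across samples in a multi-sample VCF, can be sketched roughly as follows. This is an illustration under stated assumptions rather than anyone's actual pipeline: the function name `discordant_sites` and the example records are made up, phased and unphased genotypes are treated as equivalent, and missing calls are ignored when comparing.

```python
import re


def discordant_sites(vcf_lines):
    """Yield (chrom, pos, calls) for records whose called genotypes differ."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        gt_index = fields[8].split(":").index("GT")
        calls = {}
        for name, column in zip(samples, fields[9:]):
            parts = column.split(":")
            calls[name] = parts[gt_index] if gt_index < len(parts) else "."
        # Compare only called genotypes; "0|1" and "1/0" count as the same.
        called = {
            "/".join(sorted(re.split(r"[/|]", gt)))
            for gt in calls.values()
            if "." not in gt
        }
        if len(called) > 1:
            yield fields[0], int(fields[1]), calls


# Hypothetical two-sample input for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB",
    "chr1\t100\t.\tT\tC\t50\tPASS\t.\tGT:DP\t0/1:30\t0/0:25",
    "chr1\t200\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:30\t./.",
    "chr1\t300\t.\tA\tG\t50\tPASS\t.\tGT:DP\t1/1:30\t1/1:22",
]
sites = list(discordant_sites(example))
print(sites)
```

In this example only the record at position 100 is reported. The record at position 200 is skipped because sample B has no call, which is exactly the ambiguity discussed in this thread.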


              • slp
                Junior Member
                • Dec 2010
                • 9

                #8
                Originally posted by BAMseek
                Another option is to call SNPs simultaneously on the samples and output a multi-sample VCF (which can be done with samtools mpileup or GATK). For this solution, if one of the samples has a variant, then the calls for all the other samples will be emitted too, even if it's reference or low quality. I prefer this way, because the VCF file is still small, and it is then an easy task to find positions that have different calls across the samples.

                Justin

                Precisely, that's what I've always seen. The multi-sample VCF files have a variant in one of the samples, and the rest of the samples are either "0/0" or "./.". I asked the collaborator and they say that in their files "./." corresponds to homozygous reference.

                But I noticed this weird genotyping in the mouse multi-strain/sample InDels vcf (http://www.sanger.ac.uk/resources/mouse/genomes/) file too. For all the positions that have an InDel called in one of the mouse strains, no strain had a 0/0 genotype (everything else was "./."). There were no 0/0 genotypes in the C57BL/6NJ strain either, which is considered to be the reference strain.

                Clearly this is not in the standard vcf4.1 format. I've posted a question to [email protected] and am waiting to hear from them.


                • BAMseek
                  Senior Member
                  • Apr 2011
                  • 124

                  #9
                  Originally posted by slp
                  I asked the collaborator and they say that in their files "./." corresponds to homozygous reference.
                  This is a guess, but it may be that SNPs were called on the samples individually, and then those VCFs were merged into one multi-sample VCF. Provided there was sufficient evidence at a position for the caller to make a confident call, then those ./. calls would be homozygous reference calls. However, it could also be the case that the call is absent because the caller couldn't make a call at that position. The way I've dealt with this situation before is that I would go back to the BAM file to figure out the depth, and say if the depth is above a certain number, then it is a ref call, otherwise it is a no-call. That's the best I could think to do, but less than optimal.
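The depth-based rule described in the post above can be written down as a small function. This is only a sketch under assumptions: the name `resolve_missing` and the default threshold of 10 reads are hypothetical, and the per-sample depth is assumed to have been extracted from the BAM file beforehand (for example with a tool such as `samtools depth`).

```python
MISSING = {".", "./.", ".|."}


def resolve_missing(genotype, depth, min_depth=10):
    """Reinterpret a missing genotype using read depth from the BAM file.

    Called genotypes are returned unchanged. A missing genotype becomes
    "0/0" when depth >= min_depth and stays "./." otherwise.
    min_depth=10 is an arbitrary illustrative threshold, not a recommendation.
    """
    if genotype not in MISSING:
        return genotype
    if depth is not None and depth >= min_depth:
        return "0/0"  # enough reads and no variant emitted: assume reference
    return "./."      # too little evidence: keep it as a no-call


# Hypothetical depths for four samples at one position.
observed = [("0/1", 55), ("./.", 42), ("./.", 3), (".", None)]
resolved = [resolve_missing(gt, dp) for gt, dp in observed]
print(resolved)
```

Depth alone does not show that the reads support the reference allele, so this remains the approximation that the post describes as less than optimal.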
