  • No "0/0" (homozygous ref) genotypes in VCF file

    Hi,
    I have got a vcf file from our collaborator which doesn't have any "0/0" or homozygous reference genotypes in it (which is hard to believe). Instead there are a lot of "./." genotypes. He says that they don't use vcf format in their pipeline but use sam2vcf.pl to convert to vcf format.

    The entries in VCF file are like:
    chr10 94025 . T C 123.00 PASS AC=1; AN=2; DP=55; GT: DP:GQ . . . . . . . . . . 0/1:55:123 . . . . . . . . .


    Does anybody have an idea why is it so?

    Thanks
    S

  • #2
    Unless you force the genotyper to genotype all sites and emit them in the output, you will not get sites where all samples are homozygous reference. In other words, you will only have loci where at least one sample has a non-reference allele.
    Last edited by vivek_; 03-12-2013, 12:27 PM.


    • #3
      Originally posted by slp
      Hi,
      I have got a vcf file from our collaborator which doesn't have any "0/0" or homozygous reference genotypes in it (which is hard to believe). Instead there are a lot of "./." genotypes. He says that they don't use vcf format in their pipeline but use sam2vcf.pl to convert to vcf format.

      The entries in VCF file are like:
      chr10 94025 . T C 123.00 PASS AC=1; AN=2; DP=55; GT: DP:GQ . . . . . . . . . . 0/1:55:123 . . . . . . . . .


      Does anybody have an idea why is it so?

      Thanks
      S
      That's weird. I would guess that the dots are meant to represent 0/0 genotypes, but it's nice to have the quality score of the 0/0 calls.

      Because yeah, there should be some loci where some, but not all of the samples are homozygous reference.


      • #4
        I thought it was standard that "./." meant there was no confident genotype to be called, and that "0/0" meant homozygous ref was called. The best thing to do is just clarify this with your collaborators.


        • #5
          Originally posted by oiiio
          I thought it was standard that "./." meant there was no confident genotype to be called, and that "0/0" meant homozygous ref was called. The best thing to do is just clarify this with your collaborators.
          I don't think that's standard. I don't see anything about that usage in the vcf standard. I use samtools mpileup to make vcfs, and it's never done that. It might say "0/0" with a quality score of 3, but it never says nothing.

          And even if that were standard, likely there should be one locus where one sample has a possible SNP, and at least one other sample is clearly homozygous.

          Not knowing the quality of the homozygous reference calls is going to make it very difficult to judge the quality of the deviations from that.


          • #6
            You are right. To clarify, this is actually just GATK's behavior when told to emit all sites.


            • #7
              AFAIK, "./." means the call at that position is missing. It could be missing for a variety of reasons. It might be missing because a call was made, but it didn't meet some threshold and was filtered out. It could also be missing because multiple VCFs were merged together, and samples that don't have a call at a position are listed as "./."

              Usually by default, when someone runs a variant caller on an individual sample, only the variant calls are emitted, so there are no "0/0" reference calls in the VCF file, which is fine if you want to know the variants of a single sample. This becomes a problem when you want to compare SNP calls across samples, because you can't assume that an absent call in the VCF means it was a reference call (it could also have been a position where the caller couldn't make an accurate call).

              One option is to force the caller to emit all calls, even reference calls. This will generate very large files. Another option is to call SNPs simultaneously on the samples and output a multi-sample VCF (which can be done with samtools mpileup or GATK). For this solution, if one of the samples has a variant, then the calls for all the other samples will be emitted too, even if they are reference or low quality. I prefer this way, because the VCF file is still small, and it is then an easy task to find positions that have different calls across the samples.
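
              To illustrate the cross-sample comparison described above, here is a minimal Python sketch, not taken from any tool in this thread. It assumes a tab-separated multi-sample VCF data line with GT as the first FORMAT key, and it treats any genotype containing "." as a no-call rather than as reference. The function names and the three-sample record are made up for the example.

```python
import re


def parse_gt(sample_field):
    """Return the sorted allele tuple for one sample column, or None if missing."""
    gt = sample_field.split(":")[0]          # GT assumed to be the first key
    alleles = re.split(r"[/|]", gt)          # accept unphased "/" and phased "|"
    if "." in alleles:
        return None                          # "./." or ".": no call
    return tuple(sorted(alleles))


def summarize_site(vcf_line):
    """Count no-calls and hom-ref calls, and flag disagreement among called samples."""
    fields = vcf_line.rstrip("\n").split("\t")
    genotypes = [parse_gt(s) for s in fields[9:]]   # sample columns start at index 9
    called = [g for g in genotypes if g is not None]
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "n_no_call": len(genotypes) - len(called),
        "n_hom_ref": sum(1 for g in called if set(g) == {"0"}),
        # No-calls are ignored here: they are never counted as reference.
        "discordant": len(set(called)) > 1,
    }


# Hypothetical three-sample record, invented for this example.
line = "chr10\t94025\t.\tT\tC\t123\tPASS\tDP=55\tGT:DP:GQ\t0/0:40:99\t0/1:55:123\t./."
print(summarize_site(line))
```

              The point of keeping no-calls separate is that a site is only reported as discordant when at least two samples were actually called and disagree.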

              Hope that is of some help. We've wrestled with how best to handle this quite a bit, so definitely interesting to hear how others manage it.

              Justin


              • #8
                Originally posted by BAMseek
                Another option is to call SNPs simultaneously on the samples and output a multi-sample VCF (which can be done with samtools mpileup or GATK). For this solution, if one of the samples has a variant, then the calls for all the other samples will be emitted too, even if it's reference or low quality. I prefer this way, because the VCF file is still small, and it is then an easy task to find positions that have different calls across the samples.

                Justin

                Precisely, that's what I've always seen. Multi-sample VCF files have a variant in one of the samples, and the rest of the samples are either "0/0" or "./.". I asked the collaborator and they say that in their files "./." corresponds to homozygous reference.

                But I noticed this weird genotyping in the mouse multi-strain/sample InDels vcf file (http://www.sanger.ac.uk/resources/mouse/genomes/) too. For all the positions that have an InDel called in one of the mouse strains, no strain had a 0/0 genotype (everything else was "./."). There were no 0/0 genotypes in the C57BL/6NJ strain either, which is considered to be the reference strain.

                Clearly this is not in the standard vcf4.1 format. I've posted a question to [email protected] and am waiting to hear from them.


                • #9
                  Originally posted by slp
                  I asked the collaborator and they say that in their files "./." corresponds to homozygous reference.
                  This is a guess, but it may be that SNPs were called on the samples individually, and then those VCFs were merged into one multi-sample VCF. Provided there was sufficient evidence at a position for the caller to make a confident call, then those ./. calls would be homozygous reference calls. However, it could also be the case that the call is absent because the caller couldn't make a call at that position. The way I've dealt with this situation before is that I would go back to the BAM file to figure out the depth, and say if the depth is above a certain number, then it is a ref call, otherwise it is a no-call. That's the best I could think to do, but less than optimal.
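
                  The depth rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the per-sample depths are supplied as a plain dictionary (in practice they would have to be computed from the BAM files), the `min_depth` cutoff of 10 is an arbitrary placeholder, and the function and sample names are hypothetical.

```python
def resolve_no_call(gt, depth, min_depth=10):
    """Turn a missing genotype into "0/0" only when read depth supports it."""
    if gt not in (".", "./.", ".|."):
        return gt            # already called: leave untouched
    if depth is not None and depth >= min_depth:
        return "0/0"         # enough reads: assume homozygous reference
    return "./."             # too little evidence: keep as a no-call


# Hypothetical genotypes and depths at one position (depths assumed to come from the BAMs).
genotypes = {"sampleA": "./.", "sampleB": "0/1", "sampleC": "./."}
depths = {"sampleA": 42, "sampleB": 55, "sampleC": 3}

resolved = {s: resolve_no_call(gt, depths.get(s)) for s, gt in genotypes.items()}
print(resolved)
# {'sampleA': '0/0', 'sampleB': '0/1', 'sampleC': './.'}
```

                  Depth alone does not prove a reference genotype, since the reads at that position could still carry a non-reference allele that the caller rejected, so this stays a heuristic.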
