Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • The VCF file interpretation

    Hi all, i am working with VCF files from NGS data analysis generated using open source tools.

    In the VCF files we have genotypes listed as 0/1,1/1 etc, in my case i almost never see a row that is 0/0.

    Can someone please help with this, is there something obvious that might be missing which generates the 0/0 genotypes calls?

    Using the 1000 Genomes data.

    Thanks,
    Ashwin

  • #2
    What software/command line did you use to generate your VCF?

    Comment


    • #3
      Hi there,
      I am using SAMTools and VCFUtils, the following are the chain of commands.

      samtools sort eg2.bam eg2.sorted

      samtools mpileup -uf reference_genome_file eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

      bcftools view *.bcf | perl /usr/share/samtools/vcfutils.pl varFilter -D100 > *.vcf

      Comment


      • #4
        Your command line is requesting a variants-only VCF. So, sites with 0/0 (identical to the reference) shouldn't appear right?

        What happens if you use bcftools view -bcg? In this case, you are asking for all sites and you should see many 0/0?

        Comment


        • #5
          Hi there,
          I did run the command as you specified, but again i am not seeing a 0/0. I was working with sample HG115 from 1K Genomes and the reference assembly i used (from 1K genomes website) was hs37d5.fa

          Any ideas?

          Thanks.

          Comment


          • #6
            Hi ashkot,

            I think I mentioned this post by lh3 before, but he recommends not looking at the genotype calls in the vcf at all:



            I am running your command line on some of my data to see if I get the same as you

            Comment


            • #7
              ashkot,

              Out of curisiosity, can I ask what you are hoping to extract from the genotype calls? Are you trying to generate a consensus?

              Comment


              • #8
                jflowers,
                i am trying to determine the zygosity for variants. i have a database of variants and i want to find out if a person is homozygous wildtype, hetrozygous and homozygous risk allele for those variants.

                since in the vcf file i generated the calls are 0/1 or 1/1 i dont know if the absense of a variant means homozygous wildtype.

                i would prefer if such variants were also in the VCF.

                i am waiting to hear from you about the results that you get when you run the command-line.

                thanks.

                ashwin

                Comment


                • #9
                  Originally posted by ashkot View Post
                  Hi there,
                  I am using SAMTools and VCFUtils, the following are the chain of commands.

                  samtools sort eg2.bam eg2.sorted

                  samtools mpileup -uf reference_genome_file eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

                  bcftools view *.bcf | perl /usr/share/samtools/vcfutils.pl varFilter -D100 > *.vcf
                  I see the problem. Looking at the source code for vcfutils.pl varFilter there is the following line of code:

                  next if ($t[4] eq '.'); # skip non-var sites

                  $t[4] contains data for the fifth column in the vcf, so even if you request all sites to be output using bcftools view -cg calling vcfutils.pl varFilter is filtering invariant sites (ie site with the same basecall as the reference).

                  One workaround woud be to avoid using vcfutils.pl varFilter, so why not try:

                  samtools mpileup -uf reference_genome_file eg2.sorted.bam | bcftools view -cg - > *.vcf

                  If all you need to do is filter out read depth > 100 (-D 100) then this could easily be done without vcfutils. There are a number of vcf filtering tools available. Another solution would be to modify the vcfutils.pl varFilter to suit your needs, but I wouldn't do that unless you fully understand the implications of your modifications.

                  Comment


                  • #10
                    One final parting shot, given that Heng Li has indicated that there is a problem with the genotypes in the vcf (see example at http://biostar.stackexchange.com/que...quality-scores) I would hesitate before using the genotypes 0/0, 0/1, 1/1 by itself without considering the PL scores.

                    Comment


                    • #11
                      jflowers, i did exactly as you suggested and the program output a huge VCF file although there is not a single 0/0 row. I am not sure what could be going on here?

                      Any further ideas?

                      Thanks.

                      Comment


                      • #12
                        Good, at least now youre not filtering out the homozygous sites.

                        If you follow the example carefully from the biostar-exchange link I sent you, you will see that the genotypes (e.g., 0/0,0/1,1/1) are in fact not called correctly (I'm not sure why). Heng Li (samtools lead developer) to this post responded acknowledging this fact, so don't look at these genotypes.

                        Using the command line I suggested (without using vcfutils.pl varFilter) should have produced a vcf with many rows with reference allele column being A,T,G,or C and the alternate allele column is ".". Right? These sites are the homozygotes you are looking for.

                        Can you post say a few examples like this, then we can see what is going on.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Essential Discoveries and Tools in Epitranscriptomics
                          by seqadmin


                          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                          Yesterday, 07:01 AM
                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        55 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        51 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        45 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        55 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X