
  • What does the DV FORMAT tag mean in a VCF file?

    Hello everyone,

    I got a VCF file with the following line :

    ##FORMAT=<ID=DV,Number=1,Type=Integer,Description="Number of high-quality non-reference bases">
    My question is pretty simple, but I couldn't find an answer: what counts as "high-quality"? How can I find out which threshold was used?

    I used samtools mpileup and then bcftools call with the -m option.

  • #2
    The '-Q' option of samtools mpileup sets the quality threshold for bases; by default any base with a quality of less than 13 is ignored.
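
To put the threshold of 13 in perspective, here is a minimal Python sketch that converts Phred scores to error probabilities and flags which bases of a read would pass. It assumes the common Phred+33 ASCII encoding; the quality string and the helper names are made up for illustration and are not part of samtools.

```python
def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a quality string, assuming the Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


# Default base-quality threshold of mpileup's -Q, as stated in the post above.
MIN_BASEQ = 13

# Hypothetical quality string for a four-base read.
quals = decode_phred33("I5.#")

# Bases with a quality below the threshold are ignored, so >= passes.
passing = [q >= MIN_BASEQ for q in quals]

print(quals, passing, round(phred_to_error_prob(MIN_BASEQ), 3))
```

A quality of 13 corresponds to an error probability of about 5%, so the default threshold drops bases that have more than roughly a 1 in 20 chance of being wrong.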


    • #3
      The SNP is not filtered out, since it appears in my VCF file.
      For example, I can have a SNP with a DP of 40, but the DV is 4, 3 and 5 for my 3 samples.

      So I don't know what that means...


      • #4
        The '-Q' option of mpileup applies to individual bases, not to entire reads or to SNPs inferred from reads. Base quality can vary along the length of a read, so this statistic counts only the bases at a position that are high quality, not every read that covers it.
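
The per-base filtering described above can be sketched in Python as follows. This is only an illustration under simple assumptions: the function name and the pileup column are hypothetical, and the real samtools/bcftools pipeline may apply further filters that are not modelled here.

```python
def count_high_quality(column, ref_base, min_baseq=13):
    """Count high-quality bases in one pileup column.

    column: list of (base, base_quality) pairs, one per read covering the site.
    Returns (high-quality depth, high-quality non-reference count).
    """
    # Keep only bases whose own quality meets the threshold; the rest of
    # the read is unaffected and may still count at other positions.
    kept = [(base, qual) for base, qual in column if qual >= min_baseq]
    depth = len(kept)
    non_ref = sum(1 for base, _ in kept if base.upper() != ref_base.upper())
    return depth, non_ref


# Hypothetical column: six reads cover a site whose reference base is T.
column = [("T", 38), ("T", 40), ("G", 35), ("G", 8), ("T", 12), ("G", 30)]
depth, non_ref = count_high_quality(column, "T")
print(len(column), depth, non_ref)
```

Here the raw depth is 6, but two bases fall below quality 13, so only 4 bases count as high quality and 2 of those are non-reference.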


        • #5
          Originally posted by ClemBuntu
          The SNP is not filtered out, since it appears in my VCF file.
          For example, I can have a SNP with a DP of 40, but the DV is 4, 3 and 5 for my 3 samples.

          So I don't know what that means...
          DP is the total number of reads that cover the SNP position; of those, four contained the SNP (e.g., G when the reference is T) for sample A, three for sample B, and five for sample C. The remaining reads typically match the reference base (T), although some may carry a different non-reference base (A or C).
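
To read these numbers out of a record, a small Python sketch like the one below can help. The record is invented for illustration (it reuses the DP of 40 and the DV values 4, 3 and 5 mentioned in the thread; every other field is an assumption), and the helper names are hypothetical. For real work a dedicated VCF parser is the safer choice.

```python
def info_value(vcf_line, key):
    """Return the value of an INFO key (column 8) as a string, or None."""
    info = vcf_line.rstrip("\n").split("\t")[7]
    for entry in info.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


def per_sample_values(vcf_line, key):
    """Return one FORMAT value per sample, in sample-column order."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")   # column 9 names the per-sample keys
    idx = format_keys.index(key)
    return [sample.split(":")[idx] for sample in fields[9:]]


# Hypothetical three-sample record, tab-separated as in a VCF body line.
record = "\t".join([
    "chr1", "1000", ".", "T", "G", "50", ".", "DP=40",
    "GT:DP:DV", "0/1:12:4", "0/1:11:3", "0/1:13:5",
])

raw_depth = int(info_value(record, "DP"))                    # INFO/DP
sample_dv = [int(v) for v in per_sample_values(record, "DV")]
sample_dp = [int(v) for v in per_sample_values(record, "DP")]
print(raw_depth, sample_dp, sample_dv)
```

Note that INFO/DP is a single site-wide value, while FORMAT/DP and FORMAT/DV are reported once per sample.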


          • #6
            Originally posted by HESmith
            DP is the total number of reads that cover the SNP position; of those, four contained the SNP (e.g., G when the reference is T) for sample A, three for sample B, and five for sample C. The remaining reads typically match the reference base (T), although some may carry a different non-reference base (A or C).
            Then why is the DV description "Number of high-quality non-reference bases" and not "Number of non-reference bases"?

            Plus, you're talking about the DP in INFO:
            ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">

            But if you look at the DP in FORMAT:
            ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">

            I could also ask what the difference is between DP and DP4:

            ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">


            • #7
              Then why is the DV description "Number of high-quality non-reference bases" and not "Number of non-reference bases"?
              Because it's usually only the high-quality bases that should be looked at when identifying variants.

              I could also ask what the difference is between DP and DP4
              DP4 breaks the high-quality depth down by allele and strand, which gives another way of spotting problem regions. A real variant should be supported by roughly equal numbers of forward and reverse reads (assuming a sample prep that is not strand-specific). A strong imbalance may indicate that something odd is going on with the reads that span the region.
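
The strand check described above can be sketched in Python. The function name and the DP4 values are hypothetical, and the sketch only reports the forward-strand fraction of the alt-supporting bases; it is not the strand-bias statistic that bcftools itself computes.

```python
def strand_summary(dp4):
    """Summarise a DP4 value: [ref-forward, ref-reverse, alt-forward, alt-reverse]."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    alt_total = alt_fwd + alt_rev
    return {
        "ref": ref_fwd + ref_rev,
        "alt": alt_total,
        # Fraction of alt-supporting bases on the forward strand;
        # None when there is no alt support at all.
        "alt_fwd_fraction": alt_fwd / alt_total if alt_total else None,
    }


# Hypothetical DP4 values with the same total alt support.
balanced = strand_summary([14, 12, 6, 6])
skewed = strand_summary([14, 12, 12, 0])
print(balanced["alt_fwd_fraction"], skewed["alt_fwd_fraction"])
```

A fraction near 0.5 is what an unbiased site would be expected to show; a value close to 0 or 1, as in the second case, is the kind of imbalance worth a closer look.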


              • #8
                Originally posted by gringer
                Because it's usually only the high-quality bases that should be looked at when identifying variants.
                I agree. And according to your previous post, "high quality" means at or above the '-Q' value given to mpileup (i.e., 13 by default), right?


                • #9
                  I expect so. Changing the '-Q' option changes how many bases appear in the mpileup output, and I would expect those to be the only bases that go into the variant calculations.
