  • mpileup for SNP (PV4)

    I got the following INFO fields in raw.bcf:
    DP=118;AF1=1;CI95=1,1;DP4=0,0,1,42;MQ=20;FQ=-156
    DP=154;AF1=1;CI95=1,1;DP4=0,1,1,42;MQ=20;FQ=-139;PV4=1,1,1,1

    I am confused about why the first record has no PV4. As I understand it, PV4 holds four p-values: strand bias, baseQ bias, mapQ bias and tail distance bias, obtained respectively from an exact test on DP4 and t-tests on baseQ, mapQ and tail distance. The first record does have DP4, so why can't we get the strand bias? If it is because there are no reads in the reference group, why can't we get the other three biases? Are they one-sample or two-sample t-tests? If these tests check whether baseQ or mapQ tends toward a fixed number, how is that fixed number determined? Thanks.

    Also, once we have PV4, how do we judge SNP quality from these p-values? Obviously we do not want large biases. I thought that the lower the p-value, the more significant the bias, but I am not sure that is right. Taking tail distance bias as an example, do we have to pool all the tail distances first and then do the t-test? If so, I think a lower p-value would correspond to a wider distribution of tail distances and thus a lower bias. Again, this is just my guess. I am really confused about all these values.

    I know that in vcfutils.pl varFilter all the filter options for PV4 are minimum values. Why? Does that mean a larger p-value is better? Does anyone have optimal values for these filter options?

    Thank you very much,

    fanping
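
The "minimum value" style of filtering asked about above can be sketched as follows. This is only an illustration, not the vcfutils.pl code: the function names and the thresholds in `MIN_P` are hypothetical, and letting records without PV4 pass is an assumption. The idea is that a small p-value means a significant bias, so a record is kept only if every PV4 p-value is at or above its minimum.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Hypothetical thresholds, for illustration only (not recommended values).
MIN_P = {"strand": 1e-4, "baseQ": 1e-4, "mapQ": 1e-4, "tail": 1e-4}
PV4_ORDER = ["strand", "baseQ", "mapQ", "tail"]


def passes_pv4(info, min_p=MIN_P):
    """Return True if no PV4 p-value falls below its minimum.

    Assumption: records without PV4 pass, since there is nothing to test.
    """
    fields = parse_info(info)
    if "PV4" not in fields:
        return True
    values = [float(x) for x in fields["PV4"].split(",")]
    return all(p >= min_p[name] for name, p in zip(PV4_ORDER, values))


# The second record from the post: all four p-values are 1, so it passes.
print(passes_pv4("DP=154;AF1=1;CI95=1,1;DP4=0,1,1,42;MQ=20;FQ=-139;PV4=1,1,1,1"))
```

Under this reading a larger p-value is "better" only in the sense that it gives no evidence of a bias; it says nothing more about the call.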

  • #2
    These are good questions that I've been wondering about as well. Hopefully somebody (perhaps Heng himself) can give a good response.

    • #3
      Thanks for your reply. I hope someone can answer our questions.

      • #4
        Well, empirically, the PV4 scores only show up on mixed calls, not homozygous ones. So I think those values assess whether there is a significant quality difference between the reads saying the alternate letter and the reads saying the reference letter. If all the reads saying the reference letter are great quality, come from both directions, and have the reference letter at the start of some reads and the middle of others, while all of the reads saying the alternate letter have crap mapQ, mostly come from one direction, and have the alternate letter in the last 4 bases of the read, you probably don't have a real mixed letter at all, because the reads saying you have an alternate letter are messed up.

        So you can't have those stats for homozygous calls. There's nothing to compare to.

        • #5
          So does that mean PV4 is a property of genotype 0/1? I know lots of genotype 1/1 calls also have PV4.

          If PV4 only describes the hets and my data comes from a haploid genome, which means I have to filter out all the hets, then PV4 will give me no information on the quality of the SNPs or INDELs. Is my understanding right? Thank you very much.


          • #6
            Just because you are doing a haploid genome doesn't mean that you can't have genuine mixed calls. Submitters don't always give clonal samples.

            For the 1/1 calls with PV4 values, do the DP4 values show at least one read for the reference allele? I bet they do.

            So yes, it looks like the PV4 isn't going to help you evaluate homozygous calls. It doesn't look like it's supposed to. It looks like it's supposed to tell you on mixed calls whether the evidence supporting one of the alleles is suspect.

            Use the DP4, and the GQ, and the PL to evaluate homozygous calls.
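
A quick way to check this on the two records from the first post is sketched below. It assumes the usual DP4 order (ref-forward, ref-reverse, alt-forward, alt-reverse); the helper names are made up for illustration.

```python
def dp4_counts(info):
    """Return (ref_fwd, ref_rev, alt_fwd, alt_rev) from an INFO string, or None."""
    for item in info.split(";"):
        if item.startswith("DP4="):
            return tuple(int(x) for x in item[len("DP4="):].split(","))
    return None


def has_both_alleles(info):
    """True if DP4 shows at least one reference read and one alternate read."""
    counts = dp4_counts(info)
    if counts is None:
        return False
    ref_fwd, ref_rev, alt_fwd, alt_rev = counts
    return (ref_fwd + ref_rev) > 0 and (alt_fwd + alt_rev) > 0


records = [
    "DP=118;AF1=1;CI95=1,1;DP4=0,0,1,42;MQ=20;FQ=-156",
    "DP=154;AF1=1;CI95=1,1;DP4=0,1,1,42;MQ=20;FQ=-139;PV4=1,1,1,1",
]
for info in records:
    # Prints whether both alleles have reads, then whether PV4 is present.
    print(has_both_alleles(info), "PV4=" in info)
# False False
# True True
```

On these two records, PV4 is present exactly when there is at least one reference read to compare against, which fits the explanation above. Two records are of course not proof.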

            • #7
              Thanks for your concise and useful reply. You are right that the 1/1 calls with PV4 do have some reference allele reads.

              I just remembered that the strand bias in PV4 (i.e. the first value) can be calculated from DP4 using an exact test. (P.S. I calculated it and it works.) If DP4 is not restricted to 0/1 calls, why is PV4 only for 0/1?

              I would appreciate it if you could also help me understand this. Thanks.
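
For anyone who wants to repeat the exact-test calculation mentioned above, here is a minimal sketch in plain Python. It assumes the strand bias value is a two-sided Fisher exact test on the 2x2 table formed from DP4 (reference forward/reverse in one row, alternate forward/reverse in the other); that is my reading of the thread, not something confirmed from the samtools source, and the function name is made up.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table that is no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# DP4=0,1,1,42 from the first post: ref row (0, 1), alt row (1, 42).
print(fisher_exact_two_sided(0, 1, 1, 42))
```

For DP4=0,1,1,42 this gives a p-value of 1 (up to rounding), which agrees with the first PV4 value in that record. With DP4=0,0,1,42 the reference row is empty, so the table has only one possible outcome and the test carries no information.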


              • #8
                I would like to know if anyone has optimal values for PV4, especially for filtering indels.

                Thanks

                • #9
                  Computing PV4 for multi-base variants

                  Hi!

                  Does anyone know how PV4 is computed for multi-base variants (MBVs)? More specifically:
                  1) How is tail distance computed for an MBV? Is it the shortest distance from the start of the MBV to either end of the read?
                  2) How is base quality bias computed for an MBV?

                  Thanks
