  • Missing information in VCFs

    Hi everybody!
    I've just started working with all these bioinformatic analysis tools, so please forgive me if my question is stupid.

    I am comparing the VCF files I get from SAMTools with the old Pileup files, and there is some information that is present in the Pileup but absent from the VCFs.
    The part in question is the last two columns of the pileup line below:

    Pileup:
    chr10 93489 G S 26 26 33 8 c,,,cc,, W]Z]QMZJ

    VCF:
    chr10 94083 . C T 71 . DP=43;AF1=0.5;CI95=0.5,0.5;DP4=5,5,2,25;MQ=23;FQ=47;PV4=0.0092,4.4e-22,1,1 GT:PL:GQ 0/1:101,0,74:77

    Does anyone know how to extract this info using the VCF format?

    Thanks a lot
    Amit

  • #2
    Hi Amit,

    the part from the pileup that you underlined corresponds to two columns:

    c,,,cc,, the bases of all reads (8 in this case) covering the given position, and
    W]Z]QMZJ the base qualities of those 8 bases

    Check here for the details of the pileup format and on how bases and qualities are annotated.
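As a rough illustration of how the read-bases column can be tallied, here is a minimal Python sketch. It assumes the usual samtools pileup conventions: `.` and `,` are reference matches on the forward and reverse strand, letters are mismatches, `^` plus one character marks a read start, `$` marks a read end, and `+N`/`-N` introduce indels of length N. The function name `count_pileup_bases` is made up for this example.

```python
import re
from collections import Counter


def count_pileup_bases(bases: str) -> Counter:
    """Count reference matches and mismatching bases in a pileup read-bases string."""
    # Drop read-start markers: '^' plus the single mapping-quality character after it.
    bases = re.sub(r"\^.", "", bases)
    # Drop read-end markers.
    bases = bases.replace("$", "")
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            i = j + length  # skip the indel sequence itself
            continue
        if ch in ".,":
            counts["ref"] += 1   # match to the reference on either strand
        elif ch == "*":
            counts["del"] += 1   # placeholder for a deleted base
        else:
            counts[ch.upper()] += 1  # mismatch, strand ignored
        i += 1
    return counts


# The read-bases column from the pileup line in the question.
print(count_pileup_bases("c,,,cc,,"))
```

For the line in the question this gives 5 reads matching the reference (G) and 3 reads showing C, which adds up to the reported depth of 8.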

    This information does appear somewhat summarized in the vcf format:

    The 4th and 5th columns show the reference and alternative base, as summarized from the frequency of each base at this position in the pileup (in your example, the positions of the pileup and the vcf do not coincide).

    As for the base qualities, the VCF format defines an INFO field named BQ that gives the root mean square of the base qualities at this position, as a summary of the individual base qualities in the pileup. (Your example does not have this field, though; you might try another variant caller.)
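To make the "root mean square of the base qualities" concrete, here is a small Python sketch that computes it from a pileup quality string. The function name `rms_base_quality` is hypothetical, and the ASCII offset is an assumption: 33 is the Sanger/Phred+33 convention, while older Illumina pipelines used 64. The thread does not say which encoding the example data uses, so the offset is left as a parameter.

```python
import math


def rms_base_quality(quals: str, offset: int = 33) -> float:
    """Root mean square of the Phred scores encoded in a quality string.

    offset is the ASCII offset of the encoding (assumed 33 by default).
    """
    scores = [ord(c) - offset for c in quals]
    return math.sqrt(sum(q * q for q in scores) / len(scores))


# Quality column from the pileup line in the question, under both common offsets.
print(rms_base_quality("W]Z]QMZJ", offset=33))
print(rms_base_quality("W]Z]QMZJ", offset=64))
```

This is only the arithmetic behind such a summary value; whether a given caller writes a BQ field at all depends on the tool.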

    Check here for the details of the vcf format.

    Obviously, the pileup format gives you more detail in some respects, since the focus of the vcf is a different one. Still, most of the information is kept, in summarized form.

    Cheers!


    • #3
      Hi sdvie

      Thanks for the quick reply!

      I am currently using SAMTools 0.1.16 for this process. What would you suggest using to get this information?

      cheers


      • #4
        Hi Amit,

        unfortunately, I could not find any tool that outputs a vcf file containing the BQ field.
        (I am mostly using the GATK pipeline and the GATK Unified Genotyper on my bam files.)

        Maybe someone else knows more...

        cheers,
        Sophia


        • #5
          sorry, have to correct myself:

          there is an option in samtools:

          calmd -r

          Looks like you have to generate an extended sam file with this command first and then generate the pileup from this one to have the BQ tag included.

          Never used this one before... live and learn.

          cheers!
          Last edited by sdvie; 08-30-2011, 02:21 AM.


          • #6
            Thank you Sophia!
            Another thing: I need to know how many reads support the reference and how many support the variant.
            Is this information present in the VCF?

            Good day
            Amit


            • #7
              Yes, see the DP and DP4 fields in the INFO column.
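A minimal Python sketch of pulling those counts out of a VCF record is shown below. In samtools/bcftools output, DP4 lists the number of high-quality reference-forward, reference-reverse, alternate-forward and alternate-reverse bases, while DP is the raw read depth. The helper names `parse_info` and `ref_alt_counts` are invented for this example, and the line is split on any whitespace so that the record pasted in the question works (real VCF files are tab-delimited).

```python
def parse_info(info: str) -> dict:
    """Turn a VCF INFO string such as 'DP=43;DP4=5,5,2,25' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag-style entries have no '=' and are stored as True.
        fields[key] = value if sep else True
    return fields


def ref_alt_counts(vcf_line: str) -> tuple:
    """Return (reference reads, variant reads) from the DP4 field of one VCF record."""
    columns = vcf_line.split()
    info = parse_info(columns[7])  # INFO is the 8th VCF column
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in info["DP4"].split(","))
    return ref_fwd + ref_rev, alt_fwd + alt_rev


# The VCF record from the question.
line = ("chr10 94083 . C T 71 . "
        "DP=43;AF1=0.5;CI95=0.5,0.5;DP4=5,5,2,25;MQ=23;FQ=47;PV4=0.0092,4.4e-22,1,1 "
        "GT:PL:GQ 0/1:101,0,74:77")
print(ref_alt_counts(line))  # (10, 27)
```

For this record, 10 reads support the reference and 27 support the variant. The total of 37 is lower than DP=43 because DP4 only counts bases that pass the quality filter.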


              • #8
                YES!! Just what I needed!

                Thanks a lot, you guys, you saved my life!

                Have a cheerful day =D
