Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • VCF Indel encoding question

    Hi,

    I'm having problems understanding a GATK output VCF. I have read the VCF standard, but I'm obviously missing something.

    I /think/ I understand how SNPs and short indels are represented, but clearly I do not. Below is an excerpt that illustrates sites which I do not understand. I suspect it may be something to do with GATK quality filters that I'm not understanding...

    The excerpt below was generated using

    GATK -l INFO -I my.bam -R my.fa -T UnifiedGenotyper -S LENIENT -nt 8 --heterozygosity 0.1 -o test.vcf --genotype_likelihoods_model BOTH --min_base_quality_score 10 --output_mode EMIT_ALL_SITES -ploidy 2

    Thanks!

    Darren

    -------------------------------------------------------
    Code:
    CH1	225	.	T	G	12.71	LowQual	AC=1;AF=0.500;AN=2;BaseQRankSum=1.978;DP=59;Dels=0.03;FS=0.000;HaplotypeScore=10.2840;MLEAC=1;MLEAF=0.500;MQ=70.25;MQ0=8;MQRankSum=-5.349;QD=0.22;ReadPosRankSum=-3.188	GT:AD:DP:GQ:PL	0/1:41,16:55:20:20,0,1435
    CH1	226	.	T	.	121.53	.	AN=2;DP=59;MQ=70.25;MQ0=8	GT:DP	0/0:43
    CH1	227	.	A	.	121.53	.	AN=2;DP=59;MQ=70.25;MQ0=8	GT:DP	0/0:43
    CH1	228	.	T	.	121.53	.	AN=2;DP=59;MQ=70.25;MQ0=8	GT:DP	0/0:43
    CH1	229	.	A	.	115.53	.	AN=2;DP=57;MQ=69.66;MQ0=8	GT:DP	0/0:38
    CH1	230	.	C	.	115.53	.	AN=2;DP=57;MQ=69.66;MQ0=8	GT:DP	0/0:38
    CH1	231	.	T	.	115.53	.	AN=2;DP=57;MQ=69.66;MQ0=8	GT:DP	0/0:36
    CH1	232	.	G	.	115.53	.	AN=2;DP=57;MQ=69.66;MQ0=8	GT:DP	0/0:36
    CH1	233	.	C	.	115.53	.	AN=2;DP=57;MQ=69.66;MQ0=8	GT:DP	0/0:37
    CH1	234	.	A	.	139.53	.	AN=2;DP=70;MQ=59.20;MQ0=14	GT:DP	0/0:63
    CH1	235	.	A	.	175.53	.	AN=2;DP=84;MQ=51.67;MQ0=15	GT:DP	0/0:79
    CH1	236	.	A	.	175.53	.	AN=2;DP=84;MQ=51.67;MQ0=15	GT:DP	0/0:79
    CH1	237	.	T	.	175.53	.	AN=2;DP=85;MQ=51.37;MQ0=16	GT:DP	0/0:80
    CH1	238	.	A	.	175.53	.	AN=2;DP=102;MQ=46.90;MQ0=28	GT:DP	0/0:97
    CH1	238	.	A	AGAAAGAAAGCTTGTA	83.73	.	AC=1;AF=0.500;AN=2;BaseQRankSum=6.172;DP=102;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=46.90;MQ0=0;MQRankSum=-6.190;QD=0.05;ReadPosRankSum=-5.733	GT:AD:DP:GQ:PL	0/1:27,25:57:99:121,0,4853
    CH1	239	.	A	.	175.53	.	AN=2;DP=102;MQ=46.90;MQ0=28	GT:DP	0/0:101
    CH1	240	.	T	.	175.53	.	AN=2;DP=102;MQ=46.90;MQ0=28	GT:DP	0/0:98
    CH1	241	.	A	.	169.53	.	AN=2;DP=108;MQ=44.14;MQ0=29	GT:DP	0/0:107
    CH1	242	.	T	.	169.53	.	AN=2;DP=109;MQ=43.94;MQ0=29	GT:DP	0/0:103
    CH1	242	.	T	.	118.27	.	AN=2;DP=109;MQ=43.94;MQ0=29	GT:AD:DP	0/0:27:55
    CH1	243	.	C	.	172.53	.	AN=2;DP=110;MQ=43.76;MQ0=29	GT:DP	0/0:108
    CH1	243	.	CTTTT	.	118.27	.	AN=2;DP=110;MQ=43.76;MQ0=29	GT:AD:DP	0/0:27:56
    CH1	244	.	T	.	91.53	.	AN=2;DP=110;MQ=43.76;MQ0=29	GT:DP	0/0:61
    CH1	245	.	T	.	91.53	.	AN=2;DP=110;MQ=43.76;MQ0=29	GT:DP	0/0:53
    CH1	246	.	T	.	73.53	.	AN=2;DP=110;MQ=43.76;MQ0=29	GT:DP	0/0:41
    CH1	247	.	T	.	91.53	.	AN=2;DP=110;MQ=43.76;MQ0=29	GT:DP	0/0:46
    CH1	248	.	A	.	172.53	.	AN=2;DP=116;MQ=42.61;MQ0=31	GT:DP	0/0:100
    CH1	249	.	A	.	172.53	.	AN=2;DP=116;MQ=42.61;MQ0=31	GT:DP	0/0:100
    CH1	250	.	T	.	172.53	.	AN=2;DP=117;MQ=42.43;MQ0=32	GT:DP	0/0:101
    CH1	251	.	T	.	169.53	.	AN=2;DP=117;MQ=42.43;MQ0=32	GT:DP	0/0:96
    CH1	251	.	T	.	118.27	.	AN=2;DP=117;MQ=42.43;MQ0=32	GT:AD:DP	0/0:27:56
    CH1	252	.	C	.	172.53	.	AN=2;DP=117;MQ=42.43;MQ0=32	GT:DP	0/0:113
    CH1	253	.	C	.	172.53	.	AN=2;DP=117;MQ=42.43;MQ0=32	GT:DP	0/0:110
    CH1	254	.	T	.	172.53	.	AN=2;DP=117;MQ=42.43;MQ0=32	GT:DP	0/0:111
    CH1	255	.	T	.	172.53	.	AN=2;DP=117;MQ=42.43;MQ0=32	GT:DP	0/0:111
    CH1	256	.	T	.	172.53	.	AN=2;DP=117;MQ=42.43;MQ0=32	GT:DP	0/0:111
    Line 1 is a SNP
    Lines 14 and 15 are an indel that I do understand
    Lines 19 and 20 I do /not/ understand
    Lines 21 and 22 I do /not/ understand
    ---------------------------------------------------

  • #2
    Darren, maybe you should re-post with the lines numbered. You can use the Unix "nl" command to do this.

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin


      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
      Yesterday, 07:01 AM
    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    51 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    50 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 09:21 AM
    0 responses
    44 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-04-2024, 09:00 AM
    0 responses
    55 views
    0 likes
    Last Post seqadmin  
    Working...
    X