Hello,
I have been unable to extract depth 'DP' info from my vcf files (generated with latest bcftools variant calling (SNPs)) after being converted to vcfr for R analyses.
bcftools may break the vcf formatting. I cannot extract depth data in R packages, for e.g., all I get is 'NA' for each depth (DP). I can see the DP value in the vcf, but R packages cannot see it.
VCF format after bcftools (DP values are clearly present):
Chr10_nm_RagTag 187 . A T 4.82181 LowQual DP=38;VDB=0.407747;SGB=-0.0966945;RPB=0.0621765;MQB=0.984496;MQSB=0.618783;BQB=0.810528;MQ0F=0.157895;AC=2;AN=26;DP4=7,2,3,3;MQ=9 GT:PL:AD:QS ./.:0,0,0:0,0:0,0 0/0:0,3,4:1,0:4,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,
but the often used function "extract.gt" misses the DP values and only returns items highlighted in green.
> dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
> dp[1:4,1:6]
MZ195912 MZ195913 MZ195914 MZ195915 MZ195916 MZ195930
Chr10_nm_RagTag_187 NA NA NA NA NA NA
Chr10_nm_RagTag_188 NA NA NA NA NA NA
Chr10_nm_RagTag_194 NA NA NA NA NA NA
Chr10_nm_RagTag_195 NA NA NA NA NA NA
I think this is because bcftools filtering placed DP info in the wrong field and when data are converted to the required R format, the info is lost. If I do a head on the converted data I see no DP info:
head(vcf)
[1] "***** Object of class 'vcfR' *****"
[1] "***** Meta section *****"
[1] "##fileformat=VCFv4.2"
[1] "##FILTER=<ID=PASS,Description="All filters passed">"
[1] "##bcftoolsVersion=1.12+htslib-1.12"
[1] "##bcftoolsCommand=mpileup -Ou -I -B -C 50 -a QS -a AD --threads 32 - [Truncated]"
[1] "##reference=file:///home/bsimison/Projects/Neotoma/96_LC-genomes/Nbr [Truncated]"
[1] "##contig=<ID=Chr10_nm_RagTag,length=45123732>"
[1] "First 6 rows."
[1]
[1] "***** Fixed section *****"
CHROM POS ID REF ALT QUAL FILTER
[1,] "Chr10_nm_RagTag" "187" NA "A" "T" "4.82181" "LowQual"
[2,] "Chr10_nm_RagTag" "188" NA "G" "A" "42.0559" "LowQual"
[3,] "Chr10_nm_RagTag" "194" NA "G" "C" "3.74867" "LowQual"
[4,] "Chr10_nm_RagTag" "195" NA "A" "C" "5.61943" "LowQual"
[5,] "Chr10_nm_RagTag" "488" NA "G" "C" "32.0871" "PASS"
[6,] "Chr10_nm_RagTag" "518" NA "T" "C" "157.534" "PASS"
[1]
[1] "***** Genotype section *****"
FORMAT MZ195912 MZ195913
[1,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "0/0:0,3,4:1,0:4,0"
[2,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[3,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[4,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[5,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[6,] "GT:PL:AD:QS" "1/1:29,3,0:0,1:0,29" "./.:0,0,0:0,0:0,0"
MZ195914 MZ195915 MZ195916
[1,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[2,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[3,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[4,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[5,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[6,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[1] "First 6 columns only."
[1]
[1] "Unique GT formats:"
[1] "GT:PL:AD:QS"
[1]
you can see that only "GT:PL:AD:QS" are captured, but not DP.
Anybody know what's going on here?
I have been unable to extract depth 'DP' info from my vcf files (generated with latest bcftools variant calling (SNPs)) after being converted to vcfr for R analyses.
bcftools may break the vcf formatting. I cannot extract depth data in R packages, for e.g., all I get is 'NA' for each depth (DP). I can see the DP value in the vcf, but R packages cannot see it.
VCF format after bcftools (DP values are clearly present):
Chr10_nm_RagTag 187 . A T 4.82181 LowQual DP=38;VDB=0.407747;SGB=-0.0966945;RPB=0.0621765;MQB=0.984496;MQSB=0.618783;BQB=0.810528;MQ0F=0.157895;AC=2;AN=26;DP4=7,2,3,3;MQ=9 GT:PL:AD:QS ./.:0,0,0:0,0:0,0 0/0:0,3,4:1,0:4,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,
but the often used function "extract.gt" misses the DP values and only returns items highlighted in green.
> dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
> dp[1:4,1:6]
MZ195912 MZ195913 MZ195914 MZ195915 MZ195916 MZ195930
Chr10_nm_RagTag_187 NA NA NA NA NA NA
Chr10_nm_RagTag_188 NA NA NA NA NA NA
Chr10_nm_RagTag_194 NA NA NA NA NA NA
Chr10_nm_RagTag_195 NA NA NA NA NA NA
I think this is because bcftools filtering placed DP info in the wrong field and when data are converted to the required R format, the info is lost. If I do a head on the converted data I see no DP info:
head(vcf)
[1] "***** Object of class 'vcfR' *****"
[1] "***** Meta section *****"
[1] "##fileformat=VCFv4.2"
[1] "##FILTER=<ID=PASS,Description="All filters passed">"
[1] "##bcftoolsVersion=1.12+htslib-1.12"
[1] "##bcftoolsCommand=mpileup -Ou -I -B -C 50 -a QS -a AD --threads 32 - [Truncated]"
[1] "##reference=file:///home/bsimison/Projects/Neotoma/96_LC-genomes/Nbr [Truncated]"
[1] "##contig=<ID=Chr10_nm_RagTag,length=45123732>"
[1] "First 6 rows."
[1]
[1] "***** Fixed section *****"
CHROM POS ID REF ALT QUAL FILTER
[1,] "Chr10_nm_RagTag" "187" NA "A" "T" "4.82181" "LowQual"
[2,] "Chr10_nm_RagTag" "188" NA "G" "A" "42.0559" "LowQual"
[3,] "Chr10_nm_RagTag" "194" NA "G" "C" "3.74867" "LowQual"
[4,] "Chr10_nm_RagTag" "195" NA "A" "C" "5.61943" "LowQual"
[5,] "Chr10_nm_RagTag" "488" NA "G" "C" "32.0871" "PASS"
[6,] "Chr10_nm_RagTag" "518" NA "T" "C" "157.534" "PASS"
[1]
[1] "***** Genotype section *****"
FORMAT MZ195912 MZ195913
[1,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "0/0:0,3,4:1,0:4,0"
[2,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[3,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[4,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[5,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[6,] "GT:PL:AD:QS" "1/1:29,3,0:0,1:0,29" "./.:0,0,0:0,0:0,0"
MZ195914 MZ195915 MZ195916
[1,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[2,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[3,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[4,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[5,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[6,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[1] "First 6 columns only."
[1]
[1] "Unique GT formats:"
[1] "GT:PL:AD:QS"
[1]
you can see that only "GT:PL:AD:QS" are captured, but not DP.
Anybody know what's going on here?
Comment