SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   General (http://seqanswers.com/forums/forumdisplay.php?f=16)
-   -   R, vcfr, extrcat.gt, bcftools filter (http://seqanswers.com/forums/showthread.php?t=101957)

wbsimey 08-19-2021 10:06 AM

R, vcfr, extrcat.gt, bcftools filter
 
Hello,
I have been unable to extract depth 'DP' info from my vcf files (generated with latest bcftools variant calling (SNPs)) after being converted to vcfr for R analyses.

bcftools may break the vcf formatting. I cannot extract depth data in R packages, for e.g., all I get is 'NA' for each depth (DP). I can see the DP value in the vcf, but R packages cannot see it.

VCF format after bcftools (DP values are clearly present):
Chr10_nm_RagTag 187 . A T 4.82181 LowQual DP=38;VDB=0.407747;SGB=-0.0966945;RPB=0.0621765;MQB=0.984496;MQSB=0.618783;BQB=0.810528;MQ0F=0.157895;AC=2;AN=26;DP4=7,2,3,3;MQ=9 GT:PL:AD:QS ./.:0,0,0:0,0:0,0 0/0:0,3,4:1,0:4,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,

but the often used function "extract.gt" misses the DP values and only returns items highlighted in green.

> dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
> dp[1:4,1:6]
MZ195912 MZ195913 MZ195914 MZ195915 MZ195916 MZ195930
Chr10_nm_RagTag_187 NA NA NA NA NA NA
Chr10_nm_RagTag_188 NA NA NA NA NA NA
Chr10_nm_RagTag_194 NA NA NA NA NA NA
Chr10_nm_RagTag_195 NA NA NA NA NA NA

I think this is because bcftools filtering placed DP info in the wrong field and when data are converted to the required R format, the info is lost. If I do a head on the converted data I see no DP info:

head(vcf)
[1] "***** Object of class 'vcfR' *****"
[1] "***** Meta section *****"
[1] "##fileformat=VCFv4.2"
[1] "##FILTER=<ID=PASS,Description="All filters passed">"
[1] "##bcftoolsVersion=1.12+htslib-1.12"
[1] "##bcftoolsCommand=mpileup -Ou -I -B -C 50 -a QS -a AD --threads 32 - [Truncated]"
[1] "##reference=file:///home/bsimison/Projects/Neotoma/96_LC-genomes/Nbr [Truncated]"
[1] "##contig=<ID=Chr10_nm_RagTag,length=45123732>"
[1] "First 6 rows."
[1]
[1] "***** Fixed section *****"
CHROM POS ID REF ALT QUAL FILTER
[1,] "Chr10_nm_RagTag" "187" NA "A" "T" "4.82181" "LowQual"
[2,] "Chr10_nm_RagTag" "188" NA "G" "A" "42.0559" "LowQual"
[3,] "Chr10_nm_RagTag" "194" NA "G" "C" "3.74867" "LowQual"
[4,] "Chr10_nm_RagTag" "195" NA "A" "C" "5.61943" "LowQual"
[5,] "Chr10_nm_RagTag" "488" NA "G" "C" "32.0871" "PASS"
[6,] "Chr10_nm_RagTag" "518" NA "T" "C" "157.534" "PASS"
[1]
[1] "***** Genotype section *****"
FORMAT MZ195912 MZ195913
[1,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "0/0:0,3,4:1,0:4,0"
[2,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[3,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[4,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[5,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[6,] "GT:PL:AD:QS" "1/1:29,3,0:0,1:0,29" "./.:0,0,0:0,0:0,0"
MZ195914 MZ195915 MZ195916
[1,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[2,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[3,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[4,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[5,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[6,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
[1] "First 6 columns only."
[1]
[1] "Unique GT formats:"
[1] "GT:PL:AD:QS"
[1]

you can see that only "GT:PL:AD:QS" are captured, but not DP.
Anybody know what's going on here?

wbsimey 08-23-2021 01:24 PM

Brian Knaus, on his vcfR Git site, helped me figure this out. Turns out if you want to include Hardy Weinberg in your population variant calling with bcftools, you must include some annotations with the -a option for bcftools mpileup. Any annotations not listed with -a will not be included in the 'gt' slot; they will be in the 'info' slot so "extract.gt" will ignore any annotations not in the 'gt' slot. Alternatively, you can try 'extract.info', but that did not pull DP stats for each individual for me.

The bcftools commands that solved the above issue was:

"bcftools mpileup -Ou -I -B -C 50 -a QS -a AD -a DP --threads $THREADS -f $REF $bam_files"
call_cmd="bcftools call -mv -Oz --threads $THREADS -o $OUTFILE"


All times are GMT -8. The time now is 02:18 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.