Hi all, I hope someone can help, have you come across anything like this yourself? I have several bacterial strain BWA genome assemblies. I am using Samtools to call the consensus and have done it twice using 2 different sets of options and come up against 2 problems:
Program: samtools-0.1.18
Command used to call FIRST consensus: samtools mpileup -q 30 -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
(All FASTQ files converted to FASTA for phylogenetic analysis)
PROBLEM 1: Core genome variable sites were found to have significant homoplasies (significant P values in Splitstree) which was not expected. On closer investigation of the sequence, it was found that in regions with 2 or more SNPs that were closely situated together in relation to the reference, the program had called the reference sequence instead of the SNPs, leading to multiple strains with the same sequence at these sites when in fact they were variants! The SNPs looked to be genuine (In IGV and the VCF file we produced for all sites, there is adequate read depth, mapping quality etc)
To try to rectify this, we disabled BAQ in the next set of consensus calling:
Program: samtools-0.1.18
Command used to call SECOND consensus: samtools mpileup -q 30 -B -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
(All FASTQ files converted to FASTA for phylogenetic analysis)
PROBLEM2: Now we find that this seems to have rectified the problem where the reference sequence has been called instead of the variants (2 SNPS close together) , but instead, there are an increased number of uncalled bases. On investigation, these seem to be at sites where there may be some reads with a T, for example, and a few reads with an A (eg 2/8 reads are A, while the remaining reads are T, and the reference is T!). In some strains, it's calling 'N', in some, it's calling 'W' and in others it calls the 'T'! (But they all have varying levels of read depth and mapping quality)
See results from the VCF file at one example site (618760) where the ref base is 'T':
Strain 1 uncalled base 'N':
REF 618760 . T . 54 . DP=10;VDB=0.0571;AF1=0;AC1=0;DP4=0,8,0,0;MQ=58;FQ=-51 PL 0[/FONT]
Strain 2 ambiguous base 'W':
REF 618760 . T . 72 . DP=19;VDB=0.0027;AF1=0;AC1=0;DP4=1,13,0,0;MQ=60;FQ=-69 PL 0
Strain 3 called base 'T':
REF 618760 . T . 96 . DP=25;VDB=0.0672;AF1=0;AC1=0;DP4=1,21,0,0;MQ=60;FQ=-93 PL 0
Questions:
a) What exactly is causing this variation in base calling among the strains in the second set of consensus sequences? (disabled BAQ or is it something I can tell from the VCF file?)
b) What does VDB flag stand for? Can't find any information on it? (IF we remove it from the VCF file and recall the consensus, it will call the base, but we don't want to do this at all~25000 sites with a VDB flag in case it introduces error!)
Any help. is very much appreciated, I just can't understand it, especially why it's having trouble calling the sites with mixed bases, when the read depth and mapping quality are good, and by looking at the assembly you can clearly see which base it should call!
Best Wishes,
Laura Spoor
Comment