Hi -
I am running a comparison between running samtools SNP calling in the multi sample mode, vs using it on each file individually. I seem to be different getting per-sample genotypes when using one method compared to the other, I wanted to know if anyone had experienced similar difficulties before.
Both methods of calling SNPs use same command, except in one case I use the -b command and provide paths to the bam files I want, and in the other case I pass each file individually and aggregate them later. Just to be on the same page this is the command:
samtools mpileup -uf ref.fa aln1.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
where in some cases, instead of aln1.bam I provide -b list, which contains a list of bam files.
When using the multi-sample way (-b command), the vcf file also contains the genotype information for each individual sample. The output format it comes in is GT:PLP:GQ, and the issue that I have is the GT seems to be wrong when I use the multi-sample call. For example, one of my samples is said to have a 1/1 (ALT,ALT) genotype at a specific position, BUT when I review in IGV this is clearly incorrect, furthermore when I run the sample on it's own (not multi-sample mode) this positions DOES NOT show up as containing a SNP. So there is something going on when I pass a list, that seems to mess up the GT (that's what I've concluded).
I think that it's a really weird error. I am trying ti diagnose what the problem could be and am wondering if anyone else has noticed it before.
I am running a comparison between running samtools SNP calling in the multi sample mode, vs using it on each file individually. I seem to be different getting per-sample genotypes when using one method compared to the other, I wanted to know if anyone had experienced similar difficulties before.
Both methods of calling SNPs use same command, except in one case I use the -b command and provide paths to the bam files I want, and in the other case I pass each file individually and aggregate them later. Just to be on the same page this is the command:
samtools mpileup -uf ref.fa aln1.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
where in some cases, instead of aln1.bam I provide -b list, which contains a list of bam files.
When using the multi-sample way (-b command), the vcf file also contains the genotype information for each individual sample. The output format it comes in is GT:PLP:GQ, and the issue that I have is the GT seems to be wrong when I use the multi-sample call. For example, one of my samples is said to have a 1/1 (ALT,ALT) genotype at a specific position, BUT when I review in IGV this is clearly incorrect, furthermore when I run the sample on it's own (not multi-sample mode) this positions DOES NOT show up as containing a SNP. So there is something going on when I pass a list, that seems to mess up the GT (that's what I've concluded).
I think that it's a really weird error. I am trying ti diagnose what the problem could be and am wondering if anyone else has noticed it before.
Comment