Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Generate consensus sequence with Samtools mpileup? Heisman Bioinformatics 4 10-28-2013 05:57 PM
how to use samtools to get consensus sequence? f0415007 Bioinformatics 1 09-26-2011 11:59 PM
samtools/mpileup heterozygous SNPs calling combiochem Bioinformatics 4 08-02-2011 08:05 AM
calling Heterozygous SNPs with samtools mpileup egatti Bioinformatics 1 07-21-2011 09:16 AM
question on SAMtools consensus calling orionzhou Bioinformatics 9 11-16-2010 03:42 PM

Thread Tools
Old 10-06-2011, 08:40 AM   #1
Location: Fife, Scotland

Join Date: Mar 2010
Posts: 18
Unhappy Problem calling consensus sequence with Samtools mpileup?

Hi all, I hope someone can help, have you come across anything like this yourself? I have several bacterial strain BWA genome assemblies. I am using Samtools to call the consensus and have done it twice using 2 different sets of options and come up against 2 problems:

Program: samtools-0.1.18

Command used to call FIRST consensus: samtools mpileup -q 30 -uf ref.fa aln.bam | bcftools view -cg - | vcf2fq > cns.fq
(All FASTQ files converted to FASTA for phylogenetic analysis)

PROBLEM 1: Core genome variable sites were found to have significant homoplasies (significant P values in Splitstree) which was not expected. On closer investigation of the sequence, it was found that in regions with 2 or more SNPs that were closely situated together in relation to the reference, the program had called the reference sequence instead of the SNPs, leading to multiple strains with the same sequence at these sites when in fact they were variants! The SNPs looked to be genuine (In IGV and the VCF file we produced for all sites, there is adequate read depth, mapping quality etc)

To try to rectify this, we disabled BAQ in the next set of consensus calling:
Program: samtools-0.1.18

Command used to call SECOND consensus: samtools mpileup -q 30 -B -uf ref.fa aln.bam | bcftools view -cg - | vcf2fq > cns.fq
(All FASTQ files converted to FASTA for phylogenetic analysis)

PROBLEM2: Now we find that this seems to have rectified the problem where the reference sequence has been called instead of the variants (2 SNPS close together) , but instead, there are an increased number of uncalled bases. On investigation, these seem to be at sites where there may be some reads with a T, for example, and a few reads with an A (eg 2/8 reads are A, while the remaining reads are T, and the reference is T!). In some strains, it's calling 'N', in some, it's calling 'W' and in others it calls the 'T'! (But they all have varying levels of read depth and mapping quality)

See results from the VCF file at one example site (618760) where the ref base is 'T':
Strain 1 uncalled base 'N':
REF 618760 . T . 54 . DP=10;VDB=0.0571;AF1=0;AC1=0;DP4=0,8,0,0;MQ=58;FQ=-51 PL 0[/FONT]
Strain 2 ambiguous base 'W':
REF 618760 . T . 72 . DP=19;VDB=0.0027;AF1=0;AC1=0;DP4=1,13,0,0;MQ=60;FQ=-69 PL 0
Strain 3 called base 'T':
REF 618760 . T . 96 . DP=25;VDB=0.0672;AF1=0;AC1=0;DP4=1,21,0,0;MQ=60;FQ=-93 PL 0


a) What exactly is causing this variation in base calling among the strains in the second set of consensus sequences? (disabled BAQ or is it something I can tell from the VCF file?)
b) What does VDB flag stand for? Can't find any information on it? (IF we remove it from the VCF file and recall the consensus, it will call the base, but we don't want to do this at all~25000 sites with a VDB flag in case it introduces error!)

Any help. is very much appreciated, I just can't understand it, especially why it's having trouble calling the sites with mixed bases, when the read depth and mapping quality are good, and by looking at the assembly you can clearly see which base it should call!

Best Wishes,

Laura Spoor
Lspoor is offline   Reply With Quote
Old 10-06-2011, 02:32 PM   #2
Senior Member
Location: St. Louis

Join Date: Dec 2010
Posts: 535

Try rerunning it and include the -A option. I bet your variant reads are anomalous and hence not being utilized. Evidence for this is that:

REF 618760 . T . 54 . DP=10;DP4=0,8,0,0;
REF 618760 . T . 72 . DP=19;DP4=1,13,0,0;

REF 618760 . T . 96 . DP=25;DP4=1,21,0,0;

The read counts in DP4 do not add up to the read counts in DP, and no variant alleles seem to be included.

I have not been able to figure out what VDB means.
Heisman is offline   Reply With Quote
Old 10-06-2011, 03:19 PM   #3
Senior Member
Location: San Diego

Join Date: May 2008
Posts: 912

It's normal for the DP4 and DP to not match up, as DP4 is reads that pass some filter, I forget what, maybe MQ?

What's odd are the dots for the alt call. I've never seen that, and I don't know what its suppsosed to mean. Maybe if you posted the sam entries for all the reads crossing one of those points.

My samtoools/bcftools made vcf says this about vdb:

##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
I'm guessing it means "Hey, this might be a misalignment due to a nearby SNP, and not a real variant"
swbarnes2 is offline   Reply With Quote
Old 10-24-2011, 08:43 AM   #4
Location: Fife, Scotland

Join Date: Mar 2010
Posts: 18

I think the dots in the VCF are classed as a missing values, which I think is because I've generated a VCF file for all sites, rather than just the variants, so it includes regions of invariant sites and also non-mapping regions.

In the end, I found that invoking the -E option in samtools mpileup (this implements extended BAQ computation) increased the sensitivity so that SNPs that were close to each other were called, I guess according to the manual it will decrease specificity slightly but at least I'm not getting a load of false positive homoplasic sites!

I haven't been able to find much information on variance distance bias, other than in the release notes for samtools-0.1.18 where it says:
"Implemented variant distance bias as a new filter (by Petr Danecek)."
Lspoor is offline   Reply With Quote

consensus sequence, mpileup, samtools, vdb

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 02:32 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO