I'm generating FASTQ consensus file with the samtools mpileup | bcftools view | vcfutils.pl vcf2fq chain from BAM files generated with BWA. These BAM files are alignments of Illumina reads to a mitochondrial (haploid) reference sequence.
What I would like is a FASTQ file where the consensus base calls are made based on the most common (high quality) base in the reads at that site. That is, I don't want any ambiguity codes in my sequence. Alternatively, I'd like to be able to set a frequency threshold that a base has to exist at to be called, otherwise it is set to N.
I have played with all of the parameters to samtools mpileup, bcftools view, and vcfutils.pl vcf2fq focusing on bcftools view's '-p', the variant probability threshold, but I can't get it to do what I want as it seems to be designed for SNP calling in diploid data and thus potentially valid heterozygous bases are called as such rather than being able to force a "homozygous" call of the most common base.
As an example, I have a site that has 8 C's and 17 T's, all at high quality. I would like the consensus sequence of this site to have T rather than Y.
In a perfect world, if I had a site with 8 C's and 8 T's, I'd like the option of making it either an N or calling it whatever base the reference has.
If the mpileup -> bcftools -> vcf2fq chain won't do this, is there some other command line software that will do this from a BAM file? I need it to be command line driven because I'm doing this for 100s of BAM files as part of a batch pipeline I've written in R.
Thanks in advance for any help.
What I would like is a FASTQ file where the consensus base calls are made based on the most common (high quality) base in the reads at that site. That is, I don't want any ambiguity codes in my sequence. Alternatively, I'd like to be able to set a frequency threshold that a base has to exist at to be called, otherwise it is set to N.
I have played with all of the parameters to samtools mpileup, bcftools view, and vcfutils.pl vcf2fq focusing on bcftools view's '-p', the variant probability threshold, but I can't get it to do what I want as it seems to be designed for SNP calling in diploid data and thus potentially valid heterozygous bases are called as such rather than being able to force a "homozygous" call of the most common base.
As an example, I have a site that has 8 C's and 17 T's, all at high quality. I would like the consensus sequence of this site to have T rather than Y.
In a perfect world, if I had a site with 8 C's and 8 T's, I'd like the option of making it either an N or calling it whatever base the reference has.
If the mpileup -> bcftools -> vcf2fq chain won't do this, is there some other command line software that will do this from a BAM file? I need it to be command line driven because I'm doing this for 100s of BAM files as part of a batch pipeline I've written in R.
Thanks in advance for any help.
Comment