We are doing SNP discovery from Illumina reads assembled to a set of ~80 loci, from ~25 individuals. After generating the BAM and VCF files with Samtools, we use vcf2fq to generate the Fastq files from the VCF files. In general this works well, but I have found that when there is only a single read at a site, if it differs from the reference sequence, it is still called as the reference sequence rather than the alternate base (or N). Does anyone know what parameter would be changed to generate the correct Fastq sequence even when there is only one read, or to change it to an N when it differs from the reference with only one read?
I'm not the person who put all of these scripts together, so I may have explained our analysis pipeline incorrectly, but the important part is getting the appropriate function out of vcf2fq.
I'm not the person who put all of these scripts together, so I may have explained our analysis pipeline incorrectly, but the important part is getting the appropriate function out of vcf2fq.