Hello:
I am trying to discover SNPs among 24 samples. From all sequence reads combined I have constructed a reference genome (about 205 Mb) and then aligned the individual reads back to the reference genome with bwa-0.6.1. Thus for sample "S3" the commands looked like this.
bwa aln -t 2 Reference02_singleChromosome_index S03FR-allCT.fastq > aln_S03FR-CT_Ref02.sai
bwa samse Reference02_singleChromosome_index aln_S03FR-CT_Eg-Ref02.sai S03FR-allCT.fastq > S03FR-CT_Ref02.sam
After bwa, I used samtools-0.1.18 to call variants:
samtools view -uS S03FR-CT_Ref02.sam > S03FR-CT_Ref02.bam
samtools sort S03FR-CT_Ref02.bam S03FR-CT_Ref02-sorted
samtools index S03FR-CT_Ref02-sorted.bam
So far so good , except that about 40% of the reads could not be aligned... I wonder where they went .
I did so for all 24 samples. Then with mpileup
samtools mpileup -Buf Reference02_singleChromosome.fasta S03FR-CT_Eg-Ref02-sorted.bam S04...all 24 samples | bcftools view -bvcg - > 24_samples.raw.bcf
bcftools view 24_samples.raw.bcf | perl vcfutils.pl varFilter -D100 > 24_samples.flt.vcf
I used IGV-2.0.34 to view the vcf file and the bam files.
What I noticed is that many genuine variants that were called by bwa-samtools and are in the individual -sorted.bam files have not been called by vcftools. The attached file igv_snapshot1.png shows an area where some SNPs are called by both methods... but then next to it are 4 SNPs that are correctly called in the individual alignments, but they have been dropped from the .vcf output. File igv_snapshot2.png shows a SNP that is called heterozygous for all 24 samples in the .vcf output... but actually is based on a very small number of sequence reads, only 4 out of 24 samples are truly heterozygous, for all the other samples there is really no data at all.
What can I improve so that the missing SNPs will show up?
Is there an option in mpileup / vcftools to allow and indicate missing data? It seems to me that the program imputes genotypes for many samples where it actually should not, but should indicate missing data.
Thanks for any help
I am trying to discover SNPs among 24 samples. From all sequence reads combined I have constructed a reference genome (about 205 Mb) and then aligned the individual reads back to the reference genome with bwa-0.6.1. Thus for sample "S3" the commands looked like this.
bwa aln -t 2 Reference02_singleChromosome_index S03FR-allCT.fastq > aln_S03FR-CT_Ref02.sai
bwa samse Reference02_singleChromosome_index aln_S03FR-CT_Eg-Ref02.sai S03FR-allCT.fastq > S03FR-CT_Ref02.sam
After bwa, I used samtools-0.1.18 to call variants:
samtools view -uS S03FR-CT_Ref02.sam > S03FR-CT_Ref02.bam
samtools sort S03FR-CT_Ref02.bam S03FR-CT_Ref02-sorted
samtools index S03FR-CT_Ref02-sorted.bam
So far so good , except that about 40% of the reads could not be aligned... I wonder where they went .
I did so for all 24 samples. Then with mpileup
samtools mpileup -Buf Reference02_singleChromosome.fasta S03FR-CT_Eg-Ref02-sorted.bam S04...all 24 samples | bcftools view -bvcg - > 24_samples.raw.bcf
bcftools view 24_samples.raw.bcf | perl vcfutils.pl varFilter -D100 > 24_samples.flt.vcf
I used IGV-2.0.34 to view the vcf file and the bam files.
What I noticed is that many genuine variants that were called by bwa-samtools and are in the individual -sorted.bam files have not been called by vcftools. The attached file igv_snapshot1.png shows an area where some SNPs are called by both methods... but then next to it are 4 SNPs that are correctly called in the individual alignments, but they have been dropped from the .vcf output. File igv_snapshot2.png shows a SNP that is called heterozygous for all 24 samples in the .vcf output... but actually is based on a very small number of sequence reads, only 4 out of 24 samples are truly heterozygous, for all the other samples there is really no data at all.
What can I improve so that the missing SNPs will show up?
Is there an option in mpileup / vcftools to allow and indicate missing data? It seems to me that the program imputes genotypes for many samples where it actually should not, but should indicate missing data.
Thanks for any help