Hi,
I am trying to identify indels and am having trouble understanding the values in column 5 and 6 of my pileup output (see example of lines in final_aln_ivcf.pileup). As I understand column 5 in final_aln_ivcf.pileup should be the Phred-scaled consensus quality and column 6 the SNP quality (i.e. the Phred-scaled probability of the consensus being identical to the reference) see http://samtools.sourceforge.net/samtools.shtml.
However what confuses me is that the values in column 5 and 6 are much higher than I would expect for Phred-scaled probabiliies (see below example of lines in final_aln_ivcf.pileup).
It looks as though in this pileup, a reliable call for an indel has got an SNP quality of >1000 (see chr1 470390), SNP quality of around 500 seems to indicate half the sequences are like the reference and half show the indel (see chr1 427504) while SNP quality scores below 50 seem to suggest almost all sequences are identical to reference with few sequences showing presence of an indel (see chr1 453369, 456726).
Now I am wondering whether/or not I have made a mistake somewhere in the analysis, whether/or not my interpretation of this result is right and how to explain these strange very high values for Phred-based probability of consensus and SNP quality....Can anybody help?
Quick summary how the analysis was conducted to get the pileup below:
I initially used bwa to align paired-end solexa data to a reference:
bwa index -a bwtsw reference.fasta
bwa aln reference.fasta 1.fastq > aln.sai1
bwa aln reference.fasta 2.fastq > aln.sai2
bwa sampe -a 250 reference.fasta aln.sai1 aln.sai2 1.fastq 2.fastq > aln.sam
then converted the .sam file to .bam file, followed by sorting and indexing using samtools:
samtools view -bt reference.fasta aln.sam -o aln.bam
samtools sort aln.bam aln_sorted.bam
samtools index aln_sorted.bam
Then I used samtools pileup to extract the indels:
samtools pileup -i -vcf reference.fasta aln_sorted.bam > aln_ivcf.pileup
samtools.pl varFilter aln_ivcf.pileup | awk '($3=="*" && $6>=20 && $7>=20 && $8>=10)'> final_aln_ivcf.pileup
examples of lines in final_aln_ivcf.pileup
chr_1 427504 * +T/+T 9 498 57 22 +T * 9 13 0 0 0
chr_1 441613 * */-C 22 22 53 37 * -C 34 3 0 0 0
chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0
chr_1 456726 * */-T 27 27 52 38 * -T 36 1 1 1 0
chr_1 470390 * -C/-C 111 1021 44 28 -C * 24 4 0 0
column headers in final_aln_ivcf.pileup according to http://samtools.sourceforge.net/samtools.shtml with option -c are as follows:
chromosome name, coordinate, a star, the genotype, consensus quality, SNP quality, RMS mapping quality, # covering reads, the first allele, the second allele, # reads supporting the first allele, # reads supporting the second allele and # reads containing indels different from the top two alleles.
I am trying to identify indels and am having trouble understanding the values in column 5 and 6 of my pileup output (see example of lines in final_aln_ivcf.pileup). As I understand column 5 in final_aln_ivcf.pileup should be the Phred-scaled consensus quality and column 6 the SNP quality (i.e. the Phred-scaled probability of the consensus being identical to the reference) see http://samtools.sourceforge.net/samtools.shtml.
However what confuses me is that the values in column 5 and 6 are much higher than I would expect for Phred-scaled probabiliies (see below example of lines in final_aln_ivcf.pileup).
It looks as though in this pileup, a reliable call for an indel has got an SNP quality of >1000 (see chr1 470390), SNP quality of around 500 seems to indicate half the sequences are like the reference and half show the indel (see chr1 427504) while SNP quality scores below 50 seem to suggest almost all sequences are identical to reference with few sequences showing presence of an indel (see chr1 453369, 456726).
Now I am wondering whether/or not I have made a mistake somewhere in the analysis, whether/or not my interpretation of this result is right and how to explain these strange very high values for Phred-based probability of consensus and SNP quality....Can anybody help?
Quick summary how the analysis was conducted to get the pileup below:
I initially used bwa to align paired-end solexa data to a reference:
bwa index -a bwtsw reference.fasta
bwa aln reference.fasta 1.fastq > aln.sai1
bwa aln reference.fasta 2.fastq > aln.sai2
bwa sampe -a 250 reference.fasta aln.sai1 aln.sai2 1.fastq 2.fastq > aln.sam
then converted the .sam file to .bam file, followed by sorting and indexing using samtools:
samtools view -bt reference.fasta aln.sam -o aln.bam
samtools sort aln.bam aln_sorted.bam
samtools index aln_sorted.bam
Then I used samtools pileup to extract the indels:
samtools pileup -i -vcf reference.fasta aln_sorted.bam > aln_ivcf.pileup
samtools.pl varFilter aln_ivcf.pileup | awk '($3=="*" && $6>=20 && $7>=20 && $8>=10)'> final_aln_ivcf.pileup
examples of lines in final_aln_ivcf.pileup
chr_1 427504 * +T/+T 9 498 57 22 +T * 9 13 0 0 0
chr_1 441613 * */-C 22 22 53 37 * -C 34 3 0 0 0
chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0
chr_1 456726 * */-T 27 27 52 38 * -T 36 1 1 1 0
chr_1 470390 * -C/-C 111 1021 44 28 -C * 24 4 0 0
column headers in final_aln_ivcf.pileup according to http://samtools.sourceforge.net/samtools.shtml with option -c are as follows:
chromosome name, coordinate, a star, the genotype, consensus quality, SNP quality, RMS mapping quality, # covering reads, the first allele, the second allele, # reads supporting the first allele, # reads supporting the second allele and # reads containing indels different from the top two alleles.
Comment