Hi everyone,
I have DNA reads generated on the SOLiD 4 sequencer and Illumina MiSeq platform. The reads are from multiple strains of 3 species. Only one platform has been used for each strain. I've aligned the reads to their respective reference sequences using BWA and then called SNPs using samtools (commands below). So far, so good.
I've come to plot divergence from the reference sequences as a function of genotype quality (basically, the number of times each GQ score is seen relative to genome size). Indels have been removed from the SAMtools output and I've taken the values from SAMtools' resulting vcf file and extracted the 'GQ' field's scores for each SNP. The vast majority of GQ scores are an encouraging '99'. However, for the relative few that have been given scores below this, I see very strange patterns in their frequency. Every 7th GQ score (3, 10, 17 etc) the frequency shoots up massively, then drops down again.
I can't find any reason for this or anyone that's found this before. Perhaps I'm missing something very obvious but any help would be greatly appreciated! Anybody know why this might happen?
Thanks very much,
Ian
###########
###########
###########
Commands used:
bwa aln -n 4 -R 10 -t 10 -c ./REF_SEQ_SOLID/BWA_HM1 ~/Hist_Comparisons/2_SRA_FILES/HM1_A.fastq > ./READS/HM1_A.sai;
bwa samse -r '@RG\tID:HM1\tPL:SOLiD\tLB:S\tSM:HM1_A' -n 2 ./REF_SEQ_SOLID/BWA_HM1 ./READS/HM1_A.sai ~/Hist_Comparisons/2_SRA_FILES/HM1_A.fastq > ./READS/HM1_A.sam
awk '$2 != 4 {print $0}' HM1_A.sam > HM1_Afiltered.sam;
cat HM1_Afiltered.sam | perl -ne 'if ($_ =~ /^@/) {print $_} elsif ($_ =~ /^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+XT\:A\:U/) {print $_}' > HM1_Afiltered_UNIQ.sam;
samtools view -uT ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ.sam | \
samtools sort - ~/Hist_Comparisons/4_GATK/STRAINS/HM1_A/HM1_Afiltered_UNIQ_sort;
samtools index ~/Hist_Comparisons/4_GATK/STRAINS/HM1_A/HM1_Afiltered_UNIQ_sort.bam;
## call bases and snps
## Make a vcf and a bcf file containing SNPs, indels and non variants (i.e. sites without polymorphisms):
samtools mpileup -uD -f ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ_sort.bam | bcftools view -bcg - > HM1_samtools.raw.bcf
samtools mpileup -uD -f ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ_sort.bam | bcftools view -cg - > HM1_samtools.raw.vcf
## calculate q50 and q95 of coverage
cat HM1_samtools.raw.vcf | grep -v 'INDEL' | perl -ne 'if ($_ =~ /^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+([0-9]*\.?[0-9]+)\s+\S+\s+DP=(\d+)/) {if ($1 >= 20) {print $2."\n"}}' | sort -n | perl -e '$q50=.5; $q95=.95; @l=<>; print "$l[int($q50*$#l)]"; print "$l[int($q95*$#l)]";'
## 34
## 141
## Make a vcf file containing only SNPs with a minimum quality score of 20, a minimum depth of 5, a maximum depth of 141,
## SNPs farther than -w bp from a gap, using a window size of 30.
## These are the closest values it’s possible to get now:
bcftools view HM1_samtools.raw.bcf | vcfutils.pl varFilter -Q 20 -d 5 -D 141 -w 5 -W 30 > HM1_filtered.vcf;
I have DNA reads generated on the SOLiD 4 sequencer and Illumina MiSeq platform. The reads are from multiple strains of 3 species. Only one platform has been used for each strain. I've aligned the reads to their respective reference sequences using BWA and then called SNPs using samtools (commands below). So far, so good.
I've come to plot divergence from the reference sequences as a function of genotype quality (basically, the number of times each GQ score is seen relative to genome size). Indels have been removed from the SAMtools output and I've taken the values from SAMtools' resulting vcf file and extracted the 'GQ' field's scores for each SNP. The vast majority of GQ scores are an encouraging '99'. However, for the relative few that have been given scores below this, I see very strange patterns in their frequency. Every 7th GQ score (3, 10, 17 etc) the frequency shoots up massively, then drops down again.
I can't find any reason for this or anyone that's found this before. Perhaps I'm missing something very obvious but any help would be greatly appreciated! Anybody know why this might happen?
Thanks very much,
Ian
###########
###########
###########
Commands used:
bwa aln -n 4 -R 10 -t 10 -c ./REF_SEQ_SOLID/BWA_HM1 ~/Hist_Comparisons/2_SRA_FILES/HM1_A.fastq > ./READS/HM1_A.sai;
bwa samse -r '@RG\tID:HM1\tPL:SOLiD\tLB:S\tSM:HM1_A' -n 2 ./REF_SEQ_SOLID/BWA_HM1 ./READS/HM1_A.sai ~/Hist_Comparisons/2_SRA_FILES/HM1_A.fastq > ./READS/HM1_A.sam
awk '$2 != 4 {print $0}' HM1_A.sam > HM1_Afiltered.sam;
cat HM1_Afiltered.sam | perl -ne 'if ($_ =~ /^@/) {print $_} elsif ($_ =~ /^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+XT\:A\:U/) {print $_}' > HM1_Afiltered_UNIQ.sam;
samtools view -uT ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ.sam | \
samtools sort - ~/Hist_Comparisons/4_GATK/STRAINS/HM1_A/HM1_Afiltered_UNIQ_sort;
samtools index ~/Hist_Comparisons/4_GATK/STRAINS/HM1_A/HM1_Afiltered_UNIQ_sort.bam;
## call bases and snps
## Make a vcf and a bcf file containing SNPs, indels and non variants (i.e. sites without polymorphisms):
samtools mpileup -uD -f ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ_sort.bam | bcftools view -bcg - > HM1_samtools.raw.bcf
samtools mpileup -uD -f ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ_sort.bam | bcftools view -cg - > HM1_samtools.raw.vcf
## calculate q50 and q95 of coverage
cat HM1_samtools.raw.vcf | grep -v 'INDEL' | perl -ne 'if ($_ =~ /^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+([0-9]*\.?[0-9]+)\s+\S+\s+DP=(\d+)/) {if ($1 >= 20) {print $2."\n"}}' | sort -n | perl -e '$q50=.5; $q95=.95; @l=<>; print "$l[int($q50*$#l)]"; print "$l[int($q95*$#l)]";'
## 34
## 141
## Make a vcf file containing only SNPs with a minimum quality score of 20, a minimum depth of 5, a maximum depth of 141,
## SNPs farther than -w bp from a gap, using a window size of 30.
## These are the closest values it’s possible to get now:
bcftools view HM1_samtools.raw.bcf | vcfutils.pl varFilter -Q 20 -d 5 -D 141 -w 5 -W 30 > HM1_filtered.vcf;