Fluctuating Genotype Quality scores

IanWilson

Junior Member

Join Date: Apr 2014

Posts: 2
- Share
- Tweet
#1

Fluctuating Genotype Quality scores

09-16-2014, 11:24 AM

Hi everyone,

I have DNA reads generated on the SOLiD 4 sequencer and Illumina MiSeq platform. The reads are from multiple strains of 3 species. Only one platform has been used for each strain. I've aligned the reads to their respective reference sequences using BWA and then called SNPs using samtools (commands below). So far, so good.

I've come to plot divergence from the reference sequences as a function of genotype quality (basically, the number of times each GQ score is seen relative to genome size). Indels have been removed from the SAMtools output and I've taken the values from SAMtools' resulting vcf file and extracted the 'GQ' field's scores for each SNP. The vast majority of GQ scores are an encouraging '99'. However, for the relative few that have been given scores below this, I see very strange patterns in their frequency. Every 7th GQ score (3, 10, 17 etc) the frequency shoots up massively, then drops down again.

I can't find any reason for this or anyone that's found this before. Perhaps I'm missing something very obvious but any help would be greatly appreciated! Anybody know why this might happen?

Thanks very much,

Ian

###########
###########
###########

Commands used:

bwa aln -n 4 -R 10 -t 10 -c ./REF_SEQ_SOLID/BWA_HM1 ~/Hist_Comparisons/2_SRA_FILES/HM1_A.fastq > ./READS/HM1_A.sai;

bwa samse -r '@RG\tID:HM1\tPL:SOLiD\tLB:S\tSM:HM1_A' -n 2 ./REF_SEQ_SOLID/BWA_HM1 ./READS/HM1_A.sai ~/Hist_Comparisons/2_SRA_FILES/HM1_A.fastq > ./READS/HM1_A.sam

awk '$2 != 4 {print $0}' HM1_A.sam > HM1_Afiltered.sam;

cat HM1_Afiltered.sam | perl -ne 'if ($_ =~ /^@/) {print $_} elsif ($_ =~ /^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+XT\:A\:U/) {print $_}' > HM1_Afiltered_UNIQ.sam;

samtools view -uT ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ.sam | \
samtools sort - ~/Hist_Comparisons/4_GATK/STRAINS/HM1_A/HM1_Afiltered_UNIQ_sort;
samtools index ~/Hist_Comparisons/4_GATK/STRAINS/HM1_A/HM1_Afiltered_UNIQ_sort.bam;

## call bases and snps
## Make a vcf and a bcf file containing SNPs, indels and non variants (i.e. sites without polymorphisms):

samtools mpileup -uD -f ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ_sort.bam | bcftools view -bcg - > HM1_samtools.raw.bcf
samtools mpileup -uD -f ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ_sort.bam | bcftools view -cg - > HM1_samtools.raw.vcf

## calculate q50 and q95 of coverage
cat HM1_samtools.raw.vcf | grep -v 'INDEL' | perl -ne 'if ($_ =~ /^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+([0-9]*\.?[0-9]+)\s+\S+\s+DP=(\d+)/) {if ($1 >= 20) {print $2."\n"}}' | sort -n | perl -e '$q50=.5; $q95=.95; @l=<>; print "$l[int($q50*$#l)]"; print "$l[int($q95*$#l)]";'

## 34
## 141

## Make a vcf file containing only SNPs with a minimum quality score of 20, a minimum depth of 5, a maximum depth of 141,
## SNPs farther than -w bp from a gap, using a window size of 30.
## These are the closest values it’s possible to get now:

bcftools view HM1_samtools.raw.bcf | vcfutils.pl varFilter -Q 20 -d 5 -D 141 -w 5 -W 30 > HM1_filtered.vcf;

Last edited by IanWilson; 09-17-2014, 12:59 AM.
Tags: bcftools, genotype quality, samtools, vcf

Previous template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
Yesterday, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 48 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Fluctuating Genotype Quality scores

Latest Articles

ad_right_rmr

News