Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fluctuating Genotype Quality scores

    Hi everyone,

    I have DNA reads generated on the SOLiD 4 sequencer and Illumina MiSeq platform. The reads are from multiple strains of 3 species. Only one platform has been used for each strain. I've aligned the reads to their respective reference sequences using BWA and then called SNPs using samtools (commands below). So far, so good.

    I've come to plot divergence from the reference sequences as a function of genotype quality (basically, the number of times each GQ score is seen relative to genome size). Indels have been removed from the SAMtools output and I've taken the values from SAMtools' resulting vcf file and extracted the 'GQ' field's scores for each SNP. The vast majority of GQ scores are an encouraging '99'. However, for the relative few that have been given scores below this, I see very strange patterns in their frequency. Every 7th GQ score (3, 10, 17 etc) the frequency shoots up massively, then drops down again.

    I can't find any reason for this or anyone that's found this before. Perhaps I'm missing something very obvious but any help would be greatly appreciated! Anybody know why this might happen?

    Thanks very much,

    Ian


    ###########
    ###########
    ###########

    Commands used:

    bwa aln -n 4 -R 10 -t 10 -c ./REF_SEQ_SOLID/BWA_HM1 ~/Hist_Comparisons/2_SRA_FILES/HM1_A.fastq > ./READS/HM1_A.sai;

    bwa samse -r '@RG\tID:HM1\tPL:SOLiD\tLB:S\tSM:HM1_A' -n 2 ./REF_SEQ_SOLID/BWA_HM1 ./READS/HM1_A.sai ~/Hist_Comparisons/2_SRA_FILES/HM1_A.fastq > ./READS/HM1_A.sam

    awk '$2 != 4 {print $0}' HM1_A.sam > HM1_Afiltered.sam;

    cat HM1_Afiltered.sam | perl -ne 'if ($_ =~ /^@/) {print $_} elsif ($_ =~ /^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+XT\:A\:U/) {print $_}' > HM1_Afiltered_UNIQ.sam;

    samtools view -uT ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ.sam | \
    samtools sort - ~/Hist_Comparisons/4_GATK/STRAINS/HM1_A/HM1_Afiltered_UNIQ_sort;
    samtools index ~/Hist_Comparisons/4_GATK/STRAINS/HM1_A/HM1_Afiltered_UNIQ_sort.bam;


    ## call bases and snps
    ## Make a vcf and a bcf file containing SNPs, indels and non variants (i.e. sites without polymorphisms):

    samtools mpileup -uD -f ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ_sort.bam | bcftools view -bcg - > HM1_samtools.raw.bcf
    samtools mpileup -uD -f ~/Hist_Comparisons/3_BWA/REF_SEQ_SOLID/BWA_HM1.fa HM1_Afiltered_UNIQ_sort.bam | bcftools view -cg - > HM1_samtools.raw.vcf

    ## calculate q50 and q95 of coverage
    cat HM1_samtools.raw.vcf | grep -v 'INDEL' | perl -ne 'if ($_ =~ /^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+([0-9]*\.?[0-9]+)\s+\S+\s+DP=(\d+)/) {if ($1 >= 20) {print $2."\n"}}' | sort -n | perl -e '$q50=.5; $q95=.95; @l=<>; print "$l[int($q50*$#l)]"; print "$l[int($q95*$#l)]";'

    ## 34
    ## 141

    ## Make a vcf file containing only SNPs with a minimum quality score of 20, a minimum depth of 5, a maximum depth of 141,
    ## SNPs farther than -w bp from a gap, using a window size of 30.
    ## These are the closest values it’s possible to get now:

    bcftools view HM1_samtools.raw.bcf | vcfutils.pl varFilter -Q 20 -d 5 -D 141 -w 5 -W 30 > HM1_filtered.vcf;
    Last edited by IanWilson; 09-17-2014, 12:59 AM.

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    Yesterday, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
59 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
57 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 09:21 AM
0 responses
48 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-04-2024, 09:00 AM
0 responses
55 views
0 likes
Last Post seqadmin  
Working...
X