Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Local drop in Phred score around SNPs in mpileup -- are these false positives?

    Hi all,
    I am consistently seeing the following pattern that alerts me to a systematic problem in some step in my sequencing/filtering pipeline. I used the Illumina MiSeq to deep sequence a mixed bacterial population, and did this in two ways -- I prepared three Nextera XT libraries on the same day (technical replicates) and also six months apart from each other (different-day-replicates). I am aligning to a reference genome using bwa, and then creating an mpileup file using samtools mpileup -f REF.fna > output.mpileup. When call SNPs (I am requiring > 12X coverage and > 90% of the reads must be the same variant base) from the technical replicates, about 99% of positions agree across all three technical replicate libraries (around 400 sites). When I compare the different-day-replicates (same tube of genomic DNA prepared 6 months apart), there are around 300 sites that are discordant. I am now trying to figure out if there is a systematic difference between the discordant and concordant SNPs.
    Many of the SNPs, in both the discordant and concordant categories, follow this pattern:

    Here's a library where position 3448348 is a SNP:

    gi|448814763|ref|NC_000962.3| 3448343 C 25 .,,,,,,...,,,,.,,,,.,,... HCCGEGGHFGGECGHGGGGHGGGFA
    gi|448814763|ref|NC_000962.3| 3448344 G 25 .,,,,,,...,,,,.,,,,.,,... GFFGFGGGFGHHFHGHHHHGHHFCB
    gi|448814763|ref|NC_000962.3| 3448345 G 26 .,,,,,,...,,,,.,,,,.,,...^], GCCGGCGGDGGGGGGGGGGGDGEBA8
    gi|448814763|ref|NC_000962.3| 3448346 C 26 .,,,,,,...,,,,.,,,,.,,..., GBBGGEGG@CGEGGGGGGGG/GAGB9
    gi|448814763|ref|NC_000962.3| 3448347 A 26 .$,,,,,,...,,,,.,,,,.,,..., E>==9999<99F9999999999?9>6
    gi|448814763|ref|NC_000962.3| 3448348 G 25 aaaaaaAAAaaaaAaaaaAaaAAAa >==9999=9919999999999?9>6
    gi|448814763|ref|NC_000962.3| 3448349 C 25 ,,,,,,...,,,,.,,,,.,,..., CCCAGGHHCGEGGHGGGGHGGFGB<
    gi|448814763|ref|NC_000962.3| 3448350 G 26 ,,,,,,...,,,,.,,,,.,,...,^]. CCCGFGGG@GEGGGGGGGGGGFGBGA
    gi|448814763|ref|NC_000962.3| 3448351 G 26 ,,,,,,...,,,,.,,,,.,,...,. BCBGGGGCGGEGGGGGGCGCG?GBGB
    gi|448814763|ref|NC_000962.3| 3448352 G 26 ,$,,,,,...,,,,.,,,,.,,...,. @CCGEGGGGGGFGGGGGFGCGEGBGC
    gi|448814763|ref|NC_000962.3| 3448353 T 26 ,,,,,...,,,,.,,,,.,,...,.^>. CCB>GGDGGEGGGGGGGG@GGGGGA7


    And a library where it's not a SNP:

    gi|448814763|ref|NC_000962.3| 3448343 C 26 ..,..,...,,,.,.,..,,.,.,,, HHDGHGHHHGCGHGH/HH</HGGGDG
    gi|448814763|ref|NC_000962.3| 3448344 G 26 ..,..,...,,,.,.,..,,.,.,,, GGCGGGGAGHHHGHG0GF0FGHFGFH
    gi|448814763|ref|NC_000962.3| 3448345 G 26 ..,..,...,,,.,.,..,,.,.,,, GGAGGGFAGGGGGGG?GG/GGGGGDG
    gi|448814763|ref|NC_000962.3| 3448346 C 26 ..,..,...,,,.,.,..,,.,.,,, GG@GGGCCGGGEGGG/GG/EGGGGFG
    gi|448814763|ref|NC_000962.3| 3448347 A 26 ..,..,...,,,.,.,..,,.,.,,, GCDGGGCGGGG?GGG/GG<CGGGGFG
    gi|448814763|ref|NC_000962.3| 3448348 G 26 .$.,..,...,,,.,.,..c,.,.,,, CFAGGCF@GGGEGGG/GG/GGGGGGG
    gi|448814763|ref|NC_000962.3| 3448349 C 25 .,..,...,,,.,.,..,,.,.,,, GAGGCGFGGFGGGG/GGEEGGGGGG
    gi|448814763|ref|NC_000962.3| 3448350 G 25 .,..,...,,,.,.,..,,.,.,,, G>GGCGGGGEGGGG>GG>EGGGGGG
    gi|448814763|ref|NC_000962.3| 3448351 G 25 .,..,...,,,.,.,..,,.,.,,, G3GGCGGGG0GGGG>GG>>GHGGGG
    gi|448814763|ref|NC_000962.3| 3448352 G 26 .,..,...,,,.,.,..,,.,.,,,^], G3FGCGGGGGGGGG/GG//GHGGCGE
    gi|448814763|ref|NC_000962.3| 3448353 T 26 .$,..,...,,,.,.c..,,.,.,,,, :3GGCGGGGGEGGG/GG//GHGG@GG
    gi|448814763|ref|NC_000962.3| 3448354 C 26 ,$..,...,,,.,.,..,,.,.,,,,^], >EECGGGGGCGGG/GG//G2GGCGGB
    gi|448814763|ref|NC_000962.3| 3448355 C 25 .$.$,...,,,.,.,..,,.a.,,,,, AACHHAGGGHGH/HH//H-HGCGGE
    gi|448

    Example 2: Position 67792:

    gi|448814763|ref|NC_000962.3| 67788 A 19 ..,,,,,,,..,,.,..,. GGFFGFGGGGCHHCGCBGC
    gi|448814763|ref|NC_000962.3| 67789 T 19 ..,,,,,,,..,,.,..,. GGCDGGGGGGGGGFGCCGC
    gi|448814763|ref|NC_000962.3| 67790 T 19 ..,,,,,,,..,,.,..,. GCCBGGGGGGGGGGGCCGD
    gi|448814763|ref|NC_000962.3| 67791 G 20 .$.,,,,,,,..,,.,..,.^]. EHCBGGGGGHHGGHGFFGF1
    gi|448814763|ref|NC_000962.3| 67792 C 19 TtttttttTTttTtTTtTT 8=>9:9998899999:9:0
    gi|448814763|ref|NC_000962.3| 67793 C 19 .,,,,,,,..,,.,..,.. 8=>9:9998899999:9:;
    gi|448814763|ref|NC_000962.3| 67794 G 19 .,,,,,,,..,,.,..,.. GFFGGGGGGGGHAHGGHDA


    library where it's not a SNP:

    gi|448814763|ref|NC_000962.3| 67788 A 47 ,$,$.....,,.,,,,,,....,.,,.....,,,,........,,.,.. B1GGGGGFG@GGGG1GGGC/HGH1EGGGGFFHHGGGGGGDDGGBGCA
    gi|448814763|ref|NC_000962.3| 67789 T 45 .....,,.,,,,,,....,.,,.....,,,,........,,.,.. GGGGGDBDGGCGG?GGG/EGE/GGGGEGCGGGGGGGGCAGGBGCB
    gi|448814763|ref|NC_000962.3| 67790 T 45 .....,,.,,,,,,....,.,,.....,,,,........,,.,.. GGGGGCBFGGGEGGGGG/?GE/GGGGGG@GGGGGGGGC?GGBGCB
    gi|448814763|ref|NC_000962.3| 67791 G 45 .....,,.,,,,,,....,.,,.....,,,,........,,.,.. HHGHHCBHGGGGEGHHH1?FE/HGHHBGGGGHFHHHGFFF@FGFF
    gi|448814763|ref|NC_000962.3| 67792 C 46 .....,,.,,,,,,....,.,,.....,,,,........,,.,..^], HHHHHBBHGGGGEGHHG1EH@/GHHHGGGGCHBHHHHGDGCFGFF@
    gi|448814763|ref|NC_000962.3| 67793 C 46 .....,,.,,,,,,....,.,,.....,,,,........,,.,.., HHHHHBBHGFFGEGHHH1EHCEFHHHGGGEGHEHHHHGGGCFGFF-
    gi|448814763|ref|NC_000962.3| 67794 G 46 .....,,.,,,,,,...C,.,,.....,,,,........,,.,.., GGGGGFFGGGGGBGGGC1HGFGECGGGHHGGG/GGGGGEHHBECBF
    gi|448814763|ref|NC_000962.3| 67795 C 46 .$....,,.,,,,,,....,.,,.....,,,,........,,.,.., EGGGGCFGFFGFGFGGG<HGHAGCGGGHHFHGEGGGGGEFGBHCBB


    Two things are weird. In the first library in each example, the variant position has a local drop in Phred quality score compared to the rest of the positions, AND the position just above it or just below it seems to have the exact same (lower) Phred scores. But this position is not a SNP in the second library, and doesn't have the same drop in Phred sores.

    A contrasting example -- this position is discordant between the different runs, but doesn't show the same drop in Phred scores. Here's a library where it's a SNP:

    gi|448814763|ref|NC_000962.3| 23611 G 41 ,,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. CAGGGGGGGGEGGGGGGGGF!GGGGEGGGGGGGGGACGCD'
    gi|448814763|ref|NC_000962.3| 23612 G 41 ,,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. BAGGC/GGGFAGGGGGGGGA!GGGGEFEGGGGGGG?CDGD;
    gi|448814763|ref|NC_000962.3| 23613 C 41 ,,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. BAGGC<GGGFBGGEGGGGF/!GGGGGE?GGGEEGGAC/D<G
    gi|448814763|ref|NC_000962.3| 23614 G 41 ,$,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. CFGGF<GGHHHGHGGHHHHF!HHGHCHGGFGGHGHECGG1G
    gi|448814763|ref|NC_000962.3| 23615 C 40 ,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. FGGGCCGHGHHHDGHHFHF!GGGHFHFGGHBHGHECBF1G
    gi|448814763|ref|NC_000962.3| 23616 A 40 ,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. FGGFCGGHHHHHHGHHHHF!HHGHEH2GEFHHGH0GFFGG
    gi|448814763|ref|NC_000962.3| 23617 T 40 ,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. BGG??DGHHHHHGGFHHHG!FHGHGHDGEHHHGH0CGGFG
    gi|448814763|ref|NC_000962.3| 23618 C 40 gGGGGGGggggggGggggg*ggGgGggGGgggGgGGgggG BGGGCGGHHHHHHGHHHHF!HHGHGHFGEHGHGHEGHHGG
    gi|448814763|ref|NC_000962.3| 23619 A 40 ,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. BHHHHHHHHHHHHFHHHH@!HHHHGH2HHGFHHH1GGHGH
    gi|448814763|ref|NC_000962.3| 23620 T 40 ,......,,,,,,.,,,,,*,,.,.,,..,,,.,..,,,. BGHGDFHGAGGGGHGGGG/!GGHGHG/HHGDGHGAGEGDH


    and here's a library where it's not:

    gi|448814763|ref|NC_000962.3| 23610 C 25 ,$,.......,,,.,,.,,,,,.,,, AGGGGGGCGGGGGGGGGG/GGCFDG
    gi|448814763|ref|NC_000962.3| 23611 G 24 ,.......,,,A,,.,,,,a.,,, GGGGGGCGGGG9GGGGE/G9GGBG
    gi|448814763|ref|NC_000962.3| 23612 G 24 ,.......,,,.,,.,,,,,.,,, BGGGGG/GGGG9GGGGF/G9GDGG
    gi|448814763|ref|NC_000962.3| 23613 C 24 ,.......,,,.,,.,,,,,.,,, BGGGGGBGGGGGGGGG@CGFG/CG
    gi|448814763|ref|NC_000962.3| 23614 G 24 ,.......,,,.,,.,,,,,.,,, FGGGGGCFHHHGFHGHGHHHGFDH
    gi|448814763|ref|NC_000962.3| 23615 C 25 ,.......,,,.,,.,,,,,.,,,^], BGGGGGGGHHHGHHGFFFHHG4FHE
    gi|448814763|ref|NC_000962.3| 23616 A 25 ,.......,,,.,,.,,,,,.,,,, FGGGGGCGHHHGHHGHHHHHGHFHG
    gi|448814763|ref|NC_000962.3| 23617 T 25 ,.......,,,.,,.,,,,,.,,,, FGGGGG?GHHHGHHGGHHHHGHHHB
    gi|448814763|ref|NC_000962.3| 23618 C 25 ,.......,,,.,,.,,,,,.,,,, FGGGGG@GGHHGHHGGHGHHGGFHH
    gi|448814763|ref|NC_000962.3| 23619 A 25 ,.......,,,.,,.,,,,,.,,,, BHGHHHFHHHHHHHHHHGHHGHHHH
    gi|448814763|ref|NC_000962.3| 23620 T 25 ,.......,,,.,,.,,,,,.,,,, BHHHHHFHGGGFGGHGGEGGGFDGG
    gi|448814763|ref|NC_000962.3| 23621 G 25 ,.......,,,.,,.,,,,,.,,,, BHHHHHHHGGGHGGHGG?GGHG@GG
    gi|448814763|ref|NC_000962.3| 23622 A 25 ,.......,,,.,,.,,,,,.,,,, BHHHHHHHGGGHGGHGGEGGHGGGG
    gi|448814763|ref|NC_000962.3| 23623 C 25 ,$.......,,,.,,.,,,,,.,,,, >HHHHHHHGGGHGGHGGFGGHGGGG
    gi|448814763|ref|NC_000962.3| 23624 C 24 .......,,,.,,.,,,,,.,,,, HHHHHHHGGGHGGHGGGGGHGGGG
    gi|448814763|ref|NC_000962.3| 23625 G 24 .......,,,.,,.,,,,,.,,,, GGGGGGGGGGGGGGGGGGGGGGGG

    I am seeing this "local drop" in many of the discordant SNPs. Any idea why this might be happening? And why it might be bleeding over to neighboring positions? Is this cause for alarm that these positions might be false positives?Any other suggestions for sorting out these discordant positions?

    Thanks! Any thoughts are much appreciated!

  • #2
    I don't fully understand what mpileup is doing, but I found that the drop in quality around SNPs is due to the BAQ calculation meant to reduce the calling of SNPs around indels. So I use -Q in the command line to turn off base alignment quality calculations (see BAQ in http://samtools.sourceforge.net/mpileup.shtml) and after that I didn't see the drop. I guess I should be extra careful about SNPs near indels by doing that, though.
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Techniques and Challenges in Conservation Genomics
      by seqadmin



      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

      Avian Conservation
      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
      03-08-2024, 10:41 AM
    • seqadmin
      The Impact of AI in Genomic Medicine
      by seqadmin



      Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
      02-26-2024, 02:07 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 03-14-2024, 06:13 AM
    0 responses
    32 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 03-08-2024, 08:03 AM
    0 responses
    71 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 03-07-2024, 08:13 AM
    0 responses
    80 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 03-06-2024, 09:51 AM
    0 responses
    68 views
    0 likes
    Last Post seqadmin  
    Working...
    X