Hello,
I am writing regarding a decidedly strange phenomenon I have noticed when calling SNPs using the SAMtools Pileup command. I followed the exact instructions here (omitting the -v option because I wasn't sure what it did), included the exact awk command written, and got the output listed below. Apologies for the long paste, and the markup:
gi|57116681|ref|NC_000962.2| 333892 G C 36 36 60 3 CCC ;CG
#PGRS protein
gi|57116681|ref|NC_000962.2| 337959 A C 51 51 60 8 C$CcCCcC^}C #G>GF#GC
#PGRS protein
gi|57116681|ref|NC_000962.2| 338020 A C 57 57 59 10 CCCCcccCcc #GHH###G##
#same as above
gi|57116681|ref|NC_000962.2| 338100 T C 36 36 59 3 cCC FF;
#same as above
gi|57116681|ref|NC_000962.2| 561484 G S 42 42 60 30 .$.,,..............cc.c.cc.c.^~.^~c BBFGAHHIHBGIIHIGEHGEH#HGDH?GCA
#no explanation, should be low confidence
gi|57116681|ref|NC_000962.2| 565633 T C 184 184 60 52 C$c$CCcccCcCcCccCCCccCcCCcCcCccCCccCcccCcCCcCCcCcccccc BCBCGHIGHHHHHGCHGHIDHGHFHHHIIHHIFGIIIGIGGIGGFG9H#GCA
#GENUINE SNP
gi|57116681|ref|NC_000962.2| 635633 C T 165 165 59 46 T$tTTtTTTtTtttTtTtTTtTtTttTTtTTTttTTTttTtTttTTt 7<;==>>>>#=8>;==8=<7>>;=>==>;===>=6=>=<>=>==<8
#GENUINE SNP
gi|57116681|ref|NC_000962.2| 636924 C S 27 27 60 21 ,$.......g.....g.gggg^~g >BEHHHGG#GG>GG8H#BD7:
#no explanation, should be low confidence
gi|57116681|ref|NC_000962.2| 672491 C G 36 36 60 3 ggG GGI
#PGRS protein
gi|57116681|ref|NC_000962.2| 836426 A C 36 36 60 3 c$C^~C CHC
#PGRS protein
gi|57116681|ref|NC_000962.2| 836454 A G 45 45 60 6 GGgGGg @?-?BC
#same as above
gi|57116681|ref|NC_000962.2| 839194 A G 36 36 60 3 ggg G##
#PGRS protein
gi|57116681|ref|NC_000962.2| 840496 C G 36 36 60 3 ggg GGH
#same as above
gi|57116681|ref|NC_000962.2| 1131186 C S 24 24 60 23 G,,GGGG,G,GGGG,GG,,,,,. #HI#A/<G#H##@#I3+IFHFHF
#no explanation
gi|57116681|ref|NC_000962.2| 1168718 C T 13 42 60 6 T$t$ttT, 6C=>@A
#occurs near a known indel
gi|57116681|ref|NC_000962.2| 1168720 G S 33 33 60 5 cC,.^b. GFEGC
#same as above
gi|57116681|ref|NC_000962.2| 1175022 G S 27 27 60 19 ............cc...c^~c #EHBIIDIIGHHEGHII??
#no explanation, should be low confidence
gi|57116681|ref|NC_000962.2| 1266137 C M 24 24 59 24 AAA,A,,A,AAA,,A,,,AA,,., ;##G#GHBG##@EG:HHH3#GDHB
#no explanation
gi|57116681|ref|NC_000962.2| 1408994 T K 42 42 59 41 ,$,,GG,G,.,G,GGG,GG,,G,G,,,G,G,.,,G,,,,.,^~, C<=A;=8=?=9:4=#>%@>=C>C><=#<7<>>>,>====<7
#no explanation
gi|57116681|ref|NC_000962.2| 1532827 A W 43 43 52 56 t$.tT,.,TTt..,T,TT.,....t...TT..,..,,,,.,,,,....,,.....,^~. C4<<?<;@@>=;?A?@>>?>?>>=><;A?==@<;#@@A=@A@@=>==@@>===<:C
#PPE protein, should be low confidence
gi|57116681|ref|NC_000962.2| 1533434 A R 26 26 35 47 ,$,.,..,,,,....,..,.g.gG.,.G,.gggG.cgc,c.g.g.g^?G^?g @=>>#==9<=5>>><>>:>#<#F=#=H>>>A#H=#####=#=#<#C#
#same as above
gi|57116681|ref|NC_000962.2| 2153410 A G 159 159 59 44 G$gGGgGGGgggGggGggggGgGggGGgGGgGGGggGgGGGgGgg ;F@AGFCAFFF=FGFGFGGFGFGE>EEEEGFEFGHEHE;EGFF#
#GENUINE SNP
gi|57116681|ref|NC_000962.2| 2748136 G S 36 36 60 29 .$.,....A...........c.c.c..ccc 8#FBF##>1#@=@#EF@HH#HFHDHIC?C
#no explanation, should be low confidence
gi|57116681|ref|NC_000962.2| 3054724 A G 39 39 60 4 GgGg #H##
#PGRS protein
gi|57116681|ref|NC_000962.2| 3091658 C S 51 51 60 31 .$..,g......g...g.ggg.g.g..ggg.g BABH,FEGGGF=GGGEF8D-F-F4GG?B#F#
#no explanation
gi|57116681|ref|NC_000962.2| 3120431 A G 45 45 60 6 GGGg^~G^~g FDD?C<
#no explanation
gi|57116681|ref|NC_000962.2| 3248308 A C 2 33 60 3 c$c^}. CGC
#no explanation
gi|57116681|ref|NC_000962.2| 3379763 G A 22 22 35 12 .$,,,...aaAA^wa ;GGGIIG##>=#
#PPE protein
gi|57116681|ref|NC_000962.2| 3839278 T K 33 33 58 24 ,GG,G,GGG,GG,.,G,G,,G,., >#B=D>##3=0@?#?4=#>>#>==
#no explanation
gi|57116681|ref|NC_000962.2| 3934733 G C 25 33 35 6 ,...cc F###IB
#PGRS protein
gi|57116681|ref|NC_000962.2| 3934734 G A 28 30 35 6 ,$C..aa C###>=
#same as above
gi|57116681|ref|NC_000962.2| 4095002 G A 2 21 56 3 Aa^~C <=C
#occurs near a known indel
gi|57116681|ref|NC_000962.2| 4400663 C G 5 33 60 3 gG, EA@
#occurs near a known indel
gi|57116681|ref|NC_000962.2| 4400664 G C 2 30 60 3 cC, E=C
#same as above
Most of the SNPs that have been called are in repetitive regions, referred to here as PGRS or PPE proteins; these are not accurate and we are not worried about them. Several genuine SNPs occurred; I have labeled those as such.
The intriguing SNPs are the ones that I have labeled "no explanation". We do not expect that these are SNPs, but we don't know what to make of them. As you will notice, nearly all of them feature a phenomenon where all of the reads in one direction match the reference, and all the reads in the other direction feature a SNP, resulting in an appearance of diploidy or polyclonality. We are fairly sure this is not the case, and as far as we can tell, these SNPs are not in repetitive regions or regions that should be giving the aligner difficulty.
Is there any precedent for this phenomenon? I would be happy to provide more data if it is of interest.
As a closing note, I would like to observe that SAMtools - and Maq before it - have been greatly helpful to our lab, and that, even in the sequencing runs above, we found our answer; we are merely curious about the phenomenon noted above.
Regards,
Carl Wivagg
I am writing regarding a decidedly strange phenomenon I have noticed when calling SNPs using the SAMtools Pileup command. I followed the exact instructions here (omitting the -v option because I wasn't sure what it did), included the exact awk command written, and got the output listed below. Apologies for the long paste, and the markup:
gi|57116681|ref|NC_000962.2| 333892 G C 36 36 60 3 CCC ;CG
#PGRS protein
gi|57116681|ref|NC_000962.2| 337959 A C 51 51 60 8 C$CcCCcC^}C #G>GF#GC
#PGRS protein
gi|57116681|ref|NC_000962.2| 338020 A C 57 57 59 10 CCCCcccCcc #GHH###G##
#same as above
gi|57116681|ref|NC_000962.2| 338100 T C 36 36 59 3 cCC FF;
#same as above
gi|57116681|ref|NC_000962.2| 561484 G S 42 42 60 30 .$.,,..............cc.c.cc.c.^~.^~c BBFGAHHIHBGIIHIGEHGEH#HGDH?GCA
#no explanation, should be low confidence
gi|57116681|ref|NC_000962.2| 565633 T C 184 184 60 52 C$c$CCcccCcCcCccCCCccCcCCcCcCccCCccCcccCcCCcCCcCcccccc BCBCGHIGHHHHHGCHGHIDHGHFHHHIIHHIFGIIIGIGGIGGFG9H#GCA
#GENUINE SNP
gi|57116681|ref|NC_000962.2| 635633 C T 165 165 59 46 T$tTTtTTTtTtttTtTtTTtTtTttTTtTTTttTTTttTtTttTTt 7<;==>>>>#=8>;==8=<7>>;=>==>;===>=6=>=<>=>==<8
#GENUINE SNP
gi|57116681|ref|NC_000962.2| 636924 C S 27 27 60 21 ,$.......g.....g.gggg^~g >BEHHHGG#GG>GG8H#BD7:
#no explanation, should be low confidence
gi|57116681|ref|NC_000962.2| 672491 C G 36 36 60 3 ggG GGI
#PGRS protein
gi|57116681|ref|NC_000962.2| 836426 A C 36 36 60 3 c$C^~C CHC
#PGRS protein
gi|57116681|ref|NC_000962.2| 836454 A G 45 45 60 6 GGgGGg @?-?BC
#same as above
gi|57116681|ref|NC_000962.2| 839194 A G 36 36 60 3 ggg G##
#PGRS protein
gi|57116681|ref|NC_000962.2| 840496 C G 36 36 60 3 ggg GGH
#same as above
gi|57116681|ref|NC_000962.2| 1131186 C S 24 24 60 23 G,,GGGG,G,GGGG,GG,,,,,. #HI#A/<G#H##@#I3+IFHFHF
#no explanation
gi|57116681|ref|NC_000962.2| 1168718 C T 13 42 60 6 T$t$ttT, 6C=>@A
#occurs near a known indel
gi|57116681|ref|NC_000962.2| 1168720 G S 33 33 60 5 cC,.^b. GFEGC
#same as above
gi|57116681|ref|NC_000962.2| 1175022 G S 27 27 60 19 ............cc...c^~c #EHBIIDIIGHHEGHII??
#no explanation, should be low confidence
gi|57116681|ref|NC_000962.2| 1266137 C M 24 24 59 24 AAA,A,,A,AAA,,A,,,AA,,., ;##G#GHBG##@EG:HHH3#GDHB
#no explanation
gi|57116681|ref|NC_000962.2| 1408994 T K 42 42 59 41 ,$,,GG,G,.,G,GGG,GG,,G,G,,,G,G,.,,G,,,,.,^~, C<=A;=8=?=9:4=#>%@>=C>C><=#<7<>>>,>====<7
#no explanation
gi|57116681|ref|NC_000962.2| 1532827 A W 43 43 52 56 t$.tT,.,TTt..,T,TT.,....t...TT..,..,,,,.,,,,....,,.....,^~. C4<<?<;@@>=;?A?@>>?>?>>=><;A?==@<;#@@A=@A@@=>==@@>===<:C
#PPE protein, should be low confidence
gi|57116681|ref|NC_000962.2| 1533434 A R 26 26 35 47 ,$,.,..,,,,....,..,.g.gG.,.G,.gggG.cgc,c.g.g.g^?G^?g @=>>#==9<=5>>><>>:>#<#F=#=H>>>A#H=#####=#=#<#C#
#same as above
gi|57116681|ref|NC_000962.2| 2153410 A G 159 159 59 44 G$gGGgGGGgggGggGggggGgGggGGgGGgGGGggGgGGGgGgg ;F@AGFCAFFF=FGFGFGGFGFGE>EEEEGFEFGHEHE;EGFF#
#GENUINE SNP
gi|57116681|ref|NC_000962.2| 2748136 G S 36 36 60 29 .$.,....A...........c.c.c..ccc 8#FBF##>1#@=@#EF@HH#HFHDHIC?C
#no explanation, should be low confidence
gi|57116681|ref|NC_000962.2| 3054724 A G 39 39 60 4 GgGg #H##
#PGRS protein
gi|57116681|ref|NC_000962.2| 3091658 C S 51 51 60 31 .$..,g......g...g.ggg.g.g..ggg.g BABH,FEGGGF=GGGEF8D-F-F4GG?B#F#
#no explanation
gi|57116681|ref|NC_000962.2| 3120431 A G 45 45 60 6 GGGg^~G^~g FDD?C<
#no explanation
gi|57116681|ref|NC_000962.2| 3248308 A C 2 33 60 3 c$c^}. CGC
#no explanation
gi|57116681|ref|NC_000962.2| 3379763 G A 22 22 35 12 .$,,,...aaAA^wa ;GGGIIG##>=#
#PPE protein
gi|57116681|ref|NC_000962.2| 3839278 T K 33 33 58 24 ,GG,G,GGG,GG,.,G,G,,G,., >#B=D>##3=0@?#?4=#>>#>==
#no explanation
gi|57116681|ref|NC_000962.2| 3934733 G C 25 33 35 6 ,...cc F###IB
#PGRS protein
gi|57116681|ref|NC_000962.2| 3934734 G A 28 30 35 6 ,$C..aa C###>=
#same as above
gi|57116681|ref|NC_000962.2| 4095002 G A 2 21 56 3 Aa^~C <=C
#occurs near a known indel
gi|57116681|ref|NC_000962.2| 4400663 C G 5 33 60 3 gG, EA@
#occurs near a known indel
gi|57116681|ref|NC_000962.2| 4400664 G C 2 30 60 3 cC, E=C
#same as above
Most of the SNPs that have been called are in repetitive regions, referred to here as PGRS or PPE proteins; these are not accurate and we are not worried about them. Several genuine SNPs occurred; I have labeled those as such.
The intriguing SNPs are the ones that I have labeled "no explanation". We do not expect that these are SNPs, but we don't know what to make of them. As you will notice, nearly all of them feature a phenomenon where all of the reads in one direction match the reference, and all the reads in the other direction feature a SNP, resulting in an appearance of diploidy or polyclonality. We are fairly sure this is not the case, and as far as we can tell, these SNPs are not in repetitive regions or regions that should be giving the aligner difficulty.
Is there any precedent for this phenomenon? I would be happy to provide more data if it is of interest.
As a closing note, I would like to observe that SAMtools - and Maq before it - have been greatly helpful to our lab, and that, even in the sequencing runs above, we found our answer; we are merely curious about the phenomenon noted above.
Regards,
Carl Wivagg
Comment