Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SAMtools Pileup Strand Weirdness

    Hello,

    I am writing regarding a decidedly strange phenomenon I have noticed when calling SNPs using the SAMtools Pileup command. I followed the exact instructions here (omitting the -v option because I wasn't sure what it did), included the exact awk command written, and got the output listed below. Apologies for the long paste, and the markup:

    gi|57116681|ref|NC_000962.2| 333892 G C 36 36 60 3 CCC ;CG
    #PGRS protein
    gi|57116681|ref|NC_000962.2| 337959 A C 51 51 60 8 C$CcCCcC^}C #G>GF#GC
    #PGRS protein
    gi|57116681|ref|NC_000962.2| 338020 A C 57 57 59 10 CCCCcccCcc #GHH###G##
    #same as above
    gi|57116681|ref|NC_000962.2| 338100 T C 36 36 59 3 cCC FF;
    #same as above
    gi|57116681|ref|NC_000962.2| 561484 G S 42 42 60 30 .$.,,..............cc.c.cc.c.^~.^~c BBFGAHHIHBGIIHIGEHGEH#HGDH?GCA
    #no explanation, should be low confidence
    gi|57116681|ref|NC_000962.2| 565633 T C 184 184 60 52 C$c$CCcccCcCcCccCCCccCcCCcCcCccCCccCcccCcCCcCCcCcccccc BCBCGHIGHHHHHGCHGHIDHGHFHHHIIHHIFGIIIGIGGIGGFG9H#GCA
    #GENUINE SNP
    gi|57116681|ref|NC_000962.2| 635633 C T 165 165 59 46 T$tTTtTTTtTtttTtTtTTtTtTttTTtTTTttTTTttTtTttTTt 7<;==>>>>#=8>;==8=<7>>;=>==>;===>=6=>=<>=>==<8
    #GENUINE SNP
    gi|57116681|ref|NC_000962.2| 636924 C S 27 27 60 21 ,$.......g.....g.gggg^~g >BEHHHGG#GG>GG8H#BD7:
    #no explanation, should be low confidence
    gi|57116681|ref|NC_000962.2| 672491 C G 36 36 60 3 ggG GGI
    #PGRS protein
    gi|57116681|ref|NC_000962.2| 836426 A C 36 36 60 3 c$C^~C CHC
    #PGRS protein
    gi|57116681|ref|NC_000962.2| 836454 A G 45 45 60 6 GGgGGg @?-?BC
    #same as above
    gi|57116681|ref|NC_000962.2| 839194 A G 36 36 60 3 ggg G##
    #PGRS protein
    gi|57116681|ref|NC_000962.2| 840496 C G 36 36 60 3 ggg GGH
    #same as above
    gi|57116681|ref|NC_000962.2| 1131186 C S 24 24 60 23 G,,GGGG,G,GGGG,GG,,,,,. #HI#A/<G#H##@#I3+IFHFHF
    #no explanation
    gi|57116681|ref|NC_000962.2| 1168718 C T 13 42 60 6 T$t$ttT, 6C=>@A
    #occurs near a known indel
    gi|57116681|ref|NC_000962.2| 1168720 G S 33 33 60 5 cC,.^b. GFEGC
    #same as above
    gi|57116681|ref|NC_000962.2| 1175022 G S 27 27 60 19 ............cc...c^~c #EHBIIDIIGHHEGHII??
    #no explanation, should be low confidence
    gi|57116681|ref|NC_000962.2| 1266137 C M 24 24 59 24 AAA,A,,A,AAA,,A,,,AA,,., ;##G#GHBG##@EG:HHH3#GDHB
    #no explanation
    gi|57116681|ref|NC_000962.2| 1408994 T K 42 42 59 41 ,$,,GG,G,.,G,GGG,GG,,G,G,,,G,G,.,,G,,,,.,^~, C<=A;=8=?=9:4=#>%@>=C>C><=#<7<>>>,>====<7
    #no explanation
    gi|57116681|ref|NC_000962.2| 1532827 A W 43 43 52 56 t$.tT,.,TTt..,T,TT.,....t...TT..,..,,,,.,,,,....,,.....,^~. C4<<?<;@@>=;?A?@>>?>?>>=><;A?==@<;#@@A=@A@@=>==@@>===<:C
    #PPE protein, should be low confidence
    gi|57116681|ref|NC_000962.2| 1533434 A R 26 26 35 47 ,$,.,..,,,,....,..,.g.gG.,.G,.gggG.cgc,c.g.g.g^?G^?g @=>>#==9<=5>>><>>:>#<#F=#=H>>>A#H=#####=#=#<#C#
    #same as above
    gi|57116681|ref|NC_000962.2| 2153410 A G 159 159 59 44 G$gGGgGGGgggGggGggggGgGggGGgGGgGGGggGgGGGgGgg ;F@AGFCAFFF=FGFGFGGFGFGE>EEEEGFEFGHEHE;EGFF#
    #GENUINE SNP
    gi|57116681|ref|NC_000962.2| 2748136 G S 36 36 60 29 .$.,....A...........c.c.c..ccc 8#FBF##>1#@=@#EF@HH#HFHDHIC?C
    #no explanation, should be low confidence
    gi|57116681|ref|NC_000962.2| 3054724 A G 39 39 60 4 GgGg #H##
    #PGRS protein
    gi|57116681|ref|NC_000962.2| 3091658 C S 51 51 60 31 .$..,g......g...g.ggg.g.g..ggg.g BABH,FEGGGF=GGGEF8D-F-F4GG?B#F#
    #no explanation
    gi|57116681|ref|NC_000962.2| 3120431 A G 45 45 60 6 GGGg^~G^~g FDD?C<
    #no explanation
    gi|57116681|ref|NC_000962.2| 3248308 A C 2 33 60 3 c$c^}. CGC
    #no explanation
    gi|57116681|ref|NC_000962.2| 3379763 G A 22 22 35 12 .$,,,...aaAA^wa ;GGGIIG##>=#
    #PPE protein
    gi|57116681|ref|NC_000962.2| 3839278 T K 33 33 58 24 ,GG,G,GGG,GG,.,G,G,,G,., >#B=D>##3=0@?#?4=#>>#>==
    #no explanation
    gi|57116681|ref|NC_000962.2| 3934733 G C 25 33 35 6 ,...cc F###IB
    #PGRS protein
    gi|57116681|ref|NC_000962.2| 3934734 G A 28 30 35 6 ,$C..aa C###>=
    #same as above
    gi|57116681|ref|NC_000962.2| 4095002 G A 2 21 56 3 Aa^~C <=C
    #occurs near a known indel
    gi|57116681|ref|NC_000962.2| 4400663 C G 5 33 60 3 gG, EA@
    #occurs near a known indel
    gi|57116681|ref|NC_000962.2| 4400664 G C 2 30 60 3 cC, E=C
    #same as above

    Most of the SNPs that have been called are in repetitive regions, referred to here as PGRS or PPE proteins; these are not accurate and we are not worried about them. Several genuine SNPs occurred; I have labeled those as such.

    The intriguing SNPs are the ones that I have labeled "no explanation". We do not expect that these are SNPs, but we don't know what to make of them. As you will notice, nearly all of them feature a phenomenon where all of the reads in one direction match the reference, and all the reads in the other direction feature a SNP, resulting in an appearance of diploidy or polyclonality. We are fairly sure this is not the case, and as far as we can tell, these SNPs are not in repetitive regions or regions that should be giving the aligner difficulty.

    Is there any precedent for this phenomenon? I would be happy to provide more data if it is of interest.

    As a closing note, I would like to observe that SAMtools - and Maq before it - have been greatly helpful to our lab, and that, even in the sequencing runs above, we found our answer; we are merely curious about the phenomenon noted above.

    Regards,
    Carl Wivagg

  • #2
    What aligner are you using? Samtools may work well with some but worse with others. Does it do gapped alignment? As for strand bias, people from Broad Institute found that when SNPs are supported from reads on one strand only, they tend to be wrong. This may be caused by duplicates, wrong alignments, context-related error dependency and other factors. You may perform a test to rule out such SNPs. In addition, if you have deep depth, you may consider to increase varFilter -d to require higher coverage on SNPs. The miscalls in repetitive proteins look also mysterious to me because the high mapping quality indicate the alignments look reliable. Were you aligning against the whole genome or targeted region only?

    Comment


    • #3
      This is using the Maq aligner, but using only single-end reads, so if I understand how Maq works correctly, that means it's not doing any gapped alignment? My apologies for omitting this critical piece of information.

      In any case, it's my strong suspicion that the SNPs are wrong, and you're right that turning up the varFilter or awking a bit differently would stop these SNPs from showing up. I just thought that the strandedness phenomenon was interesting, and if it had come up before or it was known what caused it.

      You raise a good point about the reads in repetitive regions; I guess if the mapping quality is good, I should be trusting the alignment. In response to your question, it was aligned against the whole genome... so I have no reason to think that it would align better somewhere else that the aligner just didn't see, because the mapping quality takes into account the whole genome. That having been said, the indicated SNP confidence is low (lower, at any rate) than for the other SNPs... I guess I don't know what to make of these regions either.

      Thank you for the quick response!

      Carl

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Essential Discoveries and Tools in Epitranscriptomics
        by seqadmin


        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
        Yesterday, 07:01 AM
      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      55 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      51 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      45 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      55 views
      0 likes
      Last Post seqadmin  
      Working...
      X