  • What about mutations in the "twilight zone"?

    Dear fellow NGSers,

    From what I've seen so far, I am able to use available tools (samtools, gatk, etc) to find SNVs and INDELs. For example:


    TGACTTGCTGA Reference
    TGACTCGCTGA Read 1
    TGACTCGCTGA Read 2 etc..



    TGACTTGCTGA Reference
    TGACT---TGA Read 1
    TGACT---TGA Read 2 etc..


    Regarding the detection of mutations, how does one handle mutations that are not Single Nucleotide Variants and that are not Insertion/Deletions (Indels) ?

    What if, for example, you have two neighboring SNV mutations detected inside your reads?


    AGACTAGATCA Reference
    AGACACGATCA Read 1
    AGACACGATCA Read 2 etc..


    Are these recorded in samtools pileup as two separate SNVs, or can they be detected as belonging together?

    Or what about a deletion of sequence and the insertion of new sequence?


    AGACTAGA-TCA Reference
    AGAGATAAGTCA Read 1
    AGAGATAAGTCA Read 2 etc..


    It seems to me that most of the tools out there can handle the simple SNV/indel scenarios but do not take such cases into account. Does samtools pileup capture these kinds of mutations?

    Perhaps my assumption is wrong and some of the available tools handle them?

    Thanks for any input.

  • #2
    New paper out (not yet in Medline! -- that's how new it is) addresses this issue but does not appear to contain specific code


    Nucleic Acids Research, doi:10.1093/nar/gkq408

    Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing

    Jeffrey A. Rosenfeld, Anil K. Malhotra and Todd Lencz

    Genomic sequence comparisons between individuals are usually restricted to the analysis of single nucleotide polymorphisms (SNPs). While the interrogation of SNPs is efficient, they are not the only form of divergence between genomes. In this report, we expand the scope of polymorphism detection by investigating the occurrence of double nucleotide polymorphisms (DNPs) and triple nucleotide polymorphisms (TNPs), in which two or three consecutive nucleotides are altered compared to the reference sequence. We have found such DNPs and TNPs throughout two complete genomes and eight exomes. Within exons, these novel polymorphisms are over-represented amongst protein-altering variants; nearly all DNPs and TNPs result in a change in amino acid sequence and, in some cases, two adjacent amino acids are changed. DNPs and TNPs represent a potentially important new source of genetic variation which may underlie human disease and they should be included in future medical genetics studies. As a confirmation of the damaging nature of xNPs, we have identified changes in the exome of a glioblastoma cell line that are important in glioblastoma pathogenesis. We have found a TNP causing a single amino acid change in LAMC2 and a TNP causing a truncation of HUWE1.
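The abstract's definition of DNPs and TNPs (two or three consecutive altered bases) can be illustrated with a minimal Python sketch. This is not the paper's method; it assumes an ungapped read already placed at a known offset, and the function name `call_xnps` is made up for illustration. It simply merges runs of consecutive mismatches into one call instead of reporting each base separately.

```python
def call_xnps(reference: str, read: str, offset: int = 0):
    """Group runs of consecutive mismatches between an ungapped read and the reference.

    Returns a list of (1-based position, ref bases, alt bases, label) tuples.
    Hypothetical helper for illustration only.
    """
    if offset + len(read) > len(reference):
        raise ValueError("read extends past the end of the reference")
    labels = {1: "SNV", 2: "DNP", 3: "TNP"}
    calls = []
    run_start = None  # index in the read where the current mismatch run began
    # Iterate one step past the end so a run touching the last base is closed.
    for i in range(len(read) + 1):
        mismatch = i < len(read) and read[i] != reference[offset + i]
        if mismatch and run_start is None:
            run_start = i
        elif not mismatch and run_start is not None:
            ref_bases = reference[offset + run_start:offset + i]
            alt_bases = read[run_start:i]
            label = labels.get(len(alt_bases), "MNP")  # longer runs get a generic label
            calls.append((offset + run_start + 1, ref_bases, alt_bases, label))
            run_start = None
    return calls


# The two-adjacent-SNV example from the first post
print(call_xnps("AGACTAGATCA", "AGACACGATCA"))  # → [(5, 'TA', 'AC', 'DNP')]
```

A per-position pileup would list the same two bases as separate rows; the grouping step is what turns them into a single DNP call.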



    • #3
      I think there are two subquestions there: will an aligner align reads that are more disparate than a single base change, and what will a variant parser make of them?

      DNPs will probably be fine, even TNPs if your aligner handles 3 mismatches in a read. And I don't see why a variant parser would have a hard time with that.

      Your last example is the hard one, as most aligners just wouldn't align reads with a 5-base discrepancy. What you'd see is a steep drop-off in coverage right over the change, possibly with the edges of the discrepancy called as SNPs, because some reads will land in exactly the right place to just cover an edge and will align with only 1 or 2 discrepancies at the end of the read. In theory, if you fixed your reference genome to match at those two letters and then realigned, you'd get more reads aligning, and maybe you'd cover the whole region with reads after enough iterations. But aligning a second time probably isn't feasible for many genomes.
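The coverage drop-off described above can be reproduced with a toy model. This is only a sketch under strong assumptions: reads are tiled at every offset, each read is tested only at its true position, alignment is ungapped, and the 5-base change is a same-length substitution rather than the length-changing example from the first post. The reference sequence and all names are invented for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def coverage_with_mismatch_cap(reference, sample, read_len, max_mismatches):
    """Per-base coverage when only reads within max_mismatches of the reference align."""
    if len(reference) != len(sample):
        raise ValueError("toy model: sample must be the same length as the reference")
    coverage = [0] * len(reference)
    for start in range(len(sample) - read_len + 1):
        read = sample[start:start + read_len]
        # A read "aligns" only if it is close enough to the reference at its origin.
        if hamming(read, reference[start:start + read_len]) <= max_mismatches:
            for pos in range(start, start + read_len):
                coverage[pos] += 1
    return coverage


# Hypothetical 40-base reference; the sample differs at 5 consecutive bases (indices 18-22).
reference = "ACGTTGCAAGCTTAGGCATCGATCCGTAAGCTTGCAATGC"
swap = {"A": "C", "C": "A", "G": "T", "T": "G"}  # guarantees every swapped base mismatches
sample = reference[:18] + "".join(swap[b] for b in reference[18:23]) + reference[23:]

coverage = coverage_with_mismatch_cap(reference, sample, read_len=10, max_mismatches=2)
print(coverage)
```

With 10-base reads and a cap of 2 mismatches, no read covers the middle of the changed block, while a couple of reads still cover each edge of it, which is where the spurious edge SNP calls would come from.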

      De novo would catch those kinds of things, if your sample was mostly clonal or homozygous for that large change. Compare your de novo to your reference, and you'd see the discrepancy fine.



      • #4
        krobison:

        wow! thanks a lot for sharing this paper with me - this is definitely hot off the press and on topic!

        swbarnes2:

        >I think there's two subquestions there; will an aligner align reads that are more disparate than a single base change, and what will a variant parser make of them?

        Yes, this is true: there are two parts to detection, the alignment and the variant parser. I would think that newer aligners such as BWA/BFAST/Novoalign can handle mismatches and indels >3bp. Bowtie maxes out at 3bp.

        >DNPs will probably be fine, even TNPs if your aligner handles 3 mismatches in a read. And I don't see why a variant parser would have a hard time with that.

        The variant parser is really where I am concerned, because from the samtools pileup output it looks like neighboring SNVs get treated separately rather than together. The point is that if your short reads capture two SNVs within one read length, then you can assign these two mutations to the same allele. In heterozygous situations, treating them separately leaves it ambiguous: it could equally mean that one allele has mutation 1 and the other allele has mutation 2.

        Please correct me if my genetic vocabulary use is wrong.

        Thanks for joining the discussion and sharing your input. It's great to bounce off ideas and hear back from others
        Last edited by NGSfan; 05-25-2010, 12:12 AM.



        • #5
          Originally posted by NGSfan View Post
          What if, for example, you have two neighboring SNV mutations detected inside your reads?

          ...

          Or what about a deletion of sequence and the insertion of new sequence?

          ...

          It seems to me that most of the tools out there can handle the simple SNV/indel scenarios but do not take such cases into account. Does samtools pileup capture these kinds of mutations?

          Perhaps my assumption is wrong and some of the available tools handle them?

          Thanks for any input.
          In pure sequence terms, I don't think there is a difference between two SNVs right next to each other and a complex indel where two neighboring bases are removed and replaced with two other bases. Those two events will look identical when two sequences are side by side.

          I believe many aligners and at least samtools for variant calling are indeed robust to these types of events and that they will usually mark them as indels because they do not necessarily constrain indels to a particular size, but they likely do constrain SNVs to one base (since they are, after all, single nucleotide variants). I guess that if a variant caller sees a spot where two expected bases in a row are missing, it flags that spot as a deletion, and if it sees a spot where two unexpected bases are present, it flags that as an insertion.

          Therefore, it seems reasonable that such events will be flagged as a deletion and an insertion directly adjacent to each other. (In fact, in the back of my mind, I feel like I've seen that very type of thing before in our own whole genome alignments... maybe just in aberrant reads, though.)

          As for the indel example, as long as your aligner is robust against that (gapped aligners should be), that spot will similarly be flagged as both a deletion and an insertion adjacent to each other.

          Also, for the case where there are repetitive elements that make the exact position of that sort of event ambiguous, I believe people generally either left-justify them or randomly position them.
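Left-justifying an ambiguous event in a repeat can be sketched as follows. This is a minimal illustration for a simple deletion only, not the procedure of any particular tool; the function names and the example sequence are invented. It relies on the fact that shifting a deletion one base to the left gives the same resulting sequence whenever the base just before the deletion equals the last deleted base.

```python
def apply_deletion(reference: str, start: int, length: int) -> str:
    """Return the sequence with reference[start:start + length] removed (0-based start)."""
    return reference[:start] + reference[start + length:]


def left_align_deletion(reference: str, start: int, length: int) -> int:
    """Return the left-most 0-based start of a deletion equivalent to the given one."""
    # Shift left while the base entering the window matches the base leaving it.
    while start > 0 and reference[start - 1] == reference[start + length - 1]:
        start -= 1
    return start


# Hypothetical reference with a CA repeat; a 2-base deletion reported at index 8
reference = "GGCACACACATT"
aligned_start = left_align_deletion(reference, 8, 2)
print(aligned_start)  # → 2
print(apply_deletion(reference, 8, 2) == apply_deletion(reference, aligned_start, 2))  # → True
```

Any start between the two positions describes the same deleted allele, which is why a fixed convention such as left-justification makes calls comparable.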
          Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
          Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
          Projects: U87MG whole genome sequence [Website] [Paper]



          • #6
            Originally posted by Michael.James.Clark View Post
            In pure sequence terms, I don't think there is a difference between two SNVs right next to each other and a complex indel where two neighboring bases are removed and replaced with two other bases. Those two events will look identical when two sequences are side by side.

            I believe many aligners and at least samtools for variant calling are indeed robust to these types of events and that they will usually mark them as indels because they do not necessarily constrain indels to a particular size, but they likely do constrain SNVs to one base (since they are, after all, single nucleotide variants). I guess that if a variant caller sees a spot where two expected bases in a row are missing, it flags that spot as a deletion, and if it sees a spot where two unexpected bases are present, it flags that as an insertion.
            You're right that in pure sequence terms, it will not make a difference, since you are just recording changes. But it will make a difference perhaps, when you want to distinguish alleles:

            The following would get reported as g.6T>C and g.7T>G:


            TGACTTTGCTGA Reference
            TGACTCTGCTGA Read 1
            TGACTCTGCTGA Read 2
            TGACTCTGCTGA Read 3
            TGACTTGGCTGA Read 4
            TGACTTGGCTGA Read 5
            TGACTTGGCTGA Read 6 etc..



            And if my understanding of samtools pileup is correct, so would this case:


            TGACTTTGCTGA Reference
            TGACTCGGCTGA Read 1
            TGACTCGGCTGA Read 2
            TGACTCGGCTGA Read 3
            TGACTTTGCTGA Read 4
            TGACTTTGCTGA Read 5
            TGACTTTGCTGA Read 6 etc..



            So while both are recorded as g.6T>C and g.7T>G at the end of the day, the problem is that they are really different kinds of mutation. The first alignment tells you there are two alleles, each carrying a different mutation, while the second tells you that one allele carries both. I think it is important to distinguish this, no?
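The distinction between the two alignments can be made by looking at which bases occur together on the same read. Below is a minimal Python sketch, assuming ungapped reads that all span the full toy reference; the function `phase_two_sites` and its return labels are made up for illustration and are not part of samtools.

```python
from collections import Counter


def phase_two_sites(reference, reads, pos_a, pos_b):
    """Classify two variant sites (1-based positions) by read-level co-occurrence."""
    ref_a, ref_b = reference[pos_a - 1], reference[pos_b - 1]
    # Count the pair of bases each read shows at the two sites.
    pairs = Counter((read[pos_a - 1], read[pos_b - 1]) for read in reads)
    both = sum(n for (a, b), n in pairs.items() if a != ref_a and b != ref_b)
    only_a = sum(n for (a, b), n in pairs.items() if a != ref_a and b == ref_b)
    only_b = sum(n for (a, b), n in pairs.items() if a == ref_a and b != ref_b)
    if both and not only_a and not only_b:
        return "same allele"        # one haplotype carries both changes
    if only_a and only_b and not both:
        return "different alleles"  # each haplotype carries one change
    return "unresolved"


reference = "TGACTTTGCTGA"
case_1 = ["TGACTCTGCTGA"] * 3 + ["TGACTTGGCTGA"] * 3
case_2 = ["TGACTCGGCTGA"] * 3 + ["TGACTTTGCTGA"] * 3

print(phase_two_sites(reference, case_1, 6, 7))  # → different alleles
print(phase_two_sites(reference, case_2, 6, 7))  # → same allele
```

Both cases give the same per-position summary (g.6T>C and g.7T>G), so the information only survives if the caller keeps the read-level pairing.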


            Originally posted by Michael.James.Clark View Post
            Therefore, it seems reasonable that such events will be flagged as a deletion and an insertion directly adjacent to each other. (In fact, in the back of my mind, I feel like I've seen that very type of thing before in our own whole genome alignments... maybe just in aberrant reads, though.)

            As for the indel example, as long as your aligner is robust against that (gapped aligners should be), that spot will similarly be flagged as both a deletion and an insertion adjacent to each other.

            Also, for the case where there are repetitive elements that make the exact position of that sort of event ambiguous, I believe people generally either left-justify them or randomly position them.
            These are definitely difficult alignment situations, because they involve two events: first a deletion, then an insertion. I am using BFAST, which for the most part handles indels pretty well. I am just thinking of scenarios where the change is not just a deletion *or* an insertion but where both happened.



            • #7
              Originally posted by NGSfan View Post
              You're right that in pure sequence terms, it will not make a difference, since you are just recording changes. But it will make a difference perhaps, when you want to distinguish alleles:

              The following would get reported as g.6T>C and g.7T>G:


              TGACTTTGCTGA Reference
              TGACTCTGCTGA Read 1
              TGACTCTGCTGA Read 2
              TGACTCTGCTGA Read 3
              TGACTTGGCTGA Read 4
              TGACTTGGCTGA Read 5
              TGACTTGGCTGA Read 6 etc..



              And if my understanding of samtools pileup is correct, so would this case:


              TGACTTTGCTGA Reference
              TGACTCGGCTGA Read 1
              TGACTCGGCTGA Read 2
              TGACTCGGCTGA Read 3
              TGACTTTGCTGA Read 4
              TGACTTTGCTGA Read 5
              TGACTTTGCTGA Read 6 etc..



              So while both are recorded as g.6T>C and g.7T>G at the end of the day, the problem is that they are really different kinds of mutation. The first alignment tells you there are two alleles, each carrying a different mutation, while the second tells you that one allele carries both. I think it is important to distinguish this, no?
              But those aren't the same by sequence because they aren't occurring on the same haplotype. I doubt it would be reported as the same type of event because the first case should be called as two adjacent SNVs since they're happening on separate haplotypes while the second one is a deletion adjacent to an insertion because it's happening on the same haplotype.

              These are definitely difficult alignment situations, because they involve two events: first a deletion, then an insertion. I am using BFAST, which for the most part handles indels pretty well. I am just thinking of scenarios where the change is not just a deletion *or* an insertion but where both happened.
              Like I said, in the back of my mind I recall seeing reads like this without a problem, and that's using BFAST. I'm not actually sure about samtools calling a variant like this because I don't recall seeing a variant like this (I think the closest I've seen is a deletion adjacent to a SNV). I encourage you to test it with a simulation if you're concerned with it, though.

