  • Ultra-rare variant detection -- impossible?

    What if I want to know whether an oncogene has acquired a single base mutation in a tumor, even if it's only present in 1 of every 10,000 cells? What if I want to know the prevalence of single nucleotide transcription errors for a specific transcript, even errors present at 1 in 20,000?

    NGS technologies have an error rate of 1/1000 - 1/100 per base*read. Does this make the above two problems impossible to solve with NGS, even if I get millions of reads covering the regions in question?

    I wanted to get the forum's thoughts on the above, in addition to hearing whether there are any publications addressing this, as I've found few. Can we think of any workarounds, either informatic or wet? How should we set thresholds for the minimum frequency at which a variant would be confidently detectable?
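One way to put rough numbers on the threshold question: a minimal sketch, assuming sequencing errors are independent, spread evenly over the three wrong bases, and well approximated by a Poisson background. The function names and the `alpha` cutoff are illustrative choices, not taken from any published variant caller.

```python
from math import exp, lgamma, log

def poisson_tail(mean, k):
    """P(X >= k) for X ~ Poisson(mean), summing upper-tail terms directly."""
    total = 0.0
    i = k
    while True:
        term = exp(-mean + i * log(mean) - lgamma(i + 1))
        total += term
        # Stop once we are past the mode and further terms are negligible.
        if i > mean and term < 1e-18 * max(total, 1e-300):
            break
        i += 1
    return min(total, 1.0)

def min_detectable_frequency(depth, error_rate, alpha=1e-3):
    """Smallest read count k (and k/depth) that errors alone reach with
    probability <= alpha at one site, for one specific substitution."""
    # Assumption: a miscall is equally likely to be any of the 3 wrong bases.
    mean_err = depth * error_rate / 3.0
    k = 1
    while poisson_tail(mean_err, k) > alpha:
        k += 1
    return k, k / depth

if __name__ == "__main__":
    for depth in (10_000, 1_000_000):
        k, freq = min_detectable_frequency(depth, error_rate=1e-3)
        print(f"depth={depth}: need >= {k} variant reads (~{freq:.4%})")
```

Under these assumptions, a 1/1000 raw error rate at 1,000,000x leaves about 333 background reads for any given substitution, so the calling threshold sits above 0.03% no matter how deep you go. Extra depth tightens the noise around the background but does not lower the background itself.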

    I look forward to anyone else's take,
    Genly

    (Wasn't sure what subforum to put this post in, so feel free to suggest another.)

  • #2
    Someone may have more intelligent things to say and may suggest an informatics workaround, but I tend to agree that you are limited by the error rate of the technology.

    For example, say you have a 1/1000 error rate, you require three hits at a position to call a mutation (the one expected error read plus two more, say one read in each direction), and it's a heterozygous KRAS mutant. That means that with 1000 reads from a diploid tumor you have analyzed 500 cells, and with two variant reads required you have a sensitivity of 1 in 250 cells. Still not bad, but those are generous error rates, and I doubt anyone would waste much time using such low-stringency cutoffs, so all the talk of "deep sequencing" with no knowledge of error rates seems very naive.
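The arithmetic in that example can be sketched as follows, assuming every read samples one allele and every mutant cell is heterozygous. The function and parameter names are made up for illustration.

```python
def cell_sensitivity(reads, min_variant_reads, ploidy=2, mutant_alleles_per_cell=1):
    """Smallest detectable fraction of mutant cells, given a read budget.

    Assumes each read samples one allele, so `reads` alleles correspond to
    reads / ploidy cells, and each mutant cell contributes
    `mutant_alleles_per_cell` variant alleles.
    """
    cells_sampled = reads / ploidy
    mutant_cells_needed = min_variant_reads / mutant_alleles_per_cell
    return mutant_cells_needed / cells_sampled

if __name__ == "__main__":
    s = cell_sensitivity(reads=1000, min_variant_reads=2)
    print(f"sensitivity: 1 mutant cell in {1 / s:.0f}")
```

With 1000 reads and two variant reads required, this gives 2 mutant cells out of 500 sampled, i.e. 1 in 250.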

    Not sure if that helps as I pretty much agreed with you verbatim but I get tired of listening to people talk about "deep sequencing" who have not seen a sequencer, have not seen the raw data, and definitely don't know the error rates of the current machines...

    I look forward to seeing others' opinions too.


    • #3
      There's a possible wet lab workaround that might go some of the way. It would work best with sequencing PCR products. You can split your sample into several very small pools. If you have something very rare then it will be present in only a portion of them. You can then do your PCR and make your libraries with tags on. There are a few clever tricks around like DNA sudoku http://hannonlab.cshl.edu/dna_sudoku/main.html which will allow very high levels of multiplexing. Then when you do your sequencing, if your SNP is real as opposed to being a sequencing error, all the reads with that SNP will be concentrated in just a few pools.


      • #4
        I second henry.wood. If you sequence with very small pools to high coverage, it is in theory possible to find a SNP at the 0.01% frequency.


        • #5
          Glad to see that I'm not alone in caring about this. Like JK said, the idea of sequencing one locus to significant depth was always one of the selling points of NGS, and it's frustrating to not be able to access that.

          Henry and lh3, that idea makes sense. So you're saying, let's say I want to be able to detect something present at 0.01%, and my "normal" threshold for calling a SNP significant is 5%. I would make tiny libraries such that my expected number of amplifiable fragments containing a given site is just 20 for each library, so if the rare allele is present in a lib, we should sequence it 5% of the time that site is sequenced. To get decent odds of seeing something at 0.01%, I would need to make and sequence ballpark [100% / (0.01% * 20) =] 500 such libs (but really a few fold more than that). Does this correctly summarize what you were envisioning?
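A quick check of the library-count estimate, as a sketch: it assumes each amplifiable fragment independently carries the rare allele with the population frequency, and the helper names are hypothetical.

```python
def libraries_needed(variant_freq, molecules_per_lib):
    """Libraries required so that one variant molecule is expected overall."""
    return 1.0 / (variant_freq * molecules_per_lib)

def prob_variant_captured(variant_freq, molecules_per_lib, n_libs):
    """Probability that at least one library contains a variant molecule."""
    total_molecules = molecules_per_lib * n_libs
    return 1.0 - (1.0 - variant_freq) ** total_molecules

if __name__ == "__main__":
    freq, per_lib = 1e-4, 20  # 0.01% variant, 20 fragments per library
    n = round(libraries_needed(freq, per_lib))
    print(f"libraries for one expected variant molecule: {n}")
    for fold in (1, 2, 3):
        p = prob_variant_captured(freq, per_lib, n * fold)
        print(f"{n * fold} libraries: P(variant present somewhere) = {p:.2f}")
```

At 500 libraries the chance of having captured even one variant molecule is only about 63%, and it takes roughly three times as many to reach about 95%, which is why "a few fold more" is needed.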

          I'm a bit concerned about the DNA sudoku aspect here. Won't it only work as long as we have perfect sensitivity for our detectable event even once the libraries are multiplexed? So as soon as you pool a few libs, your SNP sensitivity for the pool is below the workable threshold. So it seems like we are going to need a very large number of libraries and sequencing pools to get this to work.

          Ok ... I started off enthusiastic there, and then convinced myself that it would be quite unwieldy. Am I thinking about this the wrong way? Let me know what you think.


          • #6
            One thing I hope can make a difference is bi-directional reads on Illumina.

            Being able to sequence forward and reverse over an amplicon should allow much higher Q scores to be called.

            As sequencing read errors are probably more frequent than incorporation errors, they would be greatly reduced by a bi-directional strategy. The major obstacles to ultra-low-frequency detection are PCR errors at initial amplification and in the initial/early cycles of cluster generation. However, these should be random and lower than the 0.1% or 0.01% you have mentioned.
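To illustrate why overlapping forward and reverse reads could raise the effective Q score, here is a sketch that assumes the two reads miscall independently and that a miscall is equally likely to be any of the three wrong bases. Real read errors are not fully independent, so treat the result as an optimistic bound. The function names are illustrative.

```python
from math import log10

def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * log10(p)

def paired_miscall_prob(q_forward, q_reverse):
    """Probability that both reads are wrong AND agree on the same wrong base."""
    # First read wrong, second read wrong, second matches the first (1 in 3).
    return phred_to_prob(q_forward) * phred_to_prob(q_reverse) / 3

if __name__ == "__main__":
    p = paired_miscall_prob(30, 30)
    print(f"Q30 + Q30 agreeing reads: p = {p:.2e}, ~Q{prob_to_phred(p):.0f}")
```

Note that this only addresses read errors. A PCR or cluster-generation error is present in the template itself, so both reads will agree on it.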


            • #7
              Originally posted by genlyai
              To get decent odds of seeing something at 0.01%, I would need to make and sequence ballpark [100% / (0.01% * 20) =] 500 such libs (but really a few fold more than that)
              That's the kind of thing I was envisaging. You never said you wanted a cheap and simple solution. I can't claim to have got to the end of the Sudoku paper without my head spinning slightly, so it may well not be what you're after. Good luck with all those libraries.


              • #8
                Originally posted by james hadfield
                One thing I hope can make a difference is bi-directional reads on Illumina.

                Being able to sequence in F&R over an amplicon should allow very much higher Qscores to be called.

                As sequencing read errors are probably higher than incorporation errors these would be greatly reduced following a bi-directional strategy. The major blocks for ultra-low detection are PCR errors at initial amplification and initial/early cycles of cluster generation. However these should be random and lower than the 0.1 or 0.01% you have mentioned.
                Good point, and there may be some data out there to address this already.

                On the other hand, my intuition is the opposite of yours with respect to the contribution of PCR errors. Taq is normally quoted as having an error rate of around 0.01% per base per cycle. After 10-12 cycles, this is in the ballpark of the error rate of the whole process.
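A back-of-the-envelope version of that estimate, as a sketch: it assumes a constant per-base, per-cycle error rate and, pessimistically, that every product molecule has been copied in every cycle. The function name is illustrative.

```python
def pcr_error_fraction(per_base_per_cycle, cycles):
    """Upper-bound fraction of product molecules with an error at a given base.

    Assumes each molecule's lineage was copied once in every cycle.
    """
    return 1.0 - (1.0 - per_base_per_cycle) ** cycles

if __name__ == "__main__":
    for cycles in (10, 12):
        frac = pcr_error_fraction(1e-4, cycles)  # 0.01% per base per cycle
        print(f"{cycles} cycles: ~{frac:.3%} of molecules mutated per base")
```

This comes out at roughly 0.1%, the same order as the 1/1000 raw sequencing error rate quoted at the top of the thread. With ideal doubling a typical molecule's lineage is copied in only about half of the cycles, so the true figure would probably be nearer half of this bound.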

                As I said, though, the data may well be out there to answer this without resorting to guesswork.


                • #9
                  Originally posted by henry.wood
                  That's the kind of thing I was envisaging. You never said you wanted a cheap and simple solution. I can't claim to have got to the end of the Sudoku paper without my head spinning slightly, so it may well not be what you're after. Good luck with all those libraries.
                  To be fair, the 5% detection threshold could probably be lowered for a well-run process, but we are still talking about 100+ libraries. Doable, but far from simple.


                  • #10
                    Originally posted by genlyai
                    On the other hand, my intuition is the opposite of yours wrt the contribution of PCR errors. Taq is normally quoted as having an error rate of around 0.01%/base*cyc. After 10-12 cyc, this is in the ballpark of the error rate of the whole process.

                    At least in the ChIP-seq library kits (the only ones I have hands-on experience with), the library prep PCR uses Phusion, which has a much lower error rate.


                    • #11
                      Originally posted by frozenlyse
                      At least in the ChIP-seq library kits (the only one I have hands on experience with) the library prep PCR uses Phusion, which has a much lower error rate
                      Good point. Is this the case with Illumina's reagent kits?

                      For that matter, is anyone aware of a study that tries to quantify error arising from each step of the process?


                      • #12
                        Nothing is impossible

                        I say not impossible, and not necessarily reliant on the error rate of sequencing, or of amplification. If you can make a sequencing library that is smart enough to overcome these obstacles, you can attain the currently unattainable. I realize I'm not telling you anything, but if I shared all my ideas maybe you wouldn't come up with different ones.


                        • #13
                          This paper should get the job done: http://www.pnas.org/content/108/23/9530.abstract


                          • #14
                            I would say most talk of the error rates of NGS grossly overemphasizes the problem. Yes, there is a conflict rate of >>1% when you compare plain read sequences, but as soon as you introduce any sort of quality filtering in your variant detection, the frequencies drop right down.

                            I agree, though, that the error rate of the process will determine how "deep" we can see things, but I would say 0.01% would not be unrealistic with some tweaks.

                            We have done some sequencing of PCR-amplified mtDNA, without doing anything as complex as the paper above, where we assembled at 100,000-200,000 fold coverage. Using just a medium-stringency call quality filter during SNP detection, you can see a shift in average variant frequencies from 0.04% to 0.15% in control mice vs mice expressing an error-prone DNA polymerase.

                            Obviously this does not in itself tell you a whole lot because we do not know how much of each value is error BUT it does demonstrate that even in a system as simple as this you can see some biology at this level of detection above that of the error rate of the overall process.


                            • #15
                              A similar paper came out from Sydney Brenner and friends around the same time as the Vogelstein PNAS paper cited above.

                              Nucleic Acids Res. 2011 Jul;39(12):e81. Epub 2011 Apr 13.
                              A method for counting PCR template molecules with application to next-generation sequencing.
                              Casbon JA, Osborne RJ, Brenner S, Lichtenstein CP.
                              Source
                              Population Genetics Technologies Ltd., Babraham Institute, Babraham, Cambridgeshire CB22 3AT, UK.

                              Abstract
                              Amplification by polymerase chain reaction is often used in the preparation of template DNA molecules for next-generation sequencing. Amplification increases the number of available molecules for sequencing but changes the representation of the template molecules in the amplified product and introduces random errors. Such changes in representation hinder applications requiring accurate quantification of template molecules, such as allele calling or estimation of microbial diversity. We present a simple method to count the number of template molecules using degenerate bases and show that it improves genotyping accuracy and removes noise from PCR amplification. This method can be easily added to existing DNA library preparation techniques and can improve the accuracy of variant calling.

                              PMID: 21490082 [PubMed - indexed for MEDLINE] PMCID: PMC3130290
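The general idea of counting template molecules with degenerate-base tags can be sketched as below. This is not the authors' implementation from either paper. It is a toy illustration that assumes each read at a site has already been reduced to a (tag, base) pair, and the parameters `min_family_size` and `min_agreement` are hypothetical.

```python
from collections import Counter, defaultdict

def consensus_by_tag(reads, min_family_size=3, min_agreement=0.9):
    """Collapse reads sharing a molecular tag into one consensus base each.

    reads: iterable of (tag, base) pairs for a single genomic position.
    Families that are too small or too discordant are dropped.
    """
    families = defaultdict(list)
    for tag, base in reads:
        families[tag].append(base)

    calls = {}
    for tag, bases in families.items():
        if len(bases) < min_family_size:
            continue  # too few reads to trust this template molecule
        base, count = Counter(bases).most_common(1)[0]
        if count / len(bases) >= min_agreement:
            calls[tag] = base
    return calls

def variant_fraction(calls, ref_base):
    """Fraction of counted template molecules whose consensus is non-reference."""
    if not calls:
        return 0.0
    return sum(1 for b in calls.values() if b != ref_base) / len(calls)

if __name__ == "__main__":
    reads = ([("AAC", "G")] * 10 + [("AAC", "T")]   # one read error in a G family
             + [("GTT", "T")] * 4                   # a genuine variant molecule
             + [("CCA", "G")] * 2)                  # family too small to call
    calls = consensus_by_tag(reads)
    print(calls, variant_fraction(calls, ref_base="G"))
```

Because an error introduced during sequencing or late PCR shows up in only a minority of reads within a tag family, it is voted out, while a variant present in the original template is carried by the whole family. Frequencies are then counted per template molecule rather than per read.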
