
  • #16
    Originally posted by robp View Post
    I also think an algorithm for dealing with this would give rise to a very interesting CS paper. I'd be willing to bet that changes in molecular speed affect the resulting signal in detectable ways, and that modifying the underlying HMM to account for this is possible. ONP base-calling definitely seems like an interesting computational problem.
    I agree - it seems plausible to address some of the purported deficiencies in the current Nanopore system through primarily computational means.

    As an unrelated side-note, Illumina's NextSeq systems - in my testing - give vastly inferior output compared to HiSeq or MiSeq (and the data was certified by Illumina as being in-spec). I believe this may largely be due to the software; improved base-calling software may be able to substantially improve the output of NextSeq, or other new platforms. That said, for a market-dominant company to release a new product that is undeniably inferior to prior products, indicates to me that sequencing companies have good reason to support alternatives, if they desire better data.



    • #17
      Originally posted by Brian Bushnell View Post
      As an unrelated side-note, Illumina's NextSeq systems - in my testing - give vastly inferior output compared to HiSeq or MiSeq (and the data was certified by Illumina as being in-spec). I believe this may largely be due to the software; improved base-calling software may be able to substantially improve the output of NextSeq, or other new platforms. That said, for a market-dominant company to release a new product that is undeniably inferior to prior products, indicates to me that sequencing companies have good reason to support alternatives, if they desire better data.
      That is an interesting observation.

      Once sequencing becomes a commodity, finer points of how one got the sequence become moot. I find it striking how parallel everything seems to be between microarrays in early 2000's and HTS now.

      An attractive price point (nicely slotted between the MiSeq and HiSeq) and a strong sales push help seal the deal on NextSeq in most places.



      • #18
        Originally posted by samanta View Post
        The situation seems to have improved somewhat after the company allowed Nick Loman to release his data (check our blog for link), and Michael Schatz posted his slides with the kind of information one needs to make decisions -

        At last we get the analysis of Oxford Nanopore data that we had been looking for since the first day. Michael Schatz posted the GI2014 slides of James Gurtowski from his lab on his website.
        Just for the record, you do not need permission from Oxford Nanopore to release data after you self-certify the "burn-in" and I did not seek it. One of the reasons I took until September to release anything is that there were teething troubles on the laboratory side getting sufficient yields of full 2D reads. Now that the protocol is sorted out, these full 2D yields are much better with the R7.3 chemistry, and this will become the standard for Illumina-style 'passing filter (PF)' reads for nanopore. Full 2D reads mean that the fragment chemistry is working, with a hairpin and hairpin motor successfully ligated. This controls the speed of the complement strand, which has a huge effect on accuracy.

        Some indications of full 2D performance can be seen in Figure 2 at:

        Background The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. By measuring the change in current produced when DNA strands translocate through and interact with a charged protein nanopore the device is able to deduce the underlying nucleotide sequence. Findings We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION Access Program (MAP). Sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Conclusions Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods. Datasets are provided through the GigaDB database


        And of course the data is fully available (including the underlying signal measurements).



        • #19
          Originally posted by nickloman View Post
          Just for the record, you do not need permission from Oxford Nanopore to release data after you self-certify the "burn-in" and I did not seek it. One of the reasons I took until September to release anything is that there were teething troubles on the laboratory side getting sufficient yields of full 2D reads. Now that the protocol is sorted out, these full 2D yields are much better with the R7.3 chemistry, and this will become the standard for Illumina-style 'passing filter (PF)' reads for nanopore. Full 2D reads mean that the fragment chemistry is working, with a hairpin and hairpin motor successfully ligated. This controls the speed of the complement strand, which has a huge effect on accuracy.

          Some indications of full 2D performance can be seen in Figure 2 at:

          Background The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. By measuring the change in current produced when DNA strands translocate through and interact with a charged protein nanopore the device is able to deduce the underlying nucleotide sequence. Findings We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION Access Program (MAP). Sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Conclusions Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods. Datasets are provided through the GigaDB database


          And of course the data is fully available (including the underlying signal measurements).
          Hi Nick! First, thanks for getting the manuscript out there as a pre-print. Also, thanks for making the data available in all its glory. I am curious if you know whether ONT's basecaller is publicly available, or if it's currently proprietary software. I'm interested in learning more about how it works, but apart from the fact that it "uses an HMM" and the reference to the Timp paper from a couple of years ago, there doesn't seem to be much in the way of details.



          • #20
            Originally posted by robp View Post
            Hi Nick! First, thanks for getting the manuscript out there as a pre-print. Also, thanks for making the data available in all its glory. I am curious if you know whether ONT's basecaller is publicly available, or if it's currently proprietary software. I'm interested in learning more about how it works, but apart from the fact that it "uses an HMM" and the reference to the Timp paper from a couple of years ago, there doesn't seem to be much in the way of details.
            Hi robp-- Sadly the base caller is proprietary software and I am not aware of any documentation about how it works. It would be great if someone hot on HMMs and the Viterbi algorithm could try and implement a reference open-source base caller to serve as a foundation for improvements. Some more details about how the nanopore base caller works might be gleaned from the FAST5 files.
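As a possible starting point for such a reference caller, here is a minimal sketch of Viterbi decoding for an HMM with Gaussian emissions. It is not ONT's method, which is undocumented. The function names, the single Gaussian per state, and the toy two-state parameters are all assumptions made for illustration.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(observations, states, log_start, log_trans, emission):
    """Most likely state path for a non-empty list of observations.

    log_start[s] and log_trans[p][s] are log-probabilities;
    emission[s] is a hypothetical (mean, sd) pair for state s.
    """
    score = {s: log_start[s] + gaussian_logpdf(observations[0], *emission[s])
             for s in states}
    back = []  # one dict of back-pointers per observation after the first
    for x in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + log_trans[best_prev][s]
                            + gaussian_logpdf(x, *emission[s]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Toy example: two states with well-separated current levels (made-up numbers).
states = ["A", "B"]
log_half = math.log(0.5)
log_start = {s: log_half for s in states}
log_trans = {p: {s: log_half for s in states} for p in states}
emission = {"A": (0.0, 1.0), "B": (10.0, 1.0)}
path = viterbi([0.1, 9.8, 10.2, -0.3], states, log_start, log_trans, emission)
```

A real caller would swap the two toy states for k-mer states and learn the emission and transition parameters from data. The loop over all predecessor states is quadratic in the number of states, so a sparse transition structure would matter in practice.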



            • #21
              Originally posted by nickloman View Post
              Hi robp-- Sadly the base caller is proprietary software and I am not aware of any documentation about how it works. It would be great if someone hot on HMMs and the Viterbi algorithm could try and implement a reference open-source base caller to serve as a foundation for improvements. Some more details about how the nanopore base caller works might be gleaned from the FAST5 files.
              Yea, I agree. It's an interesting computational problem (I'm a computer scientist by trade), and I can think of at least a few ways an HMM-based base caller could be improved and a few other ways a potentially superior base caller using a different methodology could be built. I'm guessing there is already some magic they're doing, because looking through the log files in your data, I see things like:

              2014-08-20 23:27:46,393 Basecalling template data.
              2014-08-20 23:27:46,394 Selected model: "/opt/metrichor/model/r7/template_median41pA.model".
              and

              2014-08-20 23:27:59,091 Basecalling complement data.
              2014-08-20 23:27:59,092 Selected model: "/opt/metrichor/model/r7/complement_median41pA_pop2.model".
              which suggests that they have separately trained models to call the template and complement strand (and potentially, multiple models for each). Anyway, working with ONP basecalling is one of the potential final projects in my comp bio class, and I really hope at least one group of students picks it.



              • #22
                Originally posted by robp View Post
                Yea, I agree. It's an interesting computational problem (I'm a computer scientist by trade), and I can think of at least a few ways an HMM-based base caller could be improved and a few other ways a potentially superior base caller using a different methodology could be built. I'm guessing there is already some magic they're doing, because looking through the log files in your data, I see things like:

                2014-08-20 23:27:46,393 Basecalling template data.
                2014-08-20 23:27:46,394 Selected model: "/opt/metrichor/model/r7/template_median41pA.model".
                and

                2014-08-20 23:27:59,091 Basecalling complement data.
                2014-08-20 23:27:59,092 Selected model: "/opt/metrichor/model/r7/complement_median41pA_pop2.model".
                which suggests that they have separately trained models to call the template and complement strand (and potentially, multiple models for each). Anyway, working with ONP basecalling is one of the potential final projects in my comp bio class, and I really hope at least one group of students picks it.
                I think someone could just write a naive Gaussian mixture HMM caller based on the assumption that the hidden states are the 4^5 states representing all possible 5-mers, according to some blog posts describing their HMM.

                Do the model files have 4^5 states?
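To make the size of that state space concrete, here is a small sketch that enumerates the 4^5 = 1024 hidden states. The names `all_kmers`, `STATES` and `STATE_INDEX` are hypothetical, and nothing here reflects what the actual model files contain.

```python
from itertools import product

BASES = "ACGT"
K = 5  # k-mer length assumed from the blog posts mentioned above

def all_kmers(k=K, alphabet=BASES):
    """Return every k-mer over the alphabet in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

STATES = all_kmers()
# Map each k-mer to a row/column index for emission and transition tables.
STATE_INDEX = {kmer: i for i, kmer in enumerate(STATES)}
```

Each state would then need its own emission parameters (for example a mean and spread of the current level), which is the part that would have to be trained or read from a model file.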



                • #23
                  Originally posted by ymc View Post
                  I think someone could just write a naive Gaussian mixture HMM caller based on the assumption that the hidden states are the 4^5 states representing all possible 5-mers, according to some blog posts describing their HMM.

                  Do the model files have 4^5 states?
                  Well, again, we don't really know because the software that does the base-calling is actually remote (on the cloud, I believe) and proprietary. So, we don't really know what's in the model files or how they were trained. I'd assume, however, that the model file would have all of the necessary start (maybe uniform/uninformative) and transition probs.



                  • #24
                    Also a naive 4^5 state model would throw out the redundant information of each 5-mer signal overlapping the previous basecalls



                    • #25
                      Originally posted by frozenlyse View Post
                      Also a naive 4^5 state model would throw out the redundant information of each 5-mer signal overlapping the previous basecalls
                      I'm not quite sure I understand the reasoning here. We could have a model with 4^5 states, but there is only a non-zero probability of transition between consistent k-mers. For example, the state 'AAAAA' would only have non-zero transition probabilities to {'AAAAA', 'AAAAC', 'AAAAG', 'AAAAT'} --- the model would then not be "allowed" to consider transitions to other, un-connected 5-mers. Are we talking about different things here?
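The restricted transition structure described here can be sketched as follows. The helper names `successors` and `is_consistent` are made up for this example.

```python
BASES = "ACGT"

def successors(kmer, alphabet=BASES):
    """k-mers reachable by shifting one base: drop the first base, append a new one."""
    return [kmer[1:] + b for b in alphabet]

def is_consistent(prev, nxt):
    """True if nxt overlaps prev by k-1 bases, i.e. is a one-base shift of prev."""
    return prev[1:] == nxt[:-1]

aaaaa_next = successors("AAAAA")
```

With this structure each of the 1024 states has only four outgoing transitions, so the transition table is very sparse.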



                      • #26
                        hah yeah don't mind me, mind was off on a tangent and haven't had coffee yet!



                        • #27
                          Originally posted by robp View Post
                          I'm not quite sure I understand the reasoning here. We could have a model with 4^5 states, but there is only a non-zero probability of transition between consistent k-mers. For example, the state 'AAAAA' would only have non-zero transition probabilities to {'AAAAA', 'AAAAC', 'AAAAG', 'AAAAT'} --- the model would then not be "allowed" to consider transitions to other, un-connected 5-mers. Are we talking about different things here?
                          Models containing zero-transition probabilities are not a good choice here, as they assume absolutes that are not warranted. Information from analog measurements translated to discrete scales is not proof of anything; in the analog system - or any unrestrained system - TATAT -> GCACC is, in fact, a possibly valid transition. What is the probability? Unknown, since the base-calling algorithm is secret. But assuming that AAAA* to AAA** is the only possible valid transition, from data using a secret base-caller, is foolish.

                          P.S. Don't get me wrong - I think we agree here.
                          Last edited by Brian Bushnell; 10-01-2014, 09:23 PM.
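One way to avoid hard zeros is to give every non-overlapping transition a small share of probability. The sketch below does this for a single row of the transition matrix. The `leak` value of 0.01 is an arbitrary illustrative number, not an estimate from data, and `transition_row` is a hypothetical helper.

```python
from itertools import product

BASES = "ACGT"

def transition_row(kmer, leak=0.01):
    """Transition probabilities out of `kmer`.

    (1 - leak) is split evenly over the four one-base-shift successors;
    `leak` is split evenly over every other k-mer, so nothing is exactly zero.
    """
    states = ["".join(p) for p in product(BASES, repeat=len(kmer))]
    favoured = {kmer[1:] + b for b in BASES}
    n_other = len(states) - len(favoured)
    row = {}
    for s in states:
        if s in favoured:
            row[s] = (1.0 - leak) / len(favoured)
        else:
            row[s] = leak / n_other
    return row

row = transition_row("TATAT")
```

Here `row["GCACC"]` is small but non-zero, while the four overlapping successors such as `"ATATG"` keep almost all of the mass. In a trained model these values would come from data instead of a fixed constant.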



                          • #28
                            Originally posted by Brian Bushnell View Post
                            Models containing zero-transition probabilities are not a good choice here, as they assume absolutes that are not warranted. Information from analog measurements translated to discrete scales is not proof of anything; in the analog system - or any unrestrained system - TATAT -> GCACC is, in fact, a possibly valid transition. What is the probability? Unknown, since the base-calling algorithm is secret. But assuming that AAAA* to AAA** is the only possible valid transition, from data using a secret base-caller, is foolish.

                            P.S. Don't get me wrong - I think we agree here.
                            Hi Brian,

                            I agree with you (i.e. that a zero probability would be a bad idea here). For example, there are almost certainly incidents of slippage in the molecule, speed-ups, slow-downs, etc. that would not be accounted for by a model that forces such transitions. If I actually pull out the data for the called states from one of the reads using poretools, I can see that in a ~3600 base-pair read, most of the transitions are of the expected form (i.e. the state at time i shares an overlap of 4 bases with the state at time i+1). However, there seem to be a few instances where the state shifts by 2 bases, and a handful of instances where the state remains the same. So I would assume their actual model has non-zero transitions at least for these common cases of skipping a base and stalling. However, I'm not sure whether it's a "full" model in which all transitions are possible with some non-zero probability.
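The kind of tally described above can be sketched like this, assuming the called states have already been extracted into a plain list of k-mer strings (the extraction step itself is not shown). The function names and the short example list are made up.

```python
from collections import Counter

def shift_between(prev, nxt):
    """Smallest s with prev[s:] == nxt[:k-s].

    0 means the state stayed the same, 1 is the expected one-base step,
    2 is a skipped base, and k means no overlap at all. A homopolymer
    step such as AAAAA -> AAAAA is counted as 0, so stays and true
    one-base moves cannot be told apart in that case.
    """
    k = len(prev)
    for s in range(k + 1):
        if prev[s:] == nxt[:k - s]:
            return s

def shift_histogram(states):
    """Count how often each shift size occurs between consecutive states."""
    return Counter(shift_between(a, b) for a, b in zip(states, states[1:]))

# Made-up example: one normal step, one stay, one skip, one normal step.
called = ["ACGTA", "CGTAC", "CGTAC", "TACGG", "ACGGT"]
hist = shift_histogram(called)
```

Normalising such counts over many reads would give a rough empirical estimate of the stay, step, and skip probabilities.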



                            • #29
                              Originally posted by Brian Bushnell View Post
                              I agree - it seems plausible to address some of the purported deficiencies in the current Nanopore system through primarily computational means.

                              As an unrelated side-note, Illumina's NextSeq systems - in my testing - give vastly inferior output compared to HiSeq or MiSeq (and the data was certified by Illumina as being in-spec). I believe this may largely be due to the software; improved base-calling software may be able to substantially improve the output of NextSeq, or other new platforms. That said, for a market-dominant company to release a new product that is undeniably inferior to prior products, indicates to me that sequencing companies have good reason to support alternatives, if they desire better data.
                              I don't think improved bioinformatics will address the issues with the nextseq chemistry. It's fundamentally flawed, and it's apparent that Illumina doesn't understand the chemistry, probably because they acquired it from Solexa as fully functioning.

                              Increased sterics, different electrostatics, dark and so blind base-calling, increased probability of mismatches compared to the original Solexa chemistry are basic issues introduced by the nextseq chemistry. It can only be downhill for the accuracy as a compromise for decreased hardware requirements.

                              What appears to be a relatively simple change is far from that and also makes comparison with the huge swathes of existing data problematic.



                              • #30
                                Originally posted by seqsense View Post
                                Increased sterics, different electrostatics, dark and so blind base-calling, increased probability of mismatches compared to the original Solexa chemistry are basic issues introduced by the nextseq chemistry.
                                I appreciate the concern of using a true binary representation of bases, namely the [0,0] one, but can you elaborate on what you mean by "increased sterics", "different electrostatics" and the basis for the belief that there is an "increased probability of mismatches"?
