SEQanswers

Old 09-28-2014, 07:04 AM   #21
robp
Member
 
Location: Stony Brook, NY

Join Date: Aug 2013
Posts: 13
Default

Quote:
Originally Posted by nickloman View Post
Hi robp-- Sadly the base caller is proprietary software and I am not aware of any documentation about how it works. It would be great if someone hot on HMMs and the Viterbi algorithm could try and implement a reference open-source base caller to serve as a foundation for improvements. Some more details about how the nanopore base caller works might be gleaned from the FAST5 files.
Yea, I agree. It's an interesting computational problem (I'm a computer scientist by trade), and I can think of at least a few ways an HMM-based base caller could be improved and a few other ways a potentially superior base caller using a different methodology could be built. I'm guessing there is already some magic they're doing, because looking through the log files in your data, I see things like:

Quote:
2014-08-20 23:27:46,393 Basecalling template data.
2014-08-20 23:27:46,394 Selected model: "/opt/metrichor/model/r7/template_median41pA.model".
and

Quote:
2014-08-20 23:27:59,091 Basecalling complement data.
2014-08-20 23:27:59,092 Selected model: "/opt/metrichor/model/r7/complement_median41pA_pop2.model".
which suggests that they have separately trained models to call the template and complement strand (and potentially, multiple models for each). Anyway, working with ONP basecalling is one of the potential final projects in my comp bio. class, and I really hope at least one group of students picks it.
Old 09-28-2014, 07:54 AM   #22
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Quote:
Originally Posted by robp View Post
Yea, I agree. It's an interesting computational problem (I'm a computer scientist by trade), and I can think of at least a few ways an HMM-based base caller could be improved and a few other ways a potentially superior base caller using a different methodology could be built. I'm guessing there is already some magic they're doing, because looking through the log files in your data, I see things like:


and



which suggests that they have separately trained models to call the template and complement strand (and potentially, multiple models for each). Anyway, working with ONP basecalling is one of the potential final projects in my comp bio. class, and I really hope at least one group of students picks it.
I think someone could just write a naive Gaussian mixture HMM caller, based on the assumption that the hidden states are the 4^5 states representing all possible 5-mers, as some blog posts describing their HMM suggest.

Do the model files have 4^5 states?
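
For what it's worth, a naive caller of this kind can be sketched as a Viterbi decode over k-mer states with one Gaussian emission per state. This is only a toy under stated assumptions and not ONT's method: the per-k-mer mean currents in `means`, the shared `sigma`, and the stay/step probabilities are all invented for illustration, and k = 2 is used so the state table stays small (the 4^5 case has the same shape).

```python
import itertools
import math

def kmers(k):
    """All 4^k k-mers over ACGT, in lexicographic order."""
    return ["".join(p) for p in itertools.product("ACGT", repeat=k)]

def log_gauss(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def viterbi(events, means, sigma, p_stay=0.1):
    """Most likely k-mer path for a list of event mean currents.

    Allowed moves (an assumption of this sketch): stay in the same k-mer,
    or shift by one base, with the remaining mass split over 4 successors.
    """
    states = list(means)
    log_stay = math.log(p_stay)
    log_step = math.log((1.0 - p_stay) / 4.0)
    # Uniform (uninformative) start distribution.
    score = {s: -math.log(len(states)) + log_gauss(events[0], means[s], sigma)
             for s in states}
    back = []
    for x in events[1:]:
        new_score, ptr = {}, {}
        for s in states:
            # Predecessors: s itself, or any k-mer whose suffix is s's prefix.
            cands = [(score[s] + log_stay, s)]
            for b in "ACGT":
                prev = b + s[:-1]
                cands.append((score[prev] + log_step, prev))
            best, arg = max(cands)
            new_score[s] = best + log_gauss(x, means[s], sigma)
            ptr[s] = arg
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(score, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

def path_to_sequence(path):
    """Collapse a k-mer path to bases; repeated states are treated as stays."""
    seq = path[0]
    for prev, cur in zip(path, path[1:]):
        if cur != prev:
            seq += cur[-1]
    return seq

# Hypothetical model: evenly spaced mean currents, one per 2-mer.
means = {km: 40.0 + 2.0 * i for i, km in enumerate(kmers(2))}
events = [means["AC"], means["CG"], means["GT"]]  # noise-free toy signal
path = viterbi(events, means, sigma=0.5)
called = path_to_sequence(path)
print(path, called)
```

One known weakness of this layout: a step within a homopolymer (AA to AA) looks identical to a stay, so the collapse step undercounts homopolymer runs.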
Old 09-28-2014, 08:14 AM   #23
robp
Member
 
Location: Stony Brook, NY

Join Date: Aug 2013
Posts: 13
Default

Quote:
Originally Posted by ymc View Post
I think someone could just write a naive Gaussian mixture HMM caller, based on the assumption that the hidden states are the 4^5 states representing all possible 5-mers, as some blog posts describing their HMM suggest.

Do the model files have 4^5 states?
Well, again, we don't really know because the software that does the base-calling is actually remote (on the cloud, I believe) and proprietary. So, we don't really know what's in the model files or how they were trained. I'd assume, however, that the model file would have all of the necessary start (maybe uniform/uninformative) and transition probs.
Old 10-01-2014, 07:00 PM   #24
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

Also a naive 4^5 state model would throw out the redundant information of each 5-mer signal overlapping the previous basecalls
Old 10-01-2014, 07:03 PM   #25
robp
Member
 
Location: Stony Brook, NY

Join Date: Aug 2013
Posts: 13
Default

Quote:
Originally Posted by frozenlyse View Post
Also a naive 4^5 state model would throw out the redundant information of each 5-mer signal overlapping the previous basecalls
I'm not quite sure I understand the reasoning here. We could have a model with 4^5 states, but there is only a non-zero probability of transition between consistent k-mers. For example, the state 'AAAAA' would only have non-zero transition probabilities to {'AAAAA', 'AAAAC', 'AAAAG', 'AAAAT'} --- the model would then not be "allowed" to consider transitions to other, un-connected 5-mers. Are we talking about different things here?
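
A minimal sketch of the constrained structure described here, assuming nothing about the real model beyond "4^5 states, transitions only between consistent 5-mers, plus staying put". The names `successors` and `allowed` are just illustrative.

```python
import itertools

BASES = "ACGT"

def successors(kmer, allow_stay=True):
    """K-mers reachable from `kmer` by shifting in one base (plus optional stay)."""
    nxt = {kmer[1:] + b for b in BASES}
    if allow_stay:
        nxt.add(kmer)
    return sorted(nxt)

# All 4^5 = 1024 hidden states and their permitted next states.
states = ["".join(p) for p in itertools.product(BASES, repeat=5)]
allowed = {s: successors(s) for s in states}

print(len(states))
print(allowed["AAAAA"])
print(allowed["ACGTA"])
```

For a homopolymer state such as 'AAAAA' the stay and one of the shifts coincide, which is why it has four successors while most states have five.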
Old 10-01-2014, 07:14 PM   #26
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

hah yeah don't mind me, mind was off on a tangent and haven't had coffee yet!
Old 10-01-2014, 08:23 PM   #27
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by robp View Post
I'm not quite sure I understand the reasoning here. We could have a model with 4^5 states, but there is only a non-zero probability of transition between consistent k-mers. For example, the state 'AAAAA' would only have non-zero transition probabilities to {'AAAAA', 'AAAAC', 'AAAAG', 'AAAAT'} --- the model would then not be "allowed" to consider transitions to other, un-connected 5-mers. Are we talking about different things here?
Models containing zero-transition probabilities are not a good choice here, as they assume absolutes that are not warranted. Information from analog measurements translated to discrete scales is not proof of anything; in the analog system - or any unrestrained system - TATAT -> GCACC is, in fact, a possibly valid transition. What is the probability? Unknown, since the base-calling algorithm is secret. But assuming that AAAA* to AAA** is the only possible valid transition, from data using a secret base-caller, is foolish.

P.S. Don't get me wrong - I think we agree here.

Last edited by Brian Bushnell; 10-01-2014 at 09:23 PM.
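
One way to avoid hard zeros, sketched under loud assumptions: give every "inconsistent" transition a small floor weight `eps` and renormalise. The value of `eps`, the equal weighting of the favoured transitions, and k = 3 (instead of 5, to keep the demo small) are all arbitrary choices for illustration, not anything known about the real base caller.

```python
import itertools

BASES = "ACGT"
K = 3  # small k for the demo; the thread discusses k = 5

STATES = ["".join(p) for p in itertools.product(BASES, repeat=K)]

def transition_row(src, eps=1e-6):
    """Return {dst: prob} for one source k-mer, with no exact zeros."""
    # Favoured destinations: stay, or shift by one base.
    favoured = {src[1:] + b for b in BASES} | {src}
    row = {dst: (1.0 if dst in favoured else eps) for dst in STATES}
    total = sum(row.values())
    return {dst: w / total for dst, w in row.items()}

row = transition_row("TAT")
print(row["ATA"], row["GCA"])  # consistent vs. "impossible" destination
```

With a floor like this, an unexpected jump is heavily penalised in the decode but never ruled out, which is the property argued for above.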
Old 10-01-2014, 09:45 PM   #28
robp
Member
 
Location: Stony Brook, NY

Join Date: Aug 2013
Posts: 13
Default

Quote:
Originally Posted by Brian Bushnell View Post
Models containing zero-transition probabilities are not a good choice here, as they assume absolutes that are not warranted. Information from analog measurements translated to discrete scales is not proof of anything; in the analog system - or any unrestrained system - TATAT -> GCACC is, in fact, a possibly valid transition. What is the probability? Unknown, since the base-calling algorithm is secret. But assuming that AAAA* to AAA** is the only possible valid transition, from data using a secret base-caller, is foolish.

P.S. Don't get me wrong - I think we agree here.
Hi Brian,

I agree with you (i.e. that a zero probability would be a bad idea here). For example, there are almost certainly, e.g. incidents of slippage in the molecule, speed-ups, slow-downs, etc. that would not be accounted for by a model that forces such transitions. If I actually pull out the data for the called-states from one of the reads using poretools, I can see that in a ~3600 basepair read, most of the transitions are of the expected form (i.e. the state at time i shares an overlap of 4 bases with the state at time i+1). However, there seem to be a few instances where the state shifts by 2 bases, and a handful of instances where the state remains the same. So I would assume their actual model has non-zero transitions at least for these common cases of skipping a base and stalling. However, I'm not sure if it's a "full" model in which all transitions are possible with some non-zero probability, or not.
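
The kind of tally described above can be sketched as follows, assuming the called states have already been extracted into a plain list of k-mer strings (the `called` list here is made up; how you get it out of the FAST5 file is not shown).

```python
from collections import Counter

def shift_between(prev, cur):
    """Smallest shift s such that prev[s:] matches the start of cur.

    s = 0 is a stay, s = 1 the expected step, s = 2 a skipped base,
    and s = k means the two k-mers share no consistent overlap.
    """
    k = len(prev)
    for s in range(k + 1):
        if prev[s:] == cur[: k - s]:
            return s

def shift_histogram(states):
    """Count shift sizes between consecutive called states."""
    return Counter(shift_between(a, b) for a, b in zip(states, states[1:]))

# Hypothetical called states for a short stretch of one read.
called = ["ACGTA", "CGTAC", "CGTAC", "TACGG", "ACGGT"]
hist = shift_histogram(called)
print(sorted(hist.items()))
```

Note the homopolymer ambiguity: 'AAAAA' followed by 'AAAAA' is counted as a stay even though it could equally be a one-base step.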
Old 10-04-2014, 01:32 AM   #29
seqsense
Junior Member
 
Location: Asia

Join Date: Feb 2014
Posts: 5
Default

Quote:
Originally Posted by Brian Bushnell View Post
I agree - it seems plausible to address some of the purported deficiencies in the current Nanopore system through primarily computational means.

As an unrelated side-note, Illumina's NextSeq systems - in my testing - give vastly inferior output compared to HiSeq or MiSeq (and the data was certified by Illumina as being in-spec). I believe this may largely be due to the software; improved base-calling software may be able to substantially improve the output of NextSeq, or other new platforms. That said, for a market-dominant company to release a new product that is undeniably inferior to prior products, indicates to me that sequencing companies have good reason to support alternatives, if they desire better data.
I don't think improved bioinformatics will address the issues with the nextseq chemistry. It's fundamentally flawed and it's apparent that Illumina doesn't understand the chemistry, probably because they acquired it from Solexa as fully functioning.

Increased sterics, different electrostatics, dark and so blind base-calling, increased probability of mismatches compared to the original Solexa chemistry are basic issues introduced by the nextseq chemistry. It can only be downhill for the accuracy as a compromise for decreased hardware requirements.

What appears to be a relatively simple change is far from that and also makes comparison with the huge swathes of existing data problematic.
Old 10-04-2014, 07:15 PM   #30
austinso
Member
 
Location: Bay area

Join Date: Jun 2012
Posts: 77
Default

Quote:
Originally Posted by seqsense View Post
Increased sterics, different electrostatics, dark and so blind base-calling, increased probability of mismatches compared to the original Solexa chemistry are basic issues introduced by the nextseq chemistry.
I appreciate the concern of using a true binary representation of bases, namely the [0,0] one, but can you elaborate on what you mean by "increased sterics", "different electrostatics" and the basis for the belief that there is an "increased probability of mismatches"?
Old 10-05-2014, 01:48 AM   #31
seqsense
Junior Member
 
Location: Asia

Join Date: Feb 2014
Posts: 5
Default

Quote:
Originally Posted by austinso View Post
I appreciate the concern of using a true binary representation of bases, namely the [0,0] one, but can you elaborate on what you mean by "increased sterics", "different electrostatics" and the basis for the belief that there is an "increased probability of mismatches"?

Different electrostatics - the structure of labeled nucleotides has changed and with that comes different electronic fields which can influence the physical behaviour of the bases, especially with flat, aromatic dyes that can interact with each other, e.g. pi-stacking, and with DNA in a number of ways, e.g. intercalation. This has introduced new, less understood biases and might significantly impact the chemistry of incorporation.

Increased sterics - so in the nextseq kit T and C have single fluorescent labels so unless the types of dyes have changed from the original this shouldn't change their individual incorporation chemistry but may change in relation to the new G and A.

However, A now has two fluorescent dyes and G has no fluorescent dye. This changes their spatial volume significantly with the former now larger and the latter smaller.

In the absence of competitive incorporation ie no A present, G will pair with T. I don't believe that should be news to anyone.

Hence, in the situation created by Illumina having no label on G and two labels on A the competition between A and G for incorporation with T has now been skewed due to steric hindrance toward misincorporation of G with T. It's now more difficult for A to pair with T because it's bigger and easier for G to mispair with T because it's smaller.

These changes have only corrupted the most valuable part of the Illumina system. The sequencing chemistry has been compromised, in the true meaning of the word, so the system can be made cheaper by removal of two lasers and the knock on cost savings with less informatics required.

I feel they've made a fatal error here because they don't understand what they were given by those who did.
Old 10-05-2014, 09:06 AM   #32
austinso
Member
 
Location: Bay area

Join Date: Jun 2012
Posts: 77
Default

Quote:
Originally Posted by seqsense View Post
Hence, in the situation created by Illumina having no label on G and two labels on A the competition between A and G for incorporation with T has now been skewed due to steric hindrance toward misincorporation of G with T. It's now more difficult for A to pair with T because it's bigger and easier for G to mispair with T because it's smaller.
I'm not necessarily disagreeing with your extrapolations from biophysics (as you peppered with "might" and "may"), but evidence of this should be readily apparent in the data, then...

Just curious from a "should I get v1 or wait until v2" perspective is all...
Old 10-05-2014, 06:44 PM   #33
seqsense
Junior Member
 
Location: Asia

Join Date: Feb 2014
Posts: 5
Default

Quote:
Originally Posted by austinso View Post
I'm not necessarily disagreeing with your extrapolations from biophysics (as you peppered with "might" and "may"), but evidence of this should be readily apparent in the data, then...

Just curious from a "should I get v1 or wait until v2" perspective is all...
There's no biophysics, it's just plain chemistry. It might fall out for resequencing but for de novo it's an issue because it's not predictable. There will be an increase in G:T pairing although invisible due to the lack of dye and it will occur toward the end of the ~150 base pair read length where the synthesised dsDNA terminally impedes incorporation due to structural complexity.

Besides, I wrote this in response to a comment that the nextseq results appear to be inferior to the original chemistry. I can't think of any other reasons why this should be other than the ones I have stated as they appear obvious to me to be the most likely causes.
Old 10-05-2014, 07:36 PM   #34
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,238
Default

I wonder if incorporation of G instead of A during sequencing with NextSeq as proposed by seqsense would explain the observation posted in this thread: http://seqanswers.com/forums/showthr...hlight=nextseq

Quote:
poly-G in NextSeq
________________________________________
Hi,
I just received NextSeq paired-end results (45 bp 1st read and 40 bp second read) and I noticed (using FastQC) that about 1-2% of the second read is poly-G. I know that G has no "colour" so it probably means that these spots are not detected in the paired run but what is the cause for that? Is it common to get this number of failing paired reads? Has anyone run into this before?
Thanks
By the way, the first read also contains poly-G but for very few reads.
Has anyone observed similar results?
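
For anyone wanting to check their own runs, a rough sketch of how a poly-G fraction could be tallied from read sequences. The 90% G cutoff (`min_frac`) and the example reads are arbitrary assumptions for illustration; this is not how FastQC reports it.

```python
def is_poly_g(seq, min_frac=0.9):
    """True if at least `min_frac` of the bases in `seq` are G."""
    seq = seq.upper()
    return bool(seq) and seq.count("G") / len(seq) >= min_frac

def poly_g_fraction(reads, min_frac=0.9):
    """Fraction of reads flagged as poly-G (0.0 for an empty input)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(is_poly_g(r, min_frac) for r in reads) / len(reads)

# Made-up reads: two are all G, two are ordinary sequence.
reads = ["G" * 40, "ACGTACGTAC", "GGGGGGGGGG", "TTTTGGGGCC"]
frac = poly_g_fraction(reads)
print(frac)
```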
Old 10-05-2014, 09:04 PM   #35
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

This isn't really the place to discuss this (though I did bring it up); perhaps it should be moved to the Illumina forum? But anyway, tomorrow I'll post some of my NextSeq graphs. They have a very badly skewed A/T ratio that gets worse toward the read end. I'm not sure why; I had assumed it was the base caller, but it could be the chemistry. The C/G ratio seems fine.
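
A skew like this can be looked at with a simple per-position tally. The sketch below assumes the reads are already in memory as plain strings (the example reads are made up) and just reports the A/T count ratio at each cycle.

```python
from collections import Counter

def at_ratio_by_position(reads):
    """A/T count ratio at each read position (nan where no T was seen)."""
    counts = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return [c["A"] / c["T"] if c["T"] else float("nan") for c in counts]

# Made-up reads: balanced at position 0, A-heavy at position 1.
reads = ["AT", "TA", "AA", "TA"]
ratios = at_ratio_by_position(reads)
print(ratios)
```

A ratio that drifts away from the genome's expected value as the position increases would match the pattern described above.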
Old 10-06-2014, 04:12 PM   #36
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I started a thread here containing an analysis of some of our NextSeq data, with HiSeq 2000 for comparison.
Old 10-06-2014, 10:10 PM   #37
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Hi MinION users, do you need to treat the samples with DNase to do viral sequencing? I learned it was necessary to do so with Illumina machines.

Thanks in advance for your reply
Old 10-07-2014, 05:19 PM   #38
lkral
Member
 
Location: Carrollton, GA

Join Date: May 2011
Posts: 27
Default

If anyone in the MAP can talk about this, how does one prepare samples for MinION cDNA reads? Is it possible to get the entire length of the transcript sequenced or does the cDNA have to be fragmented during the sample prep? Thanks.
Old 10-10-2014, 04:58 PM   #39
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Do you guys think ONT can replace Illumina in the quantitation space? I think Illumina's fixed length reads probably is more suitable for quantitation, right?
Old 10-10-2014, 05:31 PM   #40
robp
Member
 
Location: Stony Brook, NY

Join Date: Aug 2013
Posts: 13
Default

Hi ymc,

Certainly in the short term, I don't think that long read technology will replace Illumina-style technology for quantification. The problems tackled by long reads are different. It may be great (once we can get the accuracy up) for isoform resolution, but the sheer number of reads is currently too small to be useful for many forms of quantification.