SEQanswers

Old 05-04-2017, 03:00 AM   #1
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
ONT PromethION for single cell human WGS?

Hi @all,
we are currently struggling with the pros and cons of buying a PacBio Sequel vs an ONT PromethION for our analyses (human de novo single cell WGS). Ideally we are looking for something with which we can analyse structural variants and SNPs (so something with ~100 Gb+ (>30x) output).
Unfortunately, we're getting extremely diverse information about the specifications of the two systems. Hence, I'd like to ask users of each system about their experience (I have therefore posted a similar question in the PacBio forum).

My information on the promethion so far:
1) Accuracy of >95%
Is this a consensus accuracy from assembly/mapping? As that would be coverage dependent, it is meaningless for me... What about individual read (1D or 2D) accuracy? I have looked briefly into the available data on GitHub, which looks more like 50-70% accuracy for single reads (but I might just have been unlucky with my selected genomic regions...)

2) I got information of 3-11 Gb per flow cell at ~$900 per flow cell
So I would need 9-33 flow cells per genome => ~$8k-30k per genome?

3) Price ~150k
Is installation, training, etc. included?

4) New developments
I'm a little concerned by the rapid release of new protocols and chemistries. I understand that this is a system under development, but how big are the differences between releases? It seems to me like every tiny improvement triggers a new release.
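The back-of-envelope arithmetic in point 2 can be sketched directly. This is a rough check only: the 3-11 Gb yield range is the one quoted above, the ~900 flow cell price is assumed to be USD, and failed runs are not budgeted for.

```python
def flow_cells_needed(target_gb, yield_gb_per_cell):
    """Flow cells for a target yield, rounded up (ceiling division)."""
    return -(-target_gb // yield_gb_per_cell)

TARGET_GB = 100        # ~30x of a human genome
COST_PER_CELL = 900    # USD per flow cell (assumed, as quoted above)

best = flow_cells_needed(TARGET_GB, 11)   # optimistic 11 Gb/flow cell
worst = flow_cells_needed(TARGET_GB, 3)   # pessimistic 3 Gb/flow cell

print(best, worst)                                  # 10 34
print(best * COST_PER_CELL, worst * COST_PER_CELL)  # 9000 30600
```

Rounding up rather than dividing exactly gives 10-34 flow cells instead of 9-33, since a partial flow cell still costs full price.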

I would be grateful for any information on any of my four points.
As the MinION and PromethION use the same pore and the same chemistry, information on either would be appreciated.

Thanks!
Old 05-04-2017, 03:48 AM   #2
nucacidhunter
Senior Member
 
Location: Iran

Join Date: Jan 2013
Posts: 1,080

The first thing to consider is the need to amplify the single-cell genome: currently, even the best methods yield amplified DNA of only around 10 kb, which limits the benefit of obtaining long reads on any platform. Unfortunately, the methods that output the longest DNA are prone to chimera formation, which will affect detection of structural variants.

1- PromethION is in early-access (α) release with limited production capacity (it cannot run 48 flow cells yet), and they only shipped flow cells a couple of weeks ago. I have not seen any non-ONT released data. The accuracy mentioned is for the best runs, not the average run, and they are abandoning 2D reads due to litigation.

2- There are different plans for consumables, and the more one buys the cheaper it gets, but the lowest price point requires a purchase of over a million dollars. The length limitation of the input DNA also means lower sequencing output.

3- To my knowledge, installation cost is additional and can be significant depending on location. The system also requires purchasing, or having access to, substantial high-speed computing hardware.

4- This is a disadvantage: just as one gets the hang of working with one chemistry, another one comes out, resulting in inconsistent and non-comparable data.
Old 05-04-2017, 06:30 AM   #3
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215

Thanks @nucacidhunter for your input.

I also thought that the length of our DNA fragments would be a major limitation, but I have actually heard that it probably isn't. This is because it makes no difference to the lifetime of the pore whether you sequence 1x 10 kb or 10x 1 kb fragments. The "only" thing that requires adjustment is the amount of added adapter (containing the ssDNA leader sequence). Do you have other information?
Old 05-06-2017, 06:34 PM   #4
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

You're probably better off getting a GridION X5 if you only want to do a few human genomes. It'll be available in a couple of weeks (but will probably have a substantial shipping delay due to demand exceeding supply), and uses identical flow cells to the MinION (which are fairly reliable and well-tested now).

Output for a properly-prepared sample is currently 5-15Gb, so you'll need about 10 flow cells for a 10X coverage project (consumable cost $300/$500 USD per flow cell depending on capital expenditure).

However, you should have another think about what you want to do, and whether the coverage is overkill. A full-genome structural analysis (looking for chromosome-scale modifications) can be done at 1-3X coverage on a single flow cell. An exome sequencing experiment looking for single-nucleotide variants can also be done on a single flow cell at 10X~30X coverage depending on how an "exome" is defined.
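The coverage figures above are easy to sanity-check by dividing run yield by genome size (the 3.1 Gb human genome size is my assumption here):

```python
def depth(yield_gb, genome_size_gb=3.1):
    """Approximate sequencing depth from run yield; 3.1 Gb human genome assumed."""
    return yield_gb / genome_size_gb

# One flow cell at the 5-15 Gb quoted above:
print(round(depth(5), 1), round(depth(15), 1))   # 1.6 4.8
```

So a single flow cell at current yields does land in the low-single-digit coverage range quoted for chromosome-scale structural analysis.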

I'd recommend doing a pilot run on a MinION first ($1000 per run, no substantial delay in ordering MinIONs or flow cells / reagents), because a single MinION run might be sufficient for your needs.

Old 05-06-2017, 06:56 PM   #5
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Quote:
Originally Posted by WhatsOEver View Post
1) Accuracy of >95%
Is this a consensus acc from assembly/mapping? As this would be coverage dependent, it is meaningless for me... What about individual reads (1D or 2D) accuracy? I have looked briefly into the available data on github which looks more like 50-70% acc for single reads (but I might just have been unlucky with my selected genomic regions...)
Single read modal accuracy for nanopore reads is about 85-90% at the moment; this will increase in the future through software improvements, and reads can be re-called at a later date to incorporate new accuracy improvements. As far as I'm aware, the electrical sensor on the MinION flow cells is the same as the one that was used when the MinION was first released (and giving accuracies of 70%) -- all accuracy improvements have been software and chemistry changes (mostly software).

But if you're doing 10X coverage on known variants, the single read accuracy isn't all that important. There is a bit of systematic bias in the calling (particularly around long homopolymers), which means perfect single-base calling is not possible even at infinite coverage. From an unmethylated whole-genome amplification, consensus calling accuracy for known variants at single nucleotides should be at least q30, and essentially perfect for structural variants (assuming they are covered at all).
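A toy binomial model illustrates why per-read accuracy matters less at depth. It assumes independent errors, so it deliberately ignores the systematic bias described above; it also counts any site where most reads are wrong as a consensus error, which overestimates the error, since wrong reads rarely agree on the same base:

```python
from math import comb

def consensus_error(per_read_error, depth):
    """P(a simple majority vote is wrong at one site), assuming independent
    errors; a simplified toy model, not a real variant-calling error rate."""
    k_min = depth // 2 + 1
    return sum(comb(depth, k)
               * per_read_error**k
               * (1 - per_read_error)**(depth - k)
               for k in range(k_min, depth + 1))

# ~88% single-read accuracy assumed (mid-range of current figures)
for d in (1, 5, 10, 20):
    print(d, f"{consensus_error(0.12, d):.2e}")
```

Even under these crude assumptions, the per-site error at 10x drops to roughly the q30 ballpark mentioned above.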

Old 05-08-2017, 03:25 AM   #6
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215

Thanks for the info @gringer esp on pricing and output.
I would be grateful if you could further comment on the following:

Quote:
Originally Posted by gringer
Single read modal accuracy for nanopore reads is about 85-90% at the moment;
[...]
A full-genome structural analysis (looking for chromosome-scale modifications) can be done at 1-3X coverage on a single flow cell.
Sorry, but based on the data I have seen (which is the NA12878 Human Reference on github produced with the latest(?) R9.4 chemistry), I cannot believe these statements.
For example, let's look at the following reference assemblies from ONT and Pacbio:

ONT mapping: [screenshot not preserved]

Pacbio mapping: [screenshot not preserved]

If you exclude indels from the accuracy calculation, you might be correct with 85%+...
Also, if you look into the literature (https://www.nature.com/articles/ncomms11307 <- yes, it is quite old, but the best I could find...) you get lower values. Maybe it is simply an issue with using bwa as the aligner, but if it isn't the best-performing one, why is the reference consortium using it?!
Concerning chromosomal rearrangements: support from 1-3 reads cannot, imo, truly be considered sufficient support for anything. With both methods you will get artefacts (see pictures) with largely clipped reads that couldn't be mapped elsewhere. On top of the high error rate, you will get numerous false positives. Because I'm working with amplified data, which is of course far from evenly amplified across the whole genome, I would also get numerous false negatives due to insufficient coverage at the calculated 1-3X.
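Whether indels are counted is exactly what moves these percentages around. A small sketch (hypothetical numbers, standard SAM conventions) derives both identity figures from a read's CIGAR string and NM tag:

```python
import re

def identities(cigar, nm):
    """Return (identity incl. indels, identity excl. indels) for one read.
    Assumes NM = mismatches + inserted + deleted bases (SAM convention);
    clipped bases are excluded, as they do not count towards NM."""
    ops = {op: 0 for op in "MIDNSHP=X"}
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        ops[op] += int(length)
    aligned = ops["M"] + ops["="] + ops["X"]   # aligned (match/mismatch) bases
    indels = ops["I"] + ops["D"]
    mismatches = nm - indels
    matches = aligned - mismatches
    return matches / (aligned + indels), matches / aligned

# Hypothetical read: 980 aligned bases, 60 mismatches, 20 inserted bases
incl, excl = identities("500M20I480M", nm=80)
print(f"{incl:.1%} including indels, {excl:.1%} excluding indels")
# 92.0% including indels, 93.9% excluding indels
```

The gap between the two numbers grows quickly as indels dominate the error profile, which is why quoted nanopore accuracies vary so much between sources.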


Quote:
Originally Posted by gringer
An exome sequencing experiment looking for single-nucleotide variants can also be done on a single flow cell at 10X~30X coverage depending on how an "exome" is defined.
True, but exome sequencing works very well with our Agilent/Illumina protocol, and I don't see a real advantage of long reads for WES.


Quote:
Originally Posted by gringer
But if you're doing 10X coverage on known variants, the single read accuracy isn't all that important.
For homozygous variants maybe, for heterozygous probably not: I have two alleles, and amplification probably does not preserve the 50:50 allele ratio. This means I may end up with only 2-4 reads for one allele and 6-8 for the other. With the high error rate, and the variant sitting on the less-amplified allele, I would rather tag such a position as uncertain/low support.
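That worry can be quantified with a binomial sketch. This assumes unbiased 50:50 allele sampling, so real amplification bias can only make the numbers worse:

```python
from math import comb

def p_allele_underrepresented(max_reads, depth, allele_frac=0.5):
    """P(one allele gets at most `max_reads` of `depth` reads),
    assuming unbiased binomial sampling (no amplification bias)."""
    return sum(comb(depth, k)
               * allele_frac**k
               * (1 - allele_frac)**(depth - k)
               for k in range(max_reads + 1))

# At 10x, one allele ends up with <= 3 reads about 17% of the time,
# even before any amplification bias:
print(round(p_allele_underrepresented(3, 10), 3))   # 0.172
```

So even the ideal case leaves a substantial fraction of heterozygous sites thinly covered on one allele at 10x, supporting the point above.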
Old 05-09-2017, 03:02 AM   #7
Ola
Member
 
Location: Sweden

Join Date: Aug 2011
Posts: 30

WhatsOEver, you are correct that PacBio at this point gives a much cleaner alignment, but of course with amplified DNA you will get lower throughput given the limited read numbers. For ONT, throughput should be more or less the same whether you have 5 kb or 25 kb fragments, but you would need to quality-filter the reads and use additional programs such as nanopolish to get good SNP calls. The new 1D^2 kit will improve single-read accuracy significantly, and the latest basecalling gives slightly lower error rates even on older datasets like the NA12878 data in your comparison.

PromethION is expected to give ~50 Gb per flow cell at launch, as each flow cell has more pores and possibly a longer run time compared to the MinION. Price depends on volume, starting at ~$650/flow cell for very large orders, which gives a significantly lower cost/Gb than current Sequel specs. The $135k for the PEAP includes reagents (60 flow cells and kits to run them).
Old 05-09-2017, 04:14 AM   #8
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

... I'm not quite sure what happened with my previous response about this....

Here's a plot of sequencing data from our recently published chimeric read paper, taking only the reads that mapped (with LAST) to Ifnb (removing complications about gene isoforms and exon/intron confusion):

[plot not preserved]

In general, the accuracy for nanopore will always be better than what can be found in any published papers. This is certainly the case for any sequencing done prior to May last year on R7.3 (and earlier) flow cells (2D consensus accuracy on R7.3 is similar to 1D single-read accuracy on R9.x), but it is also the case even for runs done this year. For example, homopolymer resolution was not implemented in standard basecallers until about a month ago, so any basecalled data prior to that (which includes my chimeric read paper, and the NA12878 data) is going to have homopolymer issues that are not present in current base-called data. However, the data can be re-called; I notice that the NA12878 github page has a "scrappie" recalling for chromosome 20, which will be closer to the current accuracy. I'll have a go at re-calling the Ifnb data that I used for the above graph and see how much of a difference it makes.

However, it's worth also considering that the nanopore NA12878 data was probably done on unamplified DNA (certainly the more recent runs were, as PCR doesn't work so well over 100kb). For this reason there will be errors in the base-called sequence that are due to unmodelled contexts in the DNA. The PacBio system can only sequence what is modelled (more strictly, it produces a sequence that indicates what DNA bases the true sequence will hybridise to), so almost all of the modifications would be lost. Accuracy is currently higher when using amplified DNA on a nanopore, but this removes the possibility of re-calling in the future when different calling models have been developed to incorporate base modifications.

In any case, it's almost impossible to get anyone to change their view about the utility of the MinION because it's a highly-disruptive technology -- it changes almost everything about how sequencing is done. People will cling to whatever shreds of dissent they can find about nanopore sequencing, and fluidly shift onto anything else that can be found once the issues start disappearing (without a mention of the progress). The homopolymer straws are almost all gone, and the accuracy straws are looking pretty thin. What remains are mostly prejudices against change.

I'm very definitely an Oxford Nanopore fanboy, and recommend spending a little change on a big change. It's not much of a pocket-burn to spend $2000 USD to purchase the MinION and run a few flow cells as a pilot study to work out if it will be suitable for a given purpose. The thing is basically capital-free, and there's no obligation to continue with using it if it turns out to be useless.
Old 05-09-2017, 08:44 AM   #9
seq_bio
Junior Member
 
Location: UK

Join Date: May 2017
Posts: 4

Thanks. I think your point about it being impossible to change people's minds because the tech is disruptive is a tad pessimistic. I think it's natural (even if not always right) to be a bit skeptical about new tech, considering that over-promising and under-delivering is common among NGS companies. The new tech will be embraced once people realize what it does for them.

I think you are suggesting that we use it for projects where you say it works: low-coverage structural variants, methylation, etc. And the low cost appears to be a key part of your argument.
But isn't the right approach to just put out there a convincing enough study? Say, an assembly from purely nanopore reads with high consensus accuracy and long enough reads; then people will be more than keen to move over, as it's cheap, has high throughput, and is accurate. I don't understand your point about accuracy being a thin straw.

https://genomeinformatics.github.io/cliveome/
96.5% consensus accuracy is not really there yet, right? Or 99% after corrections etc.? What am I missing, apart from being one of those impossible to convince?

I also have another question which you might perhaps be able to answer: throughput on the MinION has been improved by moving the DNA faster through the pore, correct? (One would assume this, if anything, makes accuracy worse.) And basecalling improvements come primarily from software: neural networks, machine learning, etc.

So in a way it's like software-defined sequencing, as the hardware is really cheap. Then one would imagine that this algorithm, trained on the platinum reference genome(s), would tell us about new regions.

Is it then likely that these will be different from, say, PacBio's, even though both have high consensus accuracy? Is that a possible outcome? I guess we don't have a study anywhere with a like-for-like comparison where both have 99.99% consensus accuracy?
Old 05-09-2017, 12:15 PM   #10
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Quote:
But isn't the right approach to just put out there a convincing enough study?
Studies exist. Whether or not they're "convincing enough" is entirely up to the readers.

Quote:
I don't understand your point about accuracy being a thin straw.
Accuracy is a thin straw because it is almost entirely a software problem. It's being worked on, and any software improvements can largely be fed back to any R9 run (i.e. since about July last year) to substantially improve accuracy. The biggest issue is that it's really only ONT who is working on base-calling algorithms. I expect a few people who have been studying neural networks and wavelets for the better part of their lives would make light work of the base calling accuracy problem.

Quote:
96.5% consensus accuracy is not really there right or 99% after corrections etc?
Yes, that's correct. The primary goal of the human assemblies is to generate assemblies that are as accurately contiguous as possible, rather than getting high accuracy at the single-base level. Long reads are a huge advantage there, particularly in the hundreds-of-kilobases range.

The current ONT basecalling is trained mostly on bacteriophage lambda and E. coli, which have much simpler unmodelled DNA context. For ONT to be able to correctly call human genomic sequence, they need to add all the possible DNA base modifications into their calling model, and that's going to take quite a long time. Until then, single-base consensus accuracy will be lower than expected even at infinite coverage. It may be that the majority of the systematic [non-homopolymer] base-calling error is associated with modified bases, but we're not going to know that until a sufficiently complete model exists.

Quote:
throughput on the minion has been improved by moving the dna faster through the pore correct - (one would assume this if anything makes accuracy worse
There have only been a couple of shifts in base calling speed, but at each step the accuracy was at least as good as it was for slower speeds. ONT made sure of this in their internal tests of base calling, and they have given me a plausible explanation for why accuracy might improve with sequencing speed. The explanation is that the DNA wants to move a lot faster, so a lot of effort is put in on the chemistry side of things to slow everything down, and there's much more chance for DNA to wiggle around while it's being ratcheted through at slower speeds. Move the DNA faster and the transitions between bases become more obvious because there's less movement/noise at each step.

The base transition speed has remained at 450 bases per second for at least the last 6 months, but flow cell yield has increased about 5 times since then. The majority of those yield fixes have been in software, and primarily around recognising when a pore is blocked and making sure the current at that specific pore is reversed quickly to unblock before the blockage becomes permanent. There have been some issues with sequencing kits as well due to reagents drying out in the freezer. It seems strange to think that yield improvements have been realised mostly by software fixes, but that is actually the case.
Old 05-09-2017, 01:00 PM   #11
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Quote:
Originally Posted by seq_bio View Post
Is it then likely that they will be different to those from say Pacbio ? even though both have a high consensus accuracy ? Is that a possible outcome ? I guess we don't have a study anywhere with a like for like comparison where both have 99.99% consensus accuracy ?
I could imagine that there will be some point in the future where both PacBio and Nanopore have perfect consensus accuracy (at, say, 20X coverage). In this case, the distinction will be in the things unrelated to accuracy, which is where (from my perspective) Nanopore wins on all counts.

Because PacBio is inherently a model-based approach to sequencing (i.e. by hybridisation), it's impossible to use it to detect things that we don't know about yet. Was that delay in hybridisation time due to a sugar, a deamination, or a methylation? How would PacBio detect abasic sites in a DNA sequence? What about pyrimidine dimers? I can imagine a situation where PacBio might introduce different chemistries to detect these different situations, but those chemistries and models need to be there before the observation can be made. This is much less of an issue with Nanopore, because the electrical signal has a lot more information in it. As an easy example, abasic sites produce a very characteristic increase in current, which is used by the basecallers to detect the presence of a hairpin (which has abasic sites in its sequence).

Read length is another point where Nanopore is starting to push ahead. Even if PacBio had perfect accuracy, much longer reads will be needed to fully resolve the human genome. End-to-end mapped single-strand nanopore reads of over 800kb have been sequenced (and base-called), and double-strand reads of over a megabase of combined sequence have also been seen. Clive Brown has suggested that photobleaching might be an issue for really long PacBio reads. I don't know if that's true for PacBio, but I do know that it is a common issue for Illumina sequencing, requiring an intensity ramp over the course of a sequencing run. PacBio could probably work around that issue by continually replenishing fluorophores, but at a substantially increased expense.

The other advantage for Nanopore is speed. Just considering read transition time, a 450kb nanopore read with an average base transition time of 450 bases per second would take 1000 seconds (about 17 minutes) to go through the pore. After the motor protein / polymerase is ejected (which can take less than 1/10 of a second), the pore is ready for the next sequence. If all someone were looking for was a read over 100kb, they could run the MinION for less than 10 minutes and have a good chance of finding one (assuming the sample prep was up to the job). Whole contig bacterial assemblies from simple mixes can be sequenced and assembled in about an hour. There are people who have done simulations of diagnostic situations (e.g. Tb detection, antibiotic resistance) with "presence" results produced in less than half an hour, and "absence" results (alongside low-abundance positive controls) established with high likelihood in a few hours. A whole bunch of other things change from impossible to possible when the sequencing speed is considered (for individual reads, for analysis/turnaround time, and for sample preparation time).
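The turnaround arithmetic above reduces to a single division, using the quoted 450 bases-per-second transition speed:

```python
def pore_seconds(read_len_bases, bases_per_second=450):
    """Time one read occupies a pore at a fixed transition speed."""
    return read_len_bases / bases_per_second

print(pore_seconds(450_000))        # 1000.0 seconds, about 17 minutes
print(pore_seconds(100_000) / 60)   # a 100 kb read: under 4 minutes
```

This ignores pore idle time and library loading, so it is a lower bound on wall-clock time rather than an expected run time.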

.... There's always more, but I'll stop there, reiterating my previous point about the disruptive nature of this. Nanopore sequencing changes almost everything about how sequencing is done.
Old 05-09-2017, 02:50 PM   #12
seq_bio
Junior Member
 
Location: UK

Join Date: May 2017
Posts: 4

Thanks for that; it answered a few questions I had. Especially the bit about yield improvements coming from software alone was interesting, making me think of it in terms of software-defined sequencing, similar to software-defined networking (SDN). In terms of your comment on accuracy:

"The primary goal of the human assemblies is to generate assemblies that are as accurately contiguous as possible rather than getting high accuracy at the single base level. Long reads are a huge advantage there, particularly in the hundreds of kilobases range."

If we are looking to create de novo assemblies of high quality, you can't really do that with errors at the single-base level, right? One can certainly make the case that nanopore data can be used to improve existing platinum genomes while tolerating a higher error threshold, as hybrid strategies may be sufficient for most applications.

But as of now, as a standalone system, it's not really ready for human WGS, correct? I think that's what WhatsOEver was interested in, if I understood correctly.
Old 05-09-2017, 08:33 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

For some reason a lot of posts in this thread are getting moderated... I'm not really sure why, but as a result, I'm not sure everyone is seeing everyone else's replies. I've unblocked the affected posts, so you may want to look back through the thread so you can be sure you are on the same page. I'll try to resolve this.

To add my two cents: personally, I like long Illumina reads because, as a developer, they are so much easier to work with. And they are so accurate! It's straightforward to make a good variant-caller for Illumina reads (well, let's restrict this to the HiSeq 2500 platform with 150bp reads, to be precise; some of their other platforms are much more problematic) in terms of SNPs, deletions, and short insertions.

However, if you want to call SVs or deal with repeats, the balance changes and you need long reads. We bought a PromethION, but I don't think it's installed yet. We also have a couple of Sequels, and recently transitioned production over to them because they now consistently meet or beat the RS II in quality and cost/yield metrics. But we mainly use PacBio for assembly, and for that purpose we aim for 100x coverage, which can achieve 99.99%+ accuracy for a haploid (nominally 99.999%, but I don't remember the exact numbers as measured). I'm not really sure what you would get at 30x for a repetitive diploid (though I should note that we also use low-depth PacBio for phasing variants called by Illumina on diploids).

High-error-rate long reads are great for variant phasing, particularly in conjunction with a second library of low-error-rate short reads. And they are great for SV/CNV calling, particularly since PacBio has a vastly lower coverage bias compared to Illumina (I'm unaware of the ONT bias rate, but I think it's similarly low). I'm less convinced about the utility of PacBio/ONT as a one-stop shop for all variant calls when cost (and thus depth) is a consideration. In particular, PacBio's error mode is not completely random, despite what is often stated, but it is random enough to make self-correction possible and useful for achieving high accuracy (again, given sufficient coverage). But for SNPs and short indels alone, without phasing, you can currently get better results for less money with an Illumina HiSeq 2500. For human sequencing (in which, unlike bacterial sequencing, the reagent costs outweigh the library-prep costs) it seems prudent to pursue a dual-library approach, with short and long reads on different platforms. In that case you don't need to pick a single platform that's optimal for everything.
Old 05-09-2017, 11:16 PM   #15
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Quote:
Originally Posted by seq_bio View Post
If we are looking to create de novo assemblies of high quality, you can't really do that with errors at the single base level right ?
That's correct. There's [currently] systematic error in ONT reads which means resolution with perfect single-base accuracy is not possible. Canu and Nanopolish do really well at fixing consensus errors, but they can't fix them to 100% accuracy. I'm optimistic that this can be dealt with in software (allowing existing runs to be re-called), but it's not there yet.

For de-novo assembly, nanopore works really well when it is used for initial scaffolding, and base call errors are cleaned up by mapping Illumina reads to the assembly and correcting based on those alignments. I used that approach for correcting mitochondrial sequence from a low-yield nanopore run done in July last year.

But, there is also substantial systematic error in Illumina sequencing. Illumina cannot properly resolve tandem repeat regions (such as in centromeric regions) where the repeat unit length is greater than the read length. I've got an example of this in my London Calling poster, where a 20kb repeat region was collapsed to 2kb in the current Illumina-based reference assembly. Whether or not such errors are included in definitions of "high-quality assemblies" is up to the person making the definition.
Old 05-09-2017, 11:49 PM   #16
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Quote:
Originally Posted by Brian Bushnell View Post
for SNPs and short indels alone, without phasing, you can currently get better results for less money with Illumina HiSeq 2500.
This depends a lot on how you define cost. Nanopore is currently getting about 5-10 gigabases per 2-day run for experienced labs with careful sample prep, with some groups getting 10-15 Gb. Internal testing at ONT has higher yield, but that is less useful for people who want to know what they can achieve themselves. Assuming 10 Gb per run, that's $100 per gigabase using the most expensive flow cell option ($900 USD + $100 reagents), which I think is in the realm of HiSeq / MiSeq. With the cheapest bulk flow cell option ($500 USD + $100 reagents), it's $60 per gigabase, which is nipping at the toes of HiSeq and NextSeq at currently-available flow cell costs and yields. That ignores the advantage conferred by long reads, which is substantial when considering things like isoform detection for cDNA-Seq.
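The per-gigabase figures above come straight from dividing consumables cost by yield (prices and the 10 Gb yield assumption as quoted):

```python
def usd_per_gb(flow_cell_usd, reagents_usd, yield_gb):
    """Consumables cost per gigabase for one run."""
    return (flow_cell_usd + reagents_usd) / yield_gb

print(usd_per_gb(900, 100, 10))   # 100.0 (most expensive flow cell option)
print(usd_per_gb(500, 100, 10))   # 60.0 (cheapest bulk option)
```

The same function makes it easy to see how sensitive the comparison is to yield: a 15 Gb run at bulk pricing drops the cost to $40/Gb.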

However, what I have found that most labs care about (certainly small labs) is the minimum cost of sequencing. A MinION purchase gives you a couple of flow cells to play around with, which means that the sequencing cost is effectively capital-free. Factoring in additional reagents and training time, an initial pilot study can be done with the MinION starting from nothing in a basic lab (with pipettes and centrifuges) for about $2000 USD, with delivery of MinION and flow cells happening within a couple of weeks. After that, it's no more than $1000 per run, with results that can be analysed within a few minutes of the run starting.
Old 05-10-2017, 12:21 AM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

I won't contest any of that, since you're better-informed with Nanopore sequencing costs than I am (so far, I think we got them all for free). However, I would still prefer short Illumina reads for SNP, deletion, or short insertion calling. But the OP's question is which platform would be ideal for all variant calling for minimal cost. I can't answer that directly, because I don't think that it is currently possible to do accurate variant-calling covering SNPs, indels, CNVs, and SVs, using a single platform. But I would suggest that Illumina should be part of the equation, for now.

Last edited by Brian Bushnell; 05-10-2017 at 12:24 AM.
Brian Bushnell is offline   Reply With Quote
Old 05-10-2017, 12:37 AM   #18
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Quote:
Originally Posted by gringer View Post
Studies exist. Whether or not they're "convincing enough" is entirely up to the readers.
Mhh, the discussion about ONT vs others seems to me a little like Apple vs Samsung: either you hate it or you love it.

Quote:
Originally Posted by gringer View Post
Accuracy is a thin straw because it is almost entirely a software problem.
Can't be, because this would imply that you already know that the signal differences between all bases with all modifications are large enough to distinguish them from the noise within the system, which, as you already mentioned, you don't. There might be a point where a new technical innovation is required to improve accuracy; a simple example would be the new pore ONT is now using. Besides that, I fully agree with your statements on software development and accuracy.

Quote:
Originally Posted by gringer View Post
The current ONT basecalling is trained mostly on bacteriophage lambda and E. coli, which have much simpler unmodelled DNA context. For ONT to be able to correctly call human genomic sequence, they need to add all the possible DNA base modifications into their calling model, and that's going to take quite a long time. Until then, single-base consensus accuracy will be lower than expected even at infinite coverage. It may be that the majority of the systematic [non-homopolymer] base-calling error is associated with modified bases, but we're not going to know that until a sufficiently complete model exists.
That is an extremely valuable piece of information for me - Thanks!


Quote:
Originally Posted by Brian Bushnell View Post
For human sequencing (in which, unlike bacterial sequencing, the reagent costs outweigh the library-prep costs) it seems like it might be prudent to pursue a dual-library approach, with short and long reads on different platforms. In that case you don't need to pick a single platform that's optimal for everything.
Why do you think so for human (or, more generally, diploid/polyploid) WGS? For human data, we are currently unable to do whole-genome reconstruction with short reads alone, even using the existing reference. If we create a scaffold of our genome of interest with long reads, we would still be unable to map the short reads accurately. As an example: how would a dual-platform approach help me resolve highly repetitive regions of the genome such as the MHC or the mucins?

Quote:
Originally Posted by seq_bio View Post
But as of now, as a standalone system, it's not really ready for human WGS, correct? I think that's what WhatsOEver was interested in, if I understood correctly.
True, and that is my conclusion from this discussion as well. We would probably be able to do CNV calling and RNA-Seq, but for identifying SNPs at the whole-genome level it is not ready yet. I think our next step must be a test run on the MinION, as suggested, to see how well our libraries are represented in the data.
WhatsOEver is offline   Reply With Quote
Old 05-10-2017, 12:44 AM   #19
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Quote:
Originally Posted by Brian Bushnell View Post
I won't contest any of that, since you're better-informed with Nanopore sequencing costs than I am (so far, I think we got them all for free). However, I would still prefer short Illumina reads for SNP, deletion, or short insertion calling. But the OP's question is which platform would be ideal for all variant calling for minimal cost. I can't answer that directly, because I don't think that it is currently possible to do accurate variant-calling covering SNPs, indels, CNVs, and SVs, using a single platform. But I would suggest that Illumina should be part of the equation, for now.
It's actually variant calling at the whole-genome level for minimal cost that is the critical part for me. I'm totally fine with our existing Illumina-Agilent-WES variant calling pipeline.
WhatsOEver is offline   Reply With Quote
Old 05-10-2017, 12:53 AM   #20
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799
Default

Quote:
Originally Posted by WhatsOEver View Post
Can't be, because this would imply that you already know that the signal differences between all bases with all modifications are large enough to distinguish them from the noise within the system, which you don't as you already mentioned.
Sensor noise is negligible in comparison to the shift from one base to another. All known base modifications produce a large current shift in the signal. Distinguishing between the two different pyrimidines (i.e. C/T) is probably one of the most difficult things at the moment, because their chemical structures are so similar.

But there's a whole lot of context that isn't included in the current models. The current basecallers typically only look at the absolute signal level, pay limited (if any) attention to the change in signal from the previous value(s), and don't account for base transition time (except for calling homopolymers). I had a look at event information a couple of years ago for a single read, and in spite of being overwhelmed by the amount of information there was, I found a lot of suggestions that base calling could be improved by looking beyond the single base that was sitting in the middle of the pore at the time the signal was read.
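As an illustration only (this is not how any ONT basecaller actually works), the kind of extra context described above — signal deltas and dwell times alongside absolute levels — can be sketched as a simple feature extraction over segmented events. The event values here are made up for the example:

```python
def event_features(events):
    """events: list of (mean_current_pA, dwell_time_s) tuples, one per
    segmented event. Returns per-event feature tuples of
    (absolute level, change from previous event, dwell time)."""
    feats = []
    prev_level = None
    for level, dwell in events:
        # Delta from the previous event captures the transition context
        # that a purely level-based model ignores.
        delta = 0.0 if prev_level is None else level - prev_level
        feats.append((level, delta, dwell))
        prev_level = level
    return feats

# Three hypothetical events (current in pA, dwell in seconds):
events = [(80.1, 0.004), (95.3, 0.002), (60.7, 0.009)]
for level, delta, dwell in event_features(events):
    print(level, round(delta, 1), dwell)
```

A real model would of course condition on a window of surrounding events (a k-mer context) rather than a single previous value, but the point stands: the raw signal carries more usable information than the absolute level alone.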
gringer is offline   Reply With Quote