SEQanswers

Old 07-14-2017, 04:43 AM   #81
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 246
Default

Quote:
Originally Posted by GenoMax View Post
@cement_head: See if this blog post helps.
Okay. Thanks - that was really helpful. We're tilting towards ALWAYS doing PE RNA-Seq and using UMIs. Doesn't solve every problem, but I think it reduces a lot of issues.
Old 07-14-2017, 05:09 AM   #82
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by GenoMax View Post
@cement_head: See if this blog post helps.
As usual, GenoMax has the perfectly appropriate link...

In my latest test, NovaSeq only had a 4-5% duplication rate. That's using our own NovaSeq data rather than external data. Overall not a huge problem, though it's certainly worth removing. I'm not sure why the number is lower than in my previous tests on external data, which indicated >12%; possibly the chemistry got better. (Edit - I should note that this run used lots of libraries from different organisms multiplexed together, which reduces the apparent duplication rate, but makes it more accurate. That should not be relevant to such a huge discrepancy, though.)

This run was extremely high quality (average 99.6% identity to the reference, or ~Q24) so duplicates were easy to detect. I'm really quite impressed with NovaSeq quality. It's unfortunate that there are only 4 quality scores, but CalcTrueQuality seems to do a good job of recalibrating them to the full range of 0-41, yielding a 0.04 average deviation from the correct quality, down from 1.1 on the raw data. 1.1 is still really good (better than the HiSeq 2500 I compared it to), but having only 4 quality scores makes many operations like trimming and merging less accurate. It's actually very impressive that NovaSeq managed, with 4 quality scores, to get better quality score accuracy than HiSeq 2500. I've drawn a couple of conclusions from this: 1) The HiSeq quality score algorithm is terrible. And 2) NovaSeq is calibrated for successful runs only and cannot produce correct quality scores if there are any anomalies (e.g., if there is a lighting failure producing no signal, it will still output really high quality scores even though all the data is wrong). With our previous unsuccessful run (there was a lighting failure), the average deviation from the correct quality was ~20 (2 orders of magnitude in error probability).
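
For readers checking the numbers: the ~Q24 figure follows from the standard Phred relation Q = -10*log10(error rate), and a deviation of 20 Phred units corresponds to a 100-fold difference in error probability. A minimal Python sketch of that conversion (the function names are made up for illustration and are not part of CalcTrueQuality or any other tool mentioned here):

```python
import math


def identity_to_phred(identity: float) -> float:
    """Convert a fractional identity (e.g. 0.996) to a Phred-scaled quality."""
    error_rate = 1.0 - identity
    if error_rate <= 0.0:
        raise ValueError("identity must be below 1.0 to give a finite Phred score")
    return -10.0 * math.log10(error_rate)


def phred_to_error(q: float) -> float:
    """Convert a Phred-scaled quality to an error probability."""
    return 10.0 ** (-q / 10.0)


# 99.6% identity is an error rate of 0.4%, i.e. roughly Q24.
print(round(identity_to_phred(0.996), 1))

# A 20-unit difference in Phred score is a 100-fold difference in error probability.
print(round(phred_to_error(0) / phred_to_error(20)))
```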

Last edited by Brian Bushnell; 07-14-2017 at 05:25 AM.
Old 07-14-2017, 05:45 AM   #83
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 246
Default

Slightly off-topic, but related: INDEX swapping on patterned flow cells...

https://sequencing.qcfail.com/articl...uddle-samples/
Old 07-14-2017, 06:47 AM   #84
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I calculated 8000 PPM of index swapping (cross-contamination) for our NovaSeq run with single indexes, and 120 PPM for dual indexes, when allowing zero barcode mismatches.
Old 07-14-2017, 07:14 AM   #85
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,030
Default

Quote:
In my latest test, NovaSeq only had a 4-5% duplication rate.
The important point is JGI probably made VERY GOOD quality libraries. With patterned FC's having clean libraries (with just the right sized inserts, zero primers and dimers) are critical to minimizing these issues. Since we are talking about "B"illions of reads losing some during dedupe should not cause a major loss. 2D barcoding seems essential (perhaps should be made mandatory).
Old 07-14-2017, 08:09 AM   #86
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by Brian Bushnell View Post
I calculated 8000 PPM of index swapping (cross-contamination) for our NovaSeq run with single indexes, and 120 PPM for dual indexes, when allowing zero barcode mismatches.
What went into that 8000 PPM (0.8%) calculation Brian? I mean, did you just count the number of swaps in a dual unique indexed run?

Anyone checked that figure for a HiSeq 2500 run? I know no one is complaining about index hopping on that instrument or a MiSeq, but it would happen at some rate.

--
Phillip
Old 07-14-2017, 08:14 AM   #87
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by GenoMax View Post
The important point is JGI probably made VERY GOOD quality libraries. With patterned FC's having clean libraries (with just the right sized inserts, zero primers and dimers) are critical to minimizing these issues. Since we are talking about "B"illions of reads losing some during dedupe should not cause a major loss. 2D barcoding seems essential (perhaps should be made mandatory).
From what I'm hearing, the NovaSeq doesn't have the major issues with amplicon lengths that the HiSeq4000 and X do. The NovaSeq is spec'ed to run 550bp no PCR DNA libraries, unlike the HiSeq patterned flowcell instruments.

--
Phillip
Old 07-14-2017, 08:30 AM   #88
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 246
Default

Quote:
Originally Posted by Brian Bushnell View Post
I calculated 8000 PPM of index swapping (cross-contamination) for our NovaSeq run with single indexes, and 120 PPM for dual indexes, when allowing zero barcode mismatches.
What is PPM?
Old 07-14-2017, 08:38 AM   #89
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by cement_head View Post
What is PPM?
Parts Per Million.

--
Phillip
Old 07-14-2017, 09:25 AM   #90
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by pmiguel View Post
What went into that 8000 PPM (0.8%) calculation Brian? I mean, did you just count the number of swaps in a dual unique indexed run?

Anyone checked that figure for a HiSeq 2500 run? I know no one is complaining about index hopping on that instrument or a MiSeq, but it would happen at some rate.

--
Phillip
The 8000 PPM was single-indexed. This was not an ideal test, but there were a few E.coli isolate libraries multiplexed with various other things (a lot of Chlamy, and various bacterial single-cells). Also, some were dual indexed and some were single-indexed, in the same run, and for whatever reason demultiplexing was done with only 6bp of the barcode for the single-indexed libraries rather than all 8 (allowing zero mismatches). So I'm not really sure what the rates would be in an ideal test environment. That said, for the reads that came out as this particular E.coli library, I concatenated all references for everything being sequenced together and ran:

Code:
seal.sh in=reads.fq stats=stats.txt ambig=toss clearzone=10
Everything hitting E.coli was considered correct, and everything hitting anything else was considered contamination. For the dual-indexed test I used a P.heparinus single-cell library with similar methodology.
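
The arithmetic behind a PPM figure like this can be sketched as follows. This is only an illustration in Python: the read counts are invented, the function name is hypothetical, and it does not parse the actual stats file written by seal.sh. Reads discarded as ambiguous are assumed to be absent from the counts.

```python
from typing import Mapping


def contamination_ppm(hits_by_reference: Mapping[str, int], expected: str) -> float:
    """Return off-target reads per million assigned reads.

    hits_by_reference maps a reference name to the number of reads assigned to it;
    expected is the reference the library should have come from.
    """
    total = sum(hits_by_reference.values())
    if total == 0:
        raise ValueError("no assigned reads to compute a rate from")
    off_target = total - hits_by_reference.get(expected, 0)
    return 1_000_000 * off_target / total


# Invented counts, chosen only so the result lands on a round number.
counts = {"E.coli": 992_000, "Chlamy": 7_500, "other": 500}
ppm = contamination_ppm(counts, "E.coli")
print(ppm, "PPM =", ppm / 10_000, "%")
```

Dividing PPM by 10,000 gives a percentage, which is how 8000 PPM corresponds to 0.8%.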

I also tested a HiSeq run of the same E.coli library and calculated a 7 PPM contamination rate, but that's not really credible since I don't know what else was present on the plate in that run, so I don't necessarily have the correct references (though there was definitely some Chlamy present). In the past I've seen various rates of cross contamination in HiSeq 2500 (<1PPM to >1000PPM) and it's actually quite hard to consistently reproduce the same numbers on different runs. The cross contamination comes from various sources, including physical contamination, though I think we've eliminated physical cross contamination in our current processes. NextSeq has generally yielded lower rates of cross contamination compared to HiSeq 2500, so we use that for our multiplexed single cells even though the quality is lower than HiSeq.

Last edited by Brian Bushnell; 07-14-2017 at 09:34 AM.
Old 07-18-2017, 07:23 AM   #91
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 246
Default

Does INDEX swapping (hopping) occur because this (release and re-annealing) is the method for generating clusters within each nanocell, and the swapping is as the result of the DNA fragments (library frags) inadvertently jumping/hopping too far into the next nanocell?
Old 07-18-2017, 08:03 AM   #92
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,030
Default

See Illumina's white paper on index hopping here.
Old 07-18-2017, 11:24 AM   #93
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 246
Default

Quote:
Originally Posted by GenoMax View Post
See Illumina's white paper on index hopping here.
Still don't understand the ExAMP chemistry, and unless I missed something, this white paper doesn't explain it. Is it that it is proprietary and largely unknown? Would you happen to have a link to where it is explained? Thanks.

Aside: Hard to believe this has been going on this long and Illumina has been largely silent about this - one would think they would have issued a protocol change for ONLY dual-index libraries on nanocell instruments.
Old 07-18-2017, 07:49 PM   #94
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,231
Default

Exclusion Amplification (ExAmp) has been explained in the following video.
https://www.youtube.com/watch?v=pfZp5Vgsbw0

Following is the link for the patent:
https://www.google.com.au/patents/WO2013188582A1?cl=en
Old 07-20-2017, 09:46 AM   #95
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 246
Default

Quote:
Originally Posted by nucacidhunter View Post
Exclusion Amplification (ExAmp) has been explained in the following video.
https://www.youtube.com/watch?v=pfZp5Vgsbw0

Following is the link for the patent:
https://www.google.com.au/patents/WO2013188582A1?cl=en
The video wasn't overly helpful, but if I understand the patent description, they're saying that they've essentially hyper optimised the bridge amplification such that once a single seed molecule binds within a nanowell, after 14 rounds, it will dominate the signal during SBS? If true, then it must be within the first two rounds that the seed molecules drift from one nanowell to the next (this is the average transport vs average amplification rate that they constantly cite in the patent). Or, does the mispriming occur PRIOR to the initial hybridisation to the nanowell, and before the first round of bridge amplification?

I now understand the need for (a) super-clean libraries and (b) size optimised libraries - to beat the "average" diffusion rate(s) on these HiSeq3000/4000/X/NovaSeq platforms.

Here's the real question: how does one detect index swapped (hopped) reads? Do you have to have a reference? It would seem that the answer would be "yes", or as Illumina suggests in their white paper, one has to a priori have an idea of the expression levels/targets?

Last edited by cement_head; 07-20-2017 at 09:55 AM. Reason: clarity
Old 07-20-2017, 12:34 PM   #96
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

The recommended method to detect an index swap is to use "Unique Dual Indexes". With these you don't use the same i7 index in multiple pairs. A given i7 index always goes with a fixed i5 index for the run. Then if you detect an i7 index with any i5 index other than its pair, you know an index hop has occurred and the reads are discarded.

This will remove all index hops that are the result of a single recombination event. It will also remove nearly all the double recombinations. So true index hops should be largely detectable.
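
The unique dual index check described above can be sketched in a few lines of Python. This is a hedged illustration only: the index sequences and the function name are hypothetical, exact matching (zero mismatches) is assumed, and real demultiplexing software handles more cases than this.

```python
from typing import Mapping


def classify_index_pair(i7: str, i5: str, expected_pairs: Mapping[str, str]) -> str:
    """Classify an observed (i7, i5) pair against a unique-dual-index sample sheet.

    expected_pairs maps each i7 index to the single i5 index it was paired with.
    Returns "valid", "hopped" (both indexes are in use on the run, but not
    together), or "unknown" (at least one index is not on the sample sheet).
    """
    if expected_pairs.get(i7) == i5:
        return "valid"
    if i7 in expected_pairs and i5 in set(expected_pairs.values()):
        return "hopped"
    return "unknown"


# Hypothetical sample sheet: each i7 is used with exactly one i5.
sample_sheet = {"ATTACTCG": "TATAGCCT", "TCCGGAGA": "ATAGAGGC"}

print(classify_index_pair("ATTACTCG", "TATAGCCT", sample_sheet))  # expected pairing
print(classify_index_pair("ATTACTCG", "ATAGAGGC", sample_sheet))  # i5 from another sample
print(classify_index_pair("ATTACTCG", "NNNNNNNN", sample_sheet))  # i5 not on the sheet
```

Reads classified as "hopped" are the ones that get discarded. A read only escapes detection if both of its indexes are swapped to the pair belonging to one other sample.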

As to what causes index hopping, I don't think that Illumina is sure. They seem mainly to have a list of "best practices" to use to lower their frequency.

I haven't looked in detail at the process of exclusion amplification either. But I presume that it involves some non-flowcell-tethered PCR amplification.

--
Phillip
Old 07-21-2017, 01:38 AM   #97
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,231
Default

My understanding is that index hopping can happen at any time in a pool that contains single-stranded library fragments, a partially complementary oligo (from PCR or adapter oligos) that can pair with a strand, and ExAmp reagents. Amplification is isothermal and is at its optimum at the temperature maintained during clustering, but as with most polymerases there should be some low-level activity at non-optimal temperatures as well. These are the reasons that preparing the pool just prior to loading and keeping it on ice is highly recommended.
Old 07-21-2017, 06:20 AM   #98
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

This is kind of tangential to NovaSeq, but...

I've suggested that we keep everything on ice whenever possible prior to sequencing, due to the fact that low temperatures retard any kind of activity and thus should inhibit adapter-swapping (which is a huge problem as we run a lot of highly-amplified single cells). But my explanations were too vague to be taken seriously, since I don't know the specifics of the reactions. I would love to have a very clear (and preferably lengthy, rather than concise) explanation of exactly why and when keeping pools on ice should prevent crosstalk, that I can copy and paste (attributing credit, if desired) to the people in charge of making libraries.

I think it is obvious that the longer you let a mixed batch of libraries sit around, and the higher the temperature, the more index-swapping will occur, regardless of the mechanism. But without citing a specific mechanism (and it does not really matter if it is the dominant one), nobody involved with library prep will pay attention to my concerns on the issue (meaning, no tests of ice vs no ice). All I really need is a real mechanism, which seems sufficiently important to cause a test to be run; once that occurs, I'll be satisfied, even if the results are negative and indicate that keeping pooled libraries at a high temperature for a long time seems to be optimal for preventing crosstalk. Not that I'll believe negative results unless I run the experiment myself, but at least I'll believe I did my best. I'll still report the results here.

Last edited by Brian Bushnell; 07-21-2017 at 06:34 AM.
Old 07-21-2017, 07:15 AM   #99
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by Brian Bushnell View Post
This is kind of tangential to NovaSeq, but...

I've suggested that we keep everything on ice whenever possible prior to sequencing, due to the fact that low temperatures retard any kind of activity and thus should inhibit adapter-swapping (which is a huge problem as we run a lot of highly-amplified single cells). But my explanations were too vague to be taken seriously, since I don't know the specifics of the reactions. I would love to have a very clear (and preferably lengthy, rather than concise) explanation of exactly why and when keeping pools on ice should prevent crosstalk, that I can copy and paste (attributing credit, if desired) to the people in charge of making libraries.

I think it is obvious that the longer you let a mixed batch of libraries sit around, and the higher the temperature, the more index-swapping will occur, regardless of the mechanism. But without citing a specific mechanism (and it does not really matter if it is the dominant one), nobody involved with library prep will pay attention to my concerns on the issue (meaning, no tests of ice vs no ice). All I really need is a real mechanism, which seems sufficiently important to cause a test to be run; once that occurs, I'll be satisfied, even if the results are negative and indicate that keeping pooled libraries at a high temperature for a long time seems to be optimal for preventing crosstalk. Not that I'll believe negative results unless I run the experiment myself, but at least I'll believe I did my best. I'll still report the results here.
Yeah, I'm more of a bench scientist by background. And until I saw nucacidhunter's post above I hadn't seen any plausible mechanism as to how purified Illumina amplicon libraries would "swap indexes" due to sitting around mixed together. That is, under normal conditions DNA is very nearly inert and stable. It doesn't recombine without the help of enzyme(s).

But I guess previous instantiations of ex-amp (HiSeq 4000/X) require the researcher to mix the "ex amp" reagent with the library pool prior to clustering on the cbot. If this reagent contains the polymerase and other reactants then it could indeed be responsible for the recommendation not to leave pools sitting around at room temp or at all.

The NovaSeq does only on-board clustering and so adds the ex-amp reagents to the denatured library pool itself. So the prohibition on "letting libraries sit around as a pool" should not be an issue for it. If this is one of the mechanisms of index-hopping...

--
Phillip
Old 07-24-2017, 10:09 AM   #100
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Hmm, we just finished processing our first (training) NovaSeq run and I am seeing evidence of index hops at about 2000PPM (0.2%). Or is it 1.6%?

We ran 21 (non-mouse) fecal DNA environmental samples (no-PCR libraries, made using the 550 bp method with the TruSeq no amp kit) and 3 mouse RNAseq (Illumina TruSeq polyA+) libraries. All just using single indexes.

The assay we used to detect index hops is imperfect -- 1000 reads from each sample were blasted against genbank and software attempts to determine the species origin based on the blast search.

Works better for some species than others. For mouse RNA, generally >90% of reads come back identified as "mus musculus". But for sorghum genomic DNA, only about 50% of the reads come back identified as sorghum.

But, nevertheless, I expect that >90% of mouse reads hopping into a non-mouse sample bin would be detected. In the 21 DNA library files we detected a range of 0-6 reads (out of 1000 per sample) called by the software as "mus musculus", and that averages to 0.2% across 21 samples.

Not sure how to scale this though. There were a total of 24 samples, 21 environmental, 3 mouse RNA. The run demultiplexed to 4 billion environmental clusters and 0.5 billion mouse RNA sample clusters. In the 4 billion environmental reads 0.2% are mouse. So is that 0.2% index hopping rate? Or because there were 1/8th the number of clustered mouse amplicons as environmental amplicons should I multiply that figure by 8?

To get a mouse read in an environmental sample, it would be necessary for an index to be "donated" from a mouse sample to an environmental amplicon. In the end I only care to use the mouse sequence to identify the percentage of reads mis-assigned overall.

Okay, generally one is cautioned to move into numbers if percentages are misleading. 0.2% of 4 billion clusters is 8x10^6, or 8 million mis-assigned clusters for the run. Those are the events I can detect. How many non-detected events would I project? Yeah, probably 1.6%.
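
The scaling above can be written out as a short Python sketch. It only reproduces the rough arithmetic in this post; it assumes the 0.2% figure corresponds to about 2 mouse-classified reads per 1000 sampled, and that undetected hops occur in proportion to the ratio of environmental to mouse clusters. The function name is hypothetical and this is not a validated estimator.

```python
def projected_hop_rate(mouse_reads_per_sample: int,
                       reads_sampled: int,
                       environmental_clusters: int,
                       mouse_clusters: int) -> tuple[int, float, float]:
    """Return (detected mis-assigned clusters, detected rate, projected overall rate)."""
    # Integer arithmetic keeps the cluster count exact.
    detected_clusters = environmental_clusters * mouse_reads_per_sample // reads_sampled
    detected_rate = mouse_reads_per_sample / reads_sampled
    # Assumption: mouse libraries are only 1/8 of the potential index donors,
    # so the detectable hops are scaled up by the environmental:mouse ratio.
    scale = environmental_clusters / mouse_clusters
    return detected_clusters, detected_rate, detected_rate * scale


detected, rate, projected = projected_hop_rate(
    mouse_reads_per_sample=2,            # about 0.2% of the 1000 reads checked
    reads_sampled=1000,
    environmental_clusters=4_000_000_000,
    mouse_clusters=500_000_000,
)
print(detected)                      # clusters detectably mis-assigned
print(f"{rate:.1%} detected, {projected:.1%} projected")
```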

These were made to run on the HiSeq (and they were).

--
Phillip

Tags
illumina, novaseq
