SEQanswers

Old 01-21-2017, 01:03 PM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default Introducing CrossBlock, a BBTool for removing cross-contamination

Illumina reads typically have short barcodes of around 8bp. This is fine when you are sequencing a couple of people with unamplified WGS on a full flowcell. However, Illumina platforms have a non-negligible rate of misassigned barcodes. The reason for this is still not clear; I suspect that some of it is sequencing error, some is impure reagents, and some is adapters breaking off, floating around, and ligating to reads from the wrong library. Regardless, crosstalk rates differ between platforms. HiSeq 2500 seems to have much higher crosstalk than NextSeq, but this is difficult to validate because different runs give different results. Currently, JGI is operating under the assumption that NextSeq gives the lowest crosstalk of all Illumina platforms, and JGI sequences crosstalk-sensitive things on NextSeq despite its much lower quality compared to HiSeq 2500.

JGI does a lot of single-cell sequencing. These cells are lysed and MDA-amplified prior to sequencing; the result is very spiky coverage spanning an exponential range. If you are just sequencing a single organism in a run, it doesn't matter. But JGI sequences 92 individual single cells on a 96-well plate, all multiplexed together. If there is no crosstalk, that's fine; you get 92 kind-of-bad assemblies (hopefully 60% genome recovery for each well). But there is a significant amount of crosstalk. This causes huge problems with assembly: even a 0.01% rate of crosstalk can result in 50% or more of non-target genome in your assembly, due to MDA's spikiness.

0.01% crosstalk is not important when you multiplex 10 humans, and only care about heterozygous or homozygous calls (though of course it is still crucial when looking for low-allele-fraction variants). But for single-cell sequencing, it is deal-breaking. The current best single-cell assembler (for Illumina reads) is SPAdes. It can handle MDA bias, which will yield 1x coverage in some places, and 100,000x coverage in other places. That means that with 0.01% crosstalk, a region at 100,000x in one sample leaks roughly 10x coverage into every other multiplexed sample. As a result, they will all assemble the same contig, which was derived from some other organism. So, you get false results.
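
To make the arithmetic above concrete, here is a minimal Python sketch. The function name is hypothetical and the numbers are just the figures quoted in the post (100,000x peak coverage, 0.01% crosstalk); it is only an illustration of the multiplication, not anything taken from BBTools.

```python
def contaminant_coverage(peak_coverage: float, crosstalk_rate: float) -> float:
    """Expected coverage leaking into each other library from a region
    that is sequenced at peak_coverage in its source library."""
    return peak_coverage * crosstalk_rate


peak = 100_000       # MDA coverage spike in the source library (figure from the post)
rate = 0.01 / 100    # 0.01% crosstalk, expressed as a fraction
leak = contaminant_coverage(peak, rate)

# Compare against the ~1x coverage that low-coverage native regions may get.
print(f"about {leak:.0f}x contaminant coverage per receiving library")
```

The point is that the leaked coverage can exceed the coverage of a sample's own low-coverage regions, so an assembler has no reason to treat the foreign contig as noise.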

This is a fundamental limitation of current technology. Reagents are impure (meaning, not 100% of your adapters carry the barcodes you expect), sequencing platforms are inaccurate (Illumina base-calling is very sensitive to leading and trailing bases; with an 8-bp barcode, you basically get 6 "decent" bases) and, as far as I can tell, adapters do in fact break off and ligate to something else.

There is no overall solution to this. However! If you are doing multiplexed single-cell sequencing on Illumina platforms, I can recommend this:

1) Allow zero barcode mismatches when demultiplexing. This is absolutely crucial. You will, of course, end up with far more unbinned reads, but that's just the price of correctness.

2) Use NextSeq. In our tests, it has yielded the lowest crosstalk rate of NextSeq/HiSeq2500/MiSeq. The error rate is vastly higher than HiSeq2500, of course, but in this situation crosstalk is more important.

3) Run CrossBlock. In synthetic tests, it eliminates 100% of contaminant contigs, with a false-positive removal rate of 0.03% (ignoring contigs under 500bp). This assumes that you multiplexed different organisms; with identical organisms, the false-positive rate will increase. Still, it can usually deal with 2-3 copies of an organism with no false positive removals. More than that is dicey. It will remove some contigs, but they will still be present somewhere. In practice, I have found that CrossBlock retains contigs somewhere (meaning, at least one copy of a sequence exists) even when there are 20 copies of the same organism.
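
As an illustration of recommendation 1, here is a minimal Python sketch of zero-mismatch barcode assignment. The barcode sequences, well names, and function name are all hypothetical; a real demultiplexer works on the index reads of the run, and this only shows the exact-match rule.

```python
from typing import Dict, List, Optional


def assign_sample(barcode: str, samples: Dict[str, str]) -> Optional[str]:
    """Return the sample for an exact barcode match, or None (unbinned).

    No mismatches are tolerated: a barcode that differs from every
    expected barcode by even one base is left unassigned.
    """
    return samples.get(barcode)


# Hypothetical 8bp barcodes, for illustration only.
samples = {"ACGTACGT": "well_A1", "TGCATGCA": "well_A2"}
observed = ["ACGTACGT", "ACGTACGA", "TGCATGCA"]

bins: List[Optional[str]] = [assign_sample(bc, samples) for bc in observed]
print(bins)
```

The second barcode is one base away from the first expected barcode, so it goes unbinned rather than being rescued into well_A1. That is the trade-off described above: more unbinned reads in exchange for fewer misassigned ones.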

What does CrossBlock do?

It compares coverage of contigs from the library that generated them, to coverage from all other libraries. If the coverage from other libraries is dramatically higher, a contig is considered a contaminant. It's quite simple.
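
The idea can be sketched in a few lines of Python. This is not CrossBlock's actual code: the function name and the `min_ratio` threshold of 10 are assumptions made up for the example, since the post does not say what "dramatically higher" means numerically.

```python
from typing import Sequence


def looks_like_contaminant(own_cov: float,
                           other_covs: Sequence[float],
                           min_ratio: float = 10.0) -> bool:
    """Flag a contig whose coverage from some other library dwarfs the
    coverage from the library it was assembled from.

    min_ratio is a hypothetical threshold chosen for this sketch.
    """
    top_other = max(other_covs, default=0.0)
    return top_other > 0 and top_other >= min_ratio * own_cov


# Contig assembled in library A at 10x, but covered at 100,000x by library B:
print(looks_like_contaminant(10.0, [100_000.0, 0.0]))   # flagged
# Contig at 500x in its own library, with only trace coverage elsewhere:
print(looks_like_contaminant(500.0, [0.05, 0.0]))       # kept
```

This also shows why many copies of the same organism on a plate are a problem: a genuine contig can legitimately have much higher coverage in a sibling library, and a ratio test alone cannot tell that apart from contamination.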

When should you use CrossBlock?

You should always use CrossBlock when dealing with different organisms, multiplexed together, where there is spiky coverage (such as single-cell, but possibly other situations).

When should you not use CrossBlock?

Most of the time! CrossBlock is only relevant to assembling novel genomes. If you are not doing assembly, don't use it. If you are not multiplexing different organisms, don't use it. Particularly, if you are multiplexing lots of things that might be the same organism... Don't use it; it can yield a lot of false-positive removals in that case. It's actually pretty good when you have 2-5 members of the same species on a plate. But if you already know you have a plate of 96 cells that are all different strains of the same species, don't use CrossBlock.

Last edited by Brian Bushnell; 01-21-2017 at 04:08 PM.
Old 01-21-2017, 04:25 PM   #2
jdk787
josh kinman
 
Location: Austin

Join Date: Apr 2014
Posts: 54
Default

Interesting tool.
Can you elaborate on what you mean by adapters breaking off and ligating to other libraries? Where and how do you think this happens?
__________________
Josh Kinman
Old 01-21-2017, 04:51 PM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

It's still not really clear when or how this happens. But, it seems like the longer you let pooled libraries sit around, the more crosstalk occurs...

We are using 2x150bp libraries, with 8bp barcodes; 8 unique left barcodes plus 12 unique right barcodes, for 96 total combinations. This is obviously not ideal and we will soon switch to 96 unique barcodes (in some cases, at least).

Testing indicates that most of the cross-contamination comes from one or the other barcode being wrong. Because every combination of a left and a right barcode is a valid pair (apart from barcodes that are not in the set at all), a read with one wrong barcode still lands in some sample's bin, which means there is a high rate of contaminant pairs. Are they a result of misreading barcodes? ...

Well, no, these crosstalk reads occur at too high a rate for misreading barcodes to explain it. We allow zero mismatches (barcodes must have all 8 bases match exactly) and still get enough contaminant reads to assemble contigs from a different organism that happened to be pooled on the same plate. The only explanation I have been able to come up with is adapters with barcodes floating around and ligating to the wrong reads.
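
To illustrate why the 8 x 12 combinatorial scheme cannot catch a single wrong barcode, here is a small Python sketch. The barcode labels (`L1`..`L8`, `R1`..`R12`) are placeholders rather than real sequences, and the helper name is hypothetical.

```python
from itertools import product

left = [f"L{i}" for i in range(1, 9)]     # 8 placeholder left barcodes
right = [f"R{j}" for j in range(1, 13)]   # 12 placeholder right barcodes
valid_pairs = set(product(left, right))   # every combination is a sample


def single_swaps(pair):
    """All pairs reachable by replacing exactly one of the two barcodes
    with a different barcode from the same set."""
    l, r = pair
    return ([(l2, r) for l2 in left if l2 != l] +
            [(l, r2) for r2 in right if r2 != r])


swaps = single_swaps(("L1", "R1"))
print(len(valid_pairs), len(swaps), all(p in valid_pairs for p in swaps))
```

Every one of the single-barcode swaps is itself a valid pair, so a fragment that picks up one wrong adapter is binned into another sample instead of being rejected.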

Actually, I have done some mapping experiments and found that some of these pairs have read 1 and read 2 mapping to different samples, so chimerism seems to also be a contributing factor. But it does not appear to be the main factor.

I have been trying to get our library-prep people to run an experiment to correlate library-prep time and temperature with cross-contamination rates (basically, test doing everything quickly and on ice), but no luck so far. Until someone tests that, I don't think it's possible to answer your question.
Old 01-23-2017, 06:51 AM   #4
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,218
Default

Quote:
Originally Posted by Brian Bushnell View Post
It's still not really clear when or how this happens. But, it seems like the longer you let pooled libraries sit around, the more crosstalk occurs...
Hi Brian,
I find it really hard to believe that the "longer you let pooled libraries sit around, the more crosstalk occurs" or "and, as far as I can tell, adapters do in fact break off and ligate to something else." I mean, mechanistically it makes no sense. Where does the ligase come from?

The sequencers you mention as being worse than NextSeq: HiSeq 2500 and MiSeq. They use on-board clustering, so some of your crosstalk may be run-to-run, rather than bar-code to bar-code. NextSeq has a bleach wash built in to its protocols, I understand. MiSeq has one available, but Illumina seems to be loath to recommend it.

Also, with respect to "impure reagents". Yes, should your adapters become cross-contaminated prior to or during library construction, this would be another source of cross-contamination.

--
Phillip

Last edited by pmiguel; 01-23-2017 at 07:52 AM. Reason: Read initial post more thoroughly...
Old 01-23-2017, 10:04 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by pmiguel View Post
Hi Brian,
I find it really hard to believe that the "longer you let pooled libraries sit around, the more crosstalk occurs" or "and, as far as I can tell, adapters do in fact break off and ligate to something else." I mean, mechanistically it makes no sense. Where does the ligase come from?
Sorry! That's just stuff I'm randomly suggesting based on my limited observations, but I have no idea of what mechanisms might be at work. They might be completely spurious, and I don't have enough data to correlate cross-talk with time with high confidence. But, we have run multiple tests on the same libraries, and experienced things like - yay! Low crosstalk! We seem to have solved everything in library-prep. Ok, but let's rerun the same library with this slightly different setting, and... oops! Subsequently, the crosstalk got dramatically worse from the same library, when all it did was sit around. This happened a couple of times. I had an idea floating around in my head that reagents had not fully reacted prior to multiplexing, but perhaps this is impossible. Still, surely DNA can, under some conditions, given enough time, bond without the aid of enzymes, or else it never would have evolved in the first place... when sticky ends or blunt ends stick together, will sequencing necessarily fail without ligase?

Quote:
The sequencers you mention as being worse than NextSeq: HiSeq 2500 and MiSeq. They use on-board clustering, so some of your crosstalk may be run-to-run, rather than bar-code to bar-code. NextSeq has a bleach wash built in to its protocols, I understand. MiSeq has one available, but Illumina seems to be loath to recommend it.
To clarify, I am strictly talking about crosstalk within a run, not between runs.

Quote:
Also, with respect to "impure reagents". Yes, should your adapters become cross-contaminated prior to or during library construction, this would be another source of cross-contamination.
And again, another difficult one to validate. We tried to examine the purity of our reagents using mass-spec, but were not able to achieve useful results.
Old 01-23-2017, 11:27 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

Quote:
Originally Posted by Brian Bushnell View Post
Ok, but let's rerun the same library with this slightly different setting, and... oops! Subsequently, the crosstalk got dramatically worse from the same library, when all it did was sit around. This happened a couple of times.
Do you recall if it was on an identical model/same sequencer with everything else remaining the same?
Old 01-23-2017, 11:40 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Same model; I can't remember about the remainder. We did a few different experiments on MiSeq, HiSeq, and NextSeq, and the NextSeq runs were definitely on the same platform because at the time we only had one.
Old 01-24-2017, 11:49 AM   #8
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,218
Default

Quote:
Originally Posted by Brian Bushnell View Post
Sorry! That's just stuff I'm randomly suggesting based on my limited observations, but I have no idea of what mechanisms might be at work. They might be completely spurious, and I don't have enough data to correlate cross-talk with time with high confidence. But, we have run multiple tests on the same libraries, and experienced things like - yay! Low crosstalk! We seem to have solved everything in library-prep. Ok, but let's rerun the same library with this slightly different setting, and... oops! Subsequently, the crosstalk got dramatically worse from the same library, when all it did was sit around. This happened a couple of times. I had an idea floating around in my head that reagents had not fully reacted prior to multiplexing, but perhaps this is impossible. Still, surely DNA can, under some conditions, given enough time, bond without the aid of enzymes, or else it never would have evolved in the first place... when sticky ends or blunt ends stick together, will sequencing necessarily fail without ligase?
Okay, but Brian, if you are talking about reactions occurring in the primordial soup prior to the evolution of ligases, it would be occurring in a highly reducing atmosphere/environment and if it took 100 years to happen, there would be plenty of time.

In all likelihood your colleagues are working in an oxidizing environment (our current atmosphere) and probably don't allow reactions to go for 100 years.

So, under modern lab conditions it is safe to say that pieces of DNA do not ligate together in the absence of an enzyme to catalyze the reaction and energy, probably in the form of ATP, to drive the reaction forward.

For most library prep methods the ligation step would happen before a clean-up (to remove undesired products) and a PCR reaction. Both of these should be sufficient to remove and/or denature any ligase protein and other reactants necessary to efficiently ligate DNA.

Also, the ligase used is most likely T4 DNA ligase -- which has very poor single-stranded polynucleotide ligation capabilities.

Quote:
Originally Posted by Brian Bushnell View Post
To clarify, I am strictly talking about crosstalk within a run, not between runs.
Are you sure? If you did a test run on an instrument that during its previous run had indexes that overlap with those you used for your test then you might expect to find as much as 1% contamination from that previous run. At least if you were using a MiSeq without a bleach wash, or a HiSeq Rapid run with on-board clustering.

--
Phillip
Old 01-26-2017, 07:55 AM   #9
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 794
Default

I've just done a nanopore run where we have designed an experiment to specifically look for chimeric reads (from an amplicon size of 600-1000bp). I've been getting about 0.5-1.5% of reads that appear to have ligated together post-barcoding (i.e. during the adapter ligation step), and a scattering of reads that appear to have ligated during sample loading or on the flow cell. It's possible that this "ligation" is actually a software issue (e.g. multiple reads appearing too quickly after each other during sequencing), so I need to do a bit more exploration of the raw sequence signal to confirm this.

This effect can be discovered quite easily from nanopore reads because you get the entire sequence, but what concerns me is how common it might be in sample prep using other platforms (e.g. Illumina), where it can't be properly quantified. If this spontaneous ligation is happening, I expect that it would be indistinguishable from the cross-talk that you mention here.
Old 01-26-2017, 08:28 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by gringer View Post
This effect can be discovered quite easily from nanopore reads because you get the entire sequence, but what concerns me is how common it might be in sample prep using other platforms (e.g. Illumina), where it can't be properly quantified. If this spontaneous ligation is happening, I expect that it would be indistinguishable from the cross-talk that you mention here.
Right; this is our biggest problem (aside from the crosstalk itself, of course). There are so many possible causes of crosstalk, and we have so few tools to distinguish between them, that it's hard to even design experiments that will clearly identify or eliminate individual possibilities.

I have noticed, though, that when mapping reads from the unbinned output of a pool (read pairs with invalid barcodes or invalid barcode pairs) they had a much higher discordant pairing rate; specifically, read 1 mapping to one library's assembly and read 2 mapping to another library's assembly. This indicates some kind of chimerism occurring rather than carry-over contamination or error in the barcode-reading cycles.

P.S. It could also indicate cluster misassignment when clusters are too close, or a host of other things, but chimeric joins between reads from different libraries are one possibility.
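
For anyone who wants to repeat the discordant-pair check described above, a rough Python sketch of the bookkeeping follows. It assumes you have already mapped each read and recorded which library's assembly it hit best; the input format and function name are hypothetical, and the example data are made up.

```python
from typing import Iterable, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def cross_library_rate(pairs: Iterable[Pair]) -> float:
    """Fraction of fully mapped pairs whose read 1 and read 2 map to
    assemblies from different libraries. None means the read was unmapped;
    such pairs are excluded from the denominator."""
    mapped = [(a, b) for a, b in pairs if a is not None and b is not None]
    if not mapped:
        return 0.0
    discordant = sum(1 for a, b in mapped if a != b)
    return discordant / len(mapped)


# Made-up example: (library hit by read 1, library hit by read 2)
example = [("A", "A"), ("A", "B"), ("C", None), ("B", "B")]
print(f"{cross_library_rate(example):.2f}")
```

Comparing this rate between the unbinned reads and the properly binned reads is one way to see whether the unbinned fraction is enriched for chimeric pairs.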

Last edited by Brian Bushnell; 01-26-2017 at 08:31 AM.
Old 03-03-2017, 10:23 PM   #11
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 415
Default

http://enseqlopedia.com/2016/12/inde...000-and-x-ten/ talks about index crosstalk with a mechanism proposed, but it only involves patterned flow cells.
__________________
Providing nextRAD genotyping services. http://snpsaurus.com