  • Introducing CrossBlock, a BBTool for removing cross-contamination

    Illumina reads typically have short barcodes of around 8bp. This is fine when you are sequencing a couple of people with unamplified WGS on a full flowcell. However, Illumina platforms have a non-negligible rate of misassigned barcodes. The reason for this is still not clear; I suspect that some of it is sequencing error, some is impure reagents, and some is adapters breaking off, floating around, and ligating to reads from the wrong library. Regardless, crosstalk rates differ between platforms. HiSeq 2500 seems to have much higher crosstalk than NextSeq, but this is difficult to validate because different runs give different results. Currently, JGI is operating under the assumption that NextSeq gives the lowest crosstalk of all Illumina platforms, and sequences crosstalk-sensitive things on NextSeq despite its much lower quality compared to HiSeq 2500.

    JGI does a lot of single-cell sequencing. These cells are lysed and MDA-amplified prior to sequencing; the result is an exponential range of coverage, which is very spiky. If you are just sequencing a single organism in a run, it doesn't matter. But, JGI sequences 92 individual single cells on a 96-well plate, all multiplexed together. If there is no crosstalk, that's fine; you get 92 kind of bad assemblies (hopefully 60% genome recovery for each well). But, there is a significant amount of crosstalk. This causes huge problems with assembly - even a 0.01% rate of crosstalk can result in 50% or more of non-target genome in your assembly, due to MDA's spikiness.

    0.01% crosstalk is not important when you multiplex 10 humans and only care about heterozygous or homozygous calls (though of course it is still crucial when looking for low-allele-fraction variants). But for single-cell sequencing, it is a deal-breaker. The current best single-cell assembler (for Illumina reads) is SPAdes. It can handle MDA bias, which will yield 1x coverage in some places and 100,000x coverage in other places. That means a region at 100,000x in one sample will, at 0.01% crosstalk, contribute 10x coverage to every other multiplexed sample. As a result, they will all assemble the same contig, which was derived from some other organism. So, you get false results.
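
    The arithmetic in the paragraph above can be written out as a tiny sketch. It assumes, as a simplification, that the crosstalk rate applies uniformly to each receiving library; the function name is made up for illustration.

```python
def contaminant_coverage(peak_coverage: float, crosstalk_rate: float) -> float:
    """Coverage expected to leak into another library from a region
    that has peak_coverage in its source library (simplified model)."""
    return peak_coverage * crosstalk_rate


peak = 100_000       # coverage at an MDA spike, figure taken from the post
rate = 0.01 / 100    # 0.01% crosstalk, expressed as a fraction
leaked = contaminant_coverage(peak, rate)

print(f"{leaked:.1f}x")  # prints 10.0x
```

    Since MDA can leave parts of the target genome at around 1x, a 10x contaminant region can easily out-cover the real genome in the receiving library.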

    This is a fundamental limitation of current technology. Reagents are impure (meaning, your adapters do not have 100% the barcodes you expect), sequencing platforms are inaccurate (Illumina base-calling is very sensitive to leading and trailing bases; with an 8-bp barcode, you basically get 6 "decent" bases) and, as far as I can tell, adapters do in fact break off and ligate to something else.

    There is no overall solution to this. However! If you are doing multiplexed single-cell sequencing on Illumina platforms, I can recommend this:

    1) Allow zero barcode mismatches when demultiplexing. This is absolutely crucial. You will, of course, end up with far more unbinned reads, but that's just the price of correctness.

    2) Use NextSeq. In our tests, it has yielded the lowest crosstalk rate of NextSeq/HiSeq2500/MiSeq. The error rate is vastly higher than HiSeq2500, of course, but in this situation crosstalk is more important.

    3) Run CrossBlock. In synthetic tests, it eliminates 100% of contaminant contigs, with a false-positive removal rate of 0.03% (ignoring contigs under 500bp). This assumes that you multiplexed different organisms; with identical organisms, the false-positive rate will increase. Still, it can usually deal with 2-3 copies of an organism with no false positive removals. More than that is dicey. It will remove some contigs, but they will still be present somewhere. In practice, I have found that CrossBlock retains contigs somewhere (meaning, at least one copy of a sequence exists) even when there are 20 copies of the same organism.
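
    Recommendation 1 (zero barcode mismatches) amounts to an exact dictionary lookup. The following is a minimal sketch of that idea only; the barcodes, well names, and function are hypothetical, and real demultiplexers work on BCL/FASTQ data rather than tuples.

```python
from collections import defaultdict


def demultiplex_exact(reads, barcode_to_sample):
    """Bin (barcode, sequence) reads by exact barcode match only.

    Any barcode not in the table, including one that is a single
    mismatch away from a known barcode, goes to "unbinned".
    """
    bins = defaultdict(list)
    for barcode, seq in reads:
        sample = barcode_to_sample.get(barcode, "unbinned")
        bins[sample].append(seq)
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGTACGT": "well_A1", "TGCATGCA": "well_A2"}
reads = [
    ("ACGTACGT", "read1"),
    ("ACGTACGA", "read2"),  # one mismatch from well_A1: not rescued
    ("TGCATGCA", "read3"),
]
binned = demultiplex_exact(reads, barcodes)
```

    Here read2 ends up unbinned instead of being assigned to well_A1, which is the trade-off described in point 1: fewer usable reads, but no guessing.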

    What does CrossBlock do?

    For each contig, it compares the coverage from the library that generated it to the coverage from all other libraries. If the coverage from other libraries is dramatically higher, the contig is considered a contaminant. It's quite simple.
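
    To make the idea concrete, here is a rough sketch of a coverage-ratio check. This is not CrossBlock's actual code or its actual thresholds; the function name, the data layout, and the `min_ratio` cutoff standing in for "dramatically higher" are all assumptions.

```python
def flag_contaminants(self_cov, other_cov, min_ratio=10.0):
    """Return contigs whose best coverage from another library is at
    least min_ratio times the coverage from their own library.

    self_cov:  {contig: coverage from the library that assembled it}
    other_cov: {contig: {other_library: coverage from that library}}
    min_ratio: hypothetical cutoff for "dramatically higher"
    """
    flagged = []
    for contig, own in self_cov.items():
        best_other = max(other_cov.get(contig, {}).values(), default=0.0)
        # Compare products rather than dividing, so own == 0 is safe.
        if best_other > 0 and best_other >= min_ratio * own:
            flagged.append(contig)
    return flagged


# Made-up coverage values for two contigs assembled from library A.
self_cov = {"contig_1": 40.0, "contig_2": 10.0}
other_cov = {
    "contig_1": {"lib_B": 0.2, "lib_C": 0.0},     # mostly native to A
    "contig_2": {"lib_B": 5000.0, "lib_C": 8.0},  # far higher in B
}
flagged = flag_contaminants(self_cov, other_cov)
```

    With these numbers only contig_2 is flagged, since library B covers it 500 times more deeply than the library it was assembled from.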

    When should you use CrossBlock?

    You should always use CrossBlock when dealing with different organisms, multiplexed together, where there is spiky coverage (such as single-cell, but possibly other situations).

    When should you not use CrossBlock?

    Most of the time! CrossBlock is only relevant to assembling novel genomes. If you are not doing assembly, don't use it. If you are not multiplexing different organisms, don't use it. Particularly, if you are multiplexing lots of things that might be the same organism... Don't use it; it can yield a lot of false-positive removals in that case. It's actually pretty good when you have 2-5 members of the same species on a plate. But if you already know you have a plate of 96 cells that are all different strains of the same species, don't use CrossBlock.
    Last edited by Brian Bushnell; 01-21-2017, 05:08 PM.

  • #2
    Interesting tool.
    Can you elaborate on what you mean by adapters breaking off and ligating to other libraries? Where and how do you think this happens?
    Josh Kinman



    • #3
      It's still not really clear when or how this happens. But, it seems like the longer you let pooled libraries sit around, the more crosstalk occurs...

      We are using 2x150bp libraries, with 8bp barcodes; 8 unique left barcodes plus 12 unique right barcodes, for 96 total combinations. This is obviously not ideal and we will soon switch to 96 unique barcodes (in some cases, at least).

      Testing indicates that most of the cross-contamination comes from one or the other barcode being wrong. Considering every barcode pair is valid, aside from unexpected barcodes, this means there is a high rate of contaminant pairs. Are they a result of misreading barcodes? ...

      Well, no, these crosstalk reads occur at too high a rate for misreading barcodes to explain it. We allow zero mismatches (barcodes must have all 8 bases match exactly) and still get enough contaminant reads to assemble contigs from a different organism that happened to be pooled on the same plate. The only explanation I have been able to come up with is adapters with barcodes floating around and ligating to the wrong reads.
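
      The point about every barcode pair being valid can be illustrated with a short sketch. The barcode labels are placeholders, and the second scheme is only one reading of "96 unique barcodes" (each well gets its own left and its own right barcode), which is an assumption.

```python
from itertools import product


def swap_outcomes(pair, valid_pairs, lefts, rights):
    """Swap exactly one barcode of `pair` for every alternative and count
    how many results are still valid pairs vs. detectably invalid."""
    l, r = pair
    swapped = [(x, r) for x in lefts if x != l]
    swapped += [(l, y) for y in rights if y != r]
    hits = sum(p in valid_pairs for p in swapped)
    return hits, len(swapped) - hits


# Combinatorial scheme from the post: 8 left x 12 right = 96 wells.
left = [f"L{i}" for i in range(1, 9)]
right = [f"R{j}" for j in range(1, 13)]
valid_pairs = set(product(left, right))
combinatorial = swap_outcomes(("L1", "R1"), valid_pairs, left, right)

# Assumed alternative: 96 wells, each with its own left and right barcode.
u_left = [f"L{i}" for i in range(1, 97)]
u_right = [f"R{i}" for i in range(1, 97)]
unique_pairs = set(zip(u_left, u_right))
unique = swap_outcomes(("L1", "R1"), unique_pairs, u_left, u_right)
```

      In the combinatorial layout, all 18 possible single-barcode swaps land in another real well, so none can be caught at demultiplexing. In the assumed unique layout, all 190 swaps produce an invalid pair and would be unbinned.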

      Actually, I have done some mapping experiments and found that some of these pairs have read 1 and read 2 mapping to different samples, so chimerism seems to also be a contributing factor. But it does not appear to be the main factor.

      I have been trying to get our library-prep people to run an experiment to correlate library-prep time and temperature with cross-contamination rates (basically, test doing everything quickly and on ice), but no luck so far. Until someone tests that, I don't think it's possible to answer your question.



      • #4
        Originally posted by Brian Bushnell
        It's still not really clear when or how this happens. But, it seems like the longer you let pooled libraries sit around, the more crosstalk occurs...
        Hi Brian,
        I find it really hard to believe that the "longer you let pooled libraries sit around, the more crosstalk occurs" or "and, as far as I can tell, adapters do in fact break off and ligate to something else." I mean, mechanistically it makes no sense. Where does the ligase come from?

        The sequencers you mention as being worse than NextSeq: HiSeq 2500 and MiSeq. They use on-board clustering, so some of your crosstalk may be run-to-run, rather than bar-code to bar-code. NextSeq has a bleach wash built in to its protocols, I understand. MiSeq has one available, but Illumina seems to be loath to recommend it.

        Also, with respect to "impure reagents". Yes, should your adapters become cross-contaminated prior to or during library construction, this would be another source of cross-contamination.

        --
        Phillip
        Last edited by pmiguel; 01-23-2017, 08:52 AM. Reason: Read initial post more thoroughly...



        • #5
          Originally posted by pmiguel
          Hi Brian,
          I find it really hard to believe that the "longer you let pooled libraries sit around, the more crosstalk occurs" or "and, as far as I can tell, adapters do in fact break off and ligate to something else." I mean, mechanistically it makes no sense. Where does the ligase come from?
          Sorry! That's just stuff I'm randomly suggesting based on my limited observations, but I have no idea of what mechanisms might be at work. They might be completely spurious, and I don't have enough data to correlate cross-talk with time with high confidence. But, we have run multiple tests on the same libraries, and experienced things like - yay! Low crosstalk! We seem to have solved everything in library-prep. Ok, but let's rerun the same library with this slightly different setting, and... oops! Subsequently, the crosstalk got dramatically worse from the same library, when all it did was sit around. This happened a couple of times. I had an idea floating around in my head that reagents had not fully reacted prior to multiplexing, but perhaps this is impossible. Still, surely DNA can, under some conditions, given enough time, bond without the aid of enzymes, or else it never would have evolved in the first place... when sticky ends or blunt ends stick together, will sequencing necessarily fail without ligase?

          Originally posted by pmiguel
          The sequencers you mention as being worse than NextSeq: HiSeq 2500 and MiSeq. They use on-board clustering, so some of your crosstalk may be run-to-run, rather than bar-code to bar-code. NextSeq has a bleach wash built in to its protocols, I understand. MiSeq has one available, but Illumina seems to be loath to recommend it.
          To clarify, I am strictly talking about crosstalk within a run, not between runs.

          Originally posted by pmiguel
          Also, with respect to "impure reagents". Yes, should your adapters become cross-contaminated prior to or during library construction, this would be another source of cross-contamination.
          And again, another difficult one to validate. We tried to examine the purity of our reagents using mass-spec, but were not able to achieve useful results.



          • #6
            Originally posted by Brian Bushnell
            Ok, but let's rerun the same library with this slightly different setting, and... oops! Subsequently, the crosstalk got dramatically worse from the same library, when all it did was sit around. This happened a couple of times.
            Do you recall if it was on an identical model/same sequencer with everything else remaining the same?



            • #7
              Same model; I can't remember about the remainder. We did a few different experiments on MiSeq, HiSeq, and NextSeq, and the NextSeq runs were definitely on the same platform because at the time we only had one.



              • #8
                Originally posted by Brian Bushnell
                Sorry! That's just stuff I'm randomly suggesting based on my limited observations, but I have no idea of what mechanisms might be at work. They might be completely spurious, and I don't have enough data to correlate cross-talk with time with high confidence. But, we have run multiple tests on the same libraries, and experienced things like - yay! Low crosstalk! We seem to have solved everything in library-prep. Ok, but let's rerun the same library with this slightly different setting, and... oops! Subsequently, the crosstalk got dramatically worse from the same library, when all it did was sit around. This happened a couple of times. I had an idea floating around in my head that reagents had not fully reacted prior to multiplexing, but perhaps this is impossible. Still, surely DNA can, under some conditions, given enough time, bond without the aid of enzymes, or else it never would have evolved in the first place... when sticky ends or blunt ends stick together, will sequencing necessarily fail without ligase?
                Okay, but Brian, if you are talking about reactions occurring in the primordial soup prior to the evolution of ligases, it would be occurring in a highly reducing atmosphere/environment and if it took 100 years to happen, there would be plenty of time.

                In all likelihood your colleagues are working in an oxidizing environment (our current atmosphere) and probably don't allow reactions to go for 100 years.

                So, under modern lab conditions it is safe to say that pieces of DNA do not ligate together in the absence of an enzyme to catalyze the reaction and energy, probably in the form of ATP, to drive the reaction forward.

                For most library prep methods the ligation step would happen before a clean-up (to remove undesired products) and a PCR reaction. Both of these should be sufficient to remove and/or denature any ligase protein and other reactants necessary to efficiently ligate DNA.

                Also, the ligase used is most likely T4 DNA ligase -- which has very poor single-stranded polynucleotide ligation capabilities.

                Originally posted by Brian Bushnell
                To clarify, I am strictly talking about crosstalk within a run, not between runs.
                Are you sure? If you did a test run on an instrument that during its previous run had indexes that overlap with those you used for your test then you might expect to find as much as 1% contamination from that previous run. At least if you were using a MiSeq without a bleach wash, or a HiSeq Rapid run with on-board clustering.

                --
                Phillip



                • #9
                  I've just done a nanopore run where we have designed an experiment to specifically look for chimeric reads (from an amplicon size of 600-1000bp). I've been getting about 0.5-1.5% of reads that appear to have ligated together post-barcoding (i.e. during the adapter ligation step), and a scattering of reads that appear to have ligated during sample loading or on the flow cell. It's possible that this "ligation" is actually a software issue (e.g. multiple reads appearing too quickly after each other during sequencing), so I need to do a bit more exploration of the raw sequence signal to confirm this.

                  This effect can be discovered quite easily from nanopore reads because you get the entire sequence, but what concerns me is how common it might be in sample prep using other platforms (e.g. Illumina), where it can't be properly quantified. If this spontaneous ligation is happening, I expect that it would be indistinguishable from the cross-talk that you mention here.



                  • #10
                    Originally posted by gringer
                    This effect can be discovered quite easily from nanopore reads because you get the entire sequence, but what concerns me is how common it might be in sample prep using other platforms (e.g. Illumina), where it can't be properly quantified. If this spontaneous ligation is happening, I expect that it would be indistinguishable from the cross-talk that you mention here.
                    Right; this is our biggest problem (aside from the crosstalk itself, of course). There are so many possible causes of crosstalk, and we have so few tools to distinguish between them, that it's hard to even design experiments that will clearly identify or eliminate individual possibilities.

                    I have noticed, though, that when mapping reads from the unbinned output of a pool (read pairs with invalid barcodes or invalid barcode pairs) they had a much higher discordant pairing rate; specifically, read 1 mapping to one library's assembly and read 2 mapping to another library's assembly. This indicates some kind of chimerism occurring rather than carry-over contamination or error in the barcode-reading cycles.
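
                    The comparison described above boils down to counting pairs whose two reads map to different libraries. A minimal sketch, assuming each read has already been assigned to the library whose assembly it maps to best (the input format and function name are made up, and the example pairs are not real data):

```python
def discordant_rate(pair_assignments):
    """Fraction of fully mapped pairs whose reads map to different libraries.

    pair_assignments: iterable of (library_of_read1, library_of_read2),
    with None meaning the read did not map. Pairs with an unmapped read
    are skipped.
    """
    mapped = [(a, b) for a, b in pair_assignments
              if a is not None and b is not None]
    if not mapped:
        return 0.0
    discordant = sum(a != b for a, b in mapped)
    return discordant / len(mapped)


# Invented example pairs, not real data.
binned_pairs = [("A", "A"), ("B", "B"), ("A", "A"), ("C", "C")]
unbinned_pairs = [("A", "B"), ("A", "A"), ("C", "A"), (None, "B")]

binned_rate = discordant_rate(binned_pairs)
unbinned_rate = discordant_rate(unbinned_pairs)
```

                    A clearly higher rate in the unbinned pairs than in the properly binned ones is the pattern that would point toward chimeric joins, subject to the caveats in the P.S.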

                    P.S. It could also indicate cluster misassignment when clusters are too close, or a host of other things, but chimeric joins between reads from different libraries is one possibility.
                    Last edited by Brian Bushnell; 01-26-2017, 09:31 AM.



                    • #11
                      http://enseqlopedia.com/2016/12/inde...000-and-x-ten/ talks about index crosstalk with a mechanism proposed, but it only involves patterned flow cells.
                      Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
