
  • Bias in unique molecular identifier usage

    Hi All,

    I'm analysing some iCLIP data generated following Konig et al., JoVE 2011.

    The protocol uses a 5-nucleotide unique molecular identifier (UMI), incorporated around the library barcode at the start of the read thus:

    UUUBBBBUU

    where B is a library barcode base and U is a UMI base.
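
    A minimal sketch of how a read start laid out as UUUBBBBUU could be split into its parts. The function name and the example read are hypothetical and only illustrate the layout described above; they are not taken from the protocol.

```python
def split_umi_barcode(seq, pattern="UUUBBBBUU"):
    """Split the leading bases of a read into (UMI, barcode, remaining insert).

    `pattern` marks each leading position as U (UMI base) or B (barcode base).
    """
    if len(seq) < len(pattern):
        raise ValueError("read is shorter than the UMI/barcode pattern")
    # Collect the bases that sit under each letter of the pattern.
    umi = "".join(base for base, kind in zip(seq, pattern) if kind == "U")
    barcode = "".join(base for base, kind in zip(seq, pattern) if kind == "B")
    return umi, barcode, seq[len(pattern):]


# Hypothetical read: 9 leading UMI/barcode bases followed by the insert.
umi, barcode, insert = split_umi_barcode("ACGTTGCAAGGGCCCTTT")
print(umi, barcode, insert)
```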

    I extracted and recorded the UMI for each read and, after mapping, deduplicated by removing any read that mapped to the same location and had the same UMI sequence as another read.
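
    The deduplication step described above can be sketched as follows. This is only an illustration under assumptions: reads are represented as plain dictionaries with hypothetical keys (`chrom`, `pos`, `strand`, `umi`), and the first read seen for each combination is the one kept.

```python
def deduplicate(reads):
    """Keep one read per (chrom, pos, strand, UMI) combination.

    `reads` is an iterable of dicts with the hypothetical keys
    'chrom', 'pos', 'strand' and 'umi'. The first read seen wins.
    """
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


example = [
    {"chrom": "chr1", "pos": 100, "strand": "+", "umi": "ACGAA"},
    {"chrom": "chr1", "pos": 100, "strand": "+", "umi": "ACGAA"},  # duplicate
    {"chrom": "chr1", "pos": 100, "strand": "+", "umi": "TTGCA"},  # same site, new UMI
    {"chrom": "chr1", "pos": 250, "strand": "+", "umi": "ACGAA"},  # same UMI, new site
]
print(len(deduplicate(example)))
```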

    If UMIs were incorporated completely at random, I would expect the number of reads carrying each UMI in a sample (after de-duplication) to be roughly equal, following a binomial distribution, which at such high counts should approximate a normal distribution. But this is not what I see: the distributions of UMI usage look much more log-normal than normal.

    Have other people seen this? What potential biases could this introduce into downstream analysis? It feels to me that, as long as there is no interaction between fragment sequence and UMI sequence, this just means that it is effectively as if fewer independent UMIs were used, but I'd love to hear what other people think.
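
    To make the expectation above concrete: with a 5-base UMI there are 4^5 = 1024 possible UMIs, so under uniform random incorporation the count for each UMI among n deduplicated reads would be Binomial(n, 1/1024). The sketch below computes the expected mean and standard deviation and a variance-to-mean ratio for observed counts. The function names and the read total are made up for illustration.

```python
import math
import statistics


def expected_umi_count_stats(n_reads, umi_length=5):
    """Return (number of UMIs, expected mean, expected SD) of reads per UMI,
    assuming every UMI is equally likely (binomial model)."""
    n_umis = 4 ** umi_length
    p = 1.0 / n_umis
    mean = n_reads * p
    sd = math.sqrt(n_reads * p * (1 - p))
    return n_umis, mean, sd


def dispersion_index(counts):
    """Variance-to-mean ratio of observed per-UMI counts.

    Under the binomial model this is close to 1; values far above 1
    mean the counts are more spread out than uniform usage predicts."""
    return statistics.pvariance(counts) / statistics.mean(counts)


# Hypothetical sample with 1,024,000 deduplicated reads.
n_umis, mean, sd = expected_umi_count_stats(1_024_000)
print(n_umis, mean, round(sd, 1))
```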

  • #2
    Hi sudders,

    From my very limited experience with random barcodes, I've heard that this bias could be due to 2 reasons.
    a) Synthesis bias. Some oligo companies suggest manually mixing the 4 bases when they do random synthesis. One rep once told me that they have seen 20-30% variation/bias in base incorporation with automatic mixing.
    b) Ligation bias. Some bases/sequences ligate less efficiently to your library.

    BTW, I would like to design a 5-base random barcode and have it at both ends of my DNA library. Do you know of any open-source pipeline that would help me remove PCR duplicates using both paired-end barcodes? (About 1 million combinations in total.)


    • #3
      An easy option is to use 6-base indexes from a well-established kit and add 6 random bases after the index. With a 12-cycle index read, one can identify PCR duplicates based on those 6 random bases. For more diverse UMIs, one can add 8 random bases and increase the index read to 14 cycles. Obviously, this applies to Y adapters, and the library must be amplified with short P5 and P7 primers.
      Last edited by nucacidhunter; 07-28-2015, 07:09 PM.


      • #4
        Hi nucacidhunter,

        Thank you for your input. What I've seen many people do out there is read the UMI as part of their DNA library insert and not as part of the index. I still don't know how Agilent does this with their HaloPlex HS.

        I have seen a few tools that can find the UMI and add it to the read header (TagDust2 and MIGEC, for example), but I don't have a clue how to proceed from that point and get rid of PCR duplicates...

        Any idea? I'll start a thread since I am rather clueless with the whole procedure.


        • #5
          Hi donquijotes,

          Sorry for the late reply; I don't check this forum very often.

          First, in our practice we integrate UMIs using RT-PCR template switching. We don't see severe synthesis bias in UMI sequences (see http://www.jimmunol.org/content/194/...ml?with-ds=yes figure B). Note that there is another good study covering possible biases in UMI-based sequencing (see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3562004/).

          Indeed, UMI usage distribution is log-normal. We observe it in all our datasets and attribute it to PCR amplification. Once you append the read mapping position to your UMI in the header, you can assemble consensus sequences and forget about raw reads, counting only UMIs.
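
          A toy simulation of why multiplicative amplification tends to give log-normal-looking copy numbers: if each cycle multiplies a molecule's copy number by a slightly different factor, the log of the final count is a sum of many small random terms and so looks roughly normal. The per-cycle efficiency range, cycle count and function names are assumptions chosen for illustration, not measurements from any dataset in this thread.

```python
import math
import random


def simulate_pcr_copies(n_molecules=5000, n_cycles=20, eff_range=(0.6, 1.0), seed=0):
    """Simulate final copy numbers when each cycle multiplies a molecule's
    copies by (1 + efficiency), with efficiency drawn uniformly per cycle."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_molecules):
        c = 1.0
        for _ in range(n_cycles):
            c *= 1.0 + rng.uniform(*eff_range)
        copies.append(c)
    return copies


def skewness(values):
    """Sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


copies = simulate_pcr_copies()
# Raw copy numbers are right-skewed; their logarithms are close to symmetric.
print(skewness(copies), skewness([math.log(c) for c in copies]))
```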

          Unfortunately it is not possible to assemble reads in MIGEC based on UMI+position as it was designed for amplicon libraries.

          With 10-12 bp, the diversity of UMIs would be 10^6 - 1.7x10^7. If you estimate it to be >> the number of starting molecules, you can simply run the "Checkout" and "Assemble" routines of MIGEC (see docs here) to get a list of assembled consensuses.
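
          For reference, the number of distinct UMIs of length n is 4^n. The sketch below computes it, together with a birthday-problem estimate of how likely it is that at least two starting molecules receive the same UMI, assuming UMIs are drawn uniformly at random (an idealisation). The function names are illustrative.

```python
def umi_diversity(length):
    """Number of distinct UMI sequences of the given length (4 bases)."""
    return 4 ** length


def collision_probability(n_molecules, n_umis):
    """Probability that at least two of n_molecules share a UMI when each
    UMI is drawn uniformly at random from n_umis possibilities."""
    p_all_distinct = 1.0
    for i in range(n_molecules):
        p_all_distinct *= max(0.0, 1.0 - i / n_umis)
    return 1.0 - p_all_distinct


for n in (5, 10, 12):
    print(n, umi_diversity(n))
# 5 1024
# 10 1048576
# 12 16777216
```

          Under this model, collisions become likely well before the number of molecules reaches the number of UMIs, which is why the diversity needs to be much larger than the number of starting molecules.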

          Hope this helps,
          Mike
          Last edited by mikesh; 08-11-2015, 11:04 AM.


          • #6
            I just thought people might like to know that we never did get to the bottom of the biased UMI usage, but we did find an even bigger problem: PCR and sequencing errors in UMIs.

            We've created some tools for dealing with UMIs: one processes FASTQ files to move the UMI sequence from the read to the read name before mapping, and another implements a number of schemes for error-aware deduplication after mapping.
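
            To illustrate what error-aware deduplication can mean, here is a simple greedy sketch that merges UMIs observed at one mapping position when they differ by at most one base from a more abundant UMI. This is an assumed toy scheme for illustration only; it is not claimed to be any of the schemes implemented in UMI-tools, and the function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umis, max_dist=1):
    """Greedily merge UMIs seen at a single mapping position.

    UMIs are visited from most to least abundant; each one is merged into
    the first already-kept UMI within `max_dist` mismatches, otherwise it
    becomes a new representative. Returns {representative: merged count}.
    """
    counts = Counter(umis)
    representatives = {}
    for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in representatives:
            if hamming(umi, rep) <= max_dist:
                representatives[rep] += n
                break
        else:
            representatives[umi] = n
    return representatives


# ACGTT is one mismatch from the abundant ACGTA, so it is treated as an error copy.
observed = ["ACGTA"] * 5 + ["ACGTT"] + ["GGGGG"] * 2
print(collapse_umis(observed))
```

            Counting the representatives rather than the raw UMIs gives 2 molecules here instead of 3.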

            See https://github.com/CGATOxford/UMI-tools

            Ian