
sudders 05-13-2014 09:08 AM

Bias in unique molecular identifier usage
Hi All,

I'm analysing some iCLIP data generated following König et al., JoVE 2011.

Included in the protocol is the use of a 5-nucleotide unique molecular identifier (UMI), incorporated around the library barcode at the start of the read thus:


where B is a library barcode base and U is a UMI base.

I extracted and recorded the UMI for each read and, after mapping, deduplicated by removing any read that mapped to the same location and had the same UMI sequence as another read.
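For readers who want the deduplication step spelled out, here is a minimal sketch of the idea (keep one read per combination of mapping location and UMI). The `Read` record and its field names are hypothetical and only for illustration; real data would come from a BAM file with the UMI stored in the read name.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for a mapped read."""
    name: str
    chrom: str
    pos: int
    strand: str
    umi: str


def deduplicate(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen for each (location, UMI) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("r1", "chr1", 100, "+", "ACGTA"),
    Read("r2", "chr1", 100, "+", "ACGTA"),  # same location, same UMI: duplicate
    Read("r3", "chr1", 100, "+", "TTGCA"),  # same location, different UMI: kept
    Read("r4", "chr1", 250, "+", "ACGTA"),  # different location: kept
]
unique_reads = deduplicate(reads)
print([r.name for r in unique_reads])
```

Note that this exact-match scheme treats a UMI with a single sequencing error as a new molecule.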

I think that if the incorporation of a UMI into a read were completely random, the number of reads with each UMI in each sample (after de-duplication) would follow a binomial distribution with roughly equal counts across UMIs, and at such high numbers this should approximate a normal distribution. But this is not what I see. The distributions of UMI usage are much more like log-normal distributions than normal distributions.

Have other people seen this? What are the potential biases this could introduce into downstream analysis? It feels to me that, as long as there is no interaction between fragment sequence and UMI sequence, this just means that it's effectively as if fewer independent UMIs were used, but I'd love to hear what other people think.
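To make the expectation concrete, here is a small simulation sketch of the null model described above: every deduplicated read draws one of the 4^5 = 1024 possible UMIs uniformly at random, so the count per UMI is Binomial(N, 1/1024). The read total `N_READS` is a made-up number for illustration.

```python
import math
import random
from collections import Counter

UMI_LENGTH = 5
N_UMIS = 4 ** UMI_LENGTH   # 1024 possible 5-nt UMIs
N_READS = 200_000          # hypothetical number of deduplicated reads

# Theoretical binomial mean and standard deviation of the count per UMI.
p = 1 / N_UMIS
expected_mean = N_READS * p
expected_sd = math.sqrt(N_READS * p * (1 - p))

# Simulate uniform UMI usage (UMIs are represented by integer indices).
rng = random.Random(0)
counts = Counter(rng.choices(range(N_UMIS), k=N_READS))
per_umi = [counts.get(i, 0) for i in range(N_UMIS)]

observed_mean = sum(per_umi) / N_UMIS
observed_sd = math.sqrt(sum((c - observed_mean) ** 2 for c in per_umi) / N_UMIS)

print(f"expected mean {expected_mean:.1f}, sd {expected_sd:.1f}")
print(f"observed mean {observed_mean:.1f}, sd {observed_sd:.1f}")
```

Under this model the spread of counts is narrow relative to the mean, so a long right tail in the real data points to something other than sampling noise.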

donquijotes 07-28-2015 09:02 AM

Hi sudders,

From my very limited experience with random barcodes, I've heard that this bias could be due to 2 reasons.
a) Synthesis bias. Some oligo companies suggest manually mixing the 4 bases when they do random synthesis. One rep once told me that they have seen 20-30% variation/bias in base incorporation with automatic mixing.
b) Ligation bias. Some bases/sequences ligate less efficiently to your library.

BTW, I would like to design a 5-base random barcode and have it at both ends of my DNA library. Do you know of any open-source pipeline that would help me remove PCR duplicates using both paired-end barcodes? (4^10, or about 1 million, combinations in total.)

nucacidhunter 07-28-2015 06:25 PM

An easy option is to use 6-base indexes from a well-established kit and add 6 random bases directly after the index. With a 12-cycle index read, one can then identify PCR duplicates from those 6 random bases. For a more diverse UMI, one can add 8 random bases and extend the index read to 14 cycles. Obviously, this applies only to Y adapters, and the library must be amplified with short P5 and P7 primers.

donquijotes 07-30-2015 06:46 AM

Hi nucacidhunter,

Thank you for your input. What I've seen many people do is read the UMI as part of the DNA library insert rather than as part of the index. I still don't know how Agilent does this with their HaloPlex HS.

I've seen a few tools that can find the UMI and add it to the read header (TagDust2 and MIGEC, for example), but I don't have a clue how to proceed from that point and get rid of PCR duplicates...

Any idea? I'll start a thread since I am rather clueless with the whole procedure.

mikesh 08-11-2015 12:01 PM

Hi donquijotes,

Sorry for the late reply; I don't check this forum very often.

First, in our practice we integrate UMIs using RT-PCR template switching. We don't see severe synthesis biases in UMI sequences (see figure B). Note that there is another good study covering possible biases in UMI-based sequencing (see

Indeed, the UMI usage distribution is log-normal. We observe this in all our datasets and attribute it to PCR amplification. Once you append the read mapping position to your UMI in the header, you can assemble consensus sequences and forget about raw reads, counting only UMIs.

Unfortunately, it is not possible to assemble reads in MIGEC based on UMI+position, as it was designed for amplicon libraries.

With 10-12 bp, the diversity of UMIs would be about 10^6 to 1.7x10^7 (4^10 to 4^12). If you estimate that to be >> the number of starting molecules, you can simply run the "Checkout" and "Assemble" routines of MIGEC (see docs here) to get a list of assembled consensuses.
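For reference, the number of possible UMIs of length n is 4^n. The sketch below computes that, plus a rough check of whether the diversity is large relative to the number of starting molecules, assuming every UMI is equally likely (which, given the biases discussed in this thread, is optimistic). The function names and the molecule count are hypothetical, chosen for illustration.

```python
import math


def umi_diversity(length: int) -> int:
    """Number of distinct UMI sequences of the given length."""
    return 4 ** length


def expected_distinct_umis(n_molecules: int, length: int) -> float:
    """Expected number of distinct UMIs observed when n_molecules each
    draw a UMI uniformly at random: K * (1 - (1 - 1/K)^n)."""
    k = umi_diversity(length)
    return k * -math.expm1(n_molecules * math.log1p(-1 / k))


for length in (5, 10, 12):
    print(length, umi_diversity(length))

# Hypothetical example: 100,000 starting molecules tagged with 10-nt UMIs.
n = 100_000
distinct = expected_distinct_umis(n, 10)
print(f"about {n - distinct:.0f} of {n} molecules are lost to UMI collisions")
```

Two 5-base barcodes, one at each end, give the same diversity as a single 10-base UMI, since 4^5 x 4^5 = 4^10.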

Hope this helps,

sudders 11-23-2015 02:30 AM

I just thought people might like to know that we never did get to the bottom of the biased UMI usage, but we did find an even bigger problem - PCR and sequencing errors in UMIs.

We've created some tools for dealing with UMIs: one processes FASTQ files to move the UMI sequence from the read to the read name before mapping, and another implements a number of schemes for error-aware deduplication after mapping.
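To illustrate what error-aware deduplication can mean, here is a minimal sketch of one possible scheme; it is not taken from the tools mentioned above and may differ from the schemes they implement. At a single mapping position, UMIs are visited from most to least abundant, and any UMI within a Hamming distance of 1 of an already accepted UMI is treated as an error copy of it. The function names and the counts are made up for illustration, and all UMIs are assumed to have the same length.

```python
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umi_counts: Dict[str, int], max_dist: int = 1) -> Dict[str, int]:
    """Greedily merge each UMI into the first more abundant UMI within max_dist."""
    representatives: Dict[str, int] = {}
    # Most abundant first; ties broken alphabetically for reproducibility.
    ordered = sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for umi, count in ordered:
        for rep in representatives:
            if hamming(umi, rep) <= max_dist:
                representatives[rep] += count  # absorb the likely error copy
                break
        else:
            representatives[umi] = count  # no close neighbour: a new molecule
    return representatives


# Hypothetical read counts per UMI at one mapping position.
counts_at_position = {"ACGTA": 100, "ACGTT": 2, "TTTTT": 50, "TTTTA": 1, "GGGGG": 3}
collapsed = collapse_umis(counts_at_position)
print(collapsed)
print("molecules after error-aware deduplication:", len(collapsed))
```

Exact-match deduplication would count five molecules here, while this scheme counts three.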


