SEQanswers

05-13-2014, 09:08 AM   #1
sudders
Member
Location: Sheffield, UK
Join Date: Dec 2011
Posts: 32
Bias in unique molecular identifier usage

Hi All,

I'm analysing some iCLIP data generated following König et al., JoVE 2011.

Included in the protocol is the use of a 5-nucleotide unique molecular identifier (UMI), incorporated around the library barcode at the start of the read, thus:

UUUBBBBUU

where B is a library barcode base and U is a UMI base.
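
For readers who want to reproduce the extraction step, here is a minimal Python sketch of splitting a read according to the UUUBBBBUU layout above. The function name `split_umi_barcode` and the example read are made up for illustration; this is not the code used in the original analysis.

```python
def split_umi_barcode(read_seq, layout="UUUBBBBUU"):
    """Split the start of a read into UMI, library barcode, and insert.

    `layout` marks each leading position as U (UMI base) or B (barcode base).
    The function name and signature are hypothetical.
    """
    if len(read_seq) < len(layout):
        raise ValueError("read is shorter than the UMI/barcode prefix")
    prefix = read_seq[:len(layout)]
    # Collect the bases at U positions and at B positions separately.
    umi = "".join(base for base, kind in zip(prefix, layout) if kind == "U")
    barcode = "".join(base for base, kind in zip(prefix, layout) if kind == "B")
    return umi, barcode, read_seq[len(layout):]


# Made-up read: 9-base prefix followed by the insert.
umi, barcode, insert = split_umi_barcode("ACGTTGGCATTTTTCCCC")
print(umi, barcode, insert)  # ACGCA TTGG TTTTTCCCC
```

The UMI would typically be stored in the read name so that it survives mapping.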

I extracted and recorded the UMI for each read. After mapping, I deduplicated by removing any read that mapped to the same location and had the same UMI sequence as another read.
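
A minimal sketch of that deduplication rule, assuming each mapped read has already been reduced to a record with chromosome, strand, position and UMI. The record layout and the function name `deduplicate` are assumptions for illustration, and the rule is exact-match only (it does not tolerate errors in the UMI).

```python
def deduplicate(reads):
    """Keep the first read seen for each (chrom, strand, pos, umi) combination.

    `reads` is an iterable of dicts with the hypothetical keys
    'name', 'chrom', 'strand', 'pos' and 'umi'.
    """
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["strand"], read["pos"], read["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Made-up example: r2 duplicates r1 (same location, same UMI).
reads = [
    {"name": "r1", "chrom": "chr1", "strand": "+", "pos": 100, "umi": "ACGCA"},
    {"name": "r2", "chrom": "chr1", "strand": "+", "pos": 100, "umi": "ACGCA"},
    {"name": "r3", "chrom": "chr1", "strand": "+", "pos": 100, "umi": "TTGCA"},
    {"name": "r4", "chrom": "chr1", "strand": "+", "pos": 250, "umi": "ACGCA"},
]
print([r["name"] for r in deduplicate(reads)])  # ['r1', 'r3', 'r4']
```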

I think that if the incorporation of a UMI into a read were completely random, the number of reads carrying each UMI in each sample (after de-duplication) would follow a binomial distribution with the same mean for every UMI, which at such high read numbers should approximate a normal distribution. But this is not what I see: the distributions of UMI usage look much more log-normal than normal.
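
To put numbers on that expectation, here is a small sketch under the stated assumption that every deduplicated read draws its UMI independently and uniformly from the 4^5 = 1024 possibilities, so the count for any one UMI is Binomial(N, 1/1024). The read total in the example is made up.

```python
import math


def expected_umi_count_spread(n_reads, umi_length=5):
    """Mean, standard deviation and coefficient of variation of the number
    of reads per UMI, assuming uniform and independent UMI usage."""
    n_umis = 4 ** umi_length
    p = 1.0 / n_umis
    mean = n_reads * p
    sd = math.sqrt(n_reads * p * (1.0 - p))
    return mean, sd, sd / mean


# Hypothetical sample with 1,024,000 deduplicated reads.
mean, sd, cv = expected_umi_count_spread(1_024_000)
print(f"mean={mean:.0f} sd={sd:.1f} cv={cv:.3f}")
```

Under this model the spread around the mean is only a few percent, so a long right tail in the observed counts is not what uniform usage would predict.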

Have other people seen this? What potential biases could this introduce into downstream analysis? It feels to me that, as long as there is no interaction between fragment sequence and UMI sequence, it just means that effectively fewer independent UMIs were used, but I'd love to hear what other people think.
07-28-2015, 09:02 AM   #2
donquijotes
Junior Member
Location: Michigan
Join Date: Jul 2015
Posts: 7

Hi sudders,

From my very limited experience with random barcodes, I've heard that this bias could be due to 2 reasons.
a) Synthesis bias. Some oligo companies suggest manually mixing the 4 bases when they do random synthesis. One rep once told me that they have seen 20-30% variation/bias in base incorporation with automatic mixing.
b) Ligation bias. Some bases/sequences ligate less efficiently to your library.

BTW, I would like to design a 5-base random barcode and have it at both ends of my DNA library. Do you know of any open-source pipeline that will help me get rid of PCR duplicates using both paired-end barcodes (4^10, about 1 million combinations in total)?
07-28-2015, 06:25 PM   #3
nucacidhunter
Senior Member
Location: Iran
Join Date: Jan 2013
Posts: 1,080

An easy option is to use 6-base indexes from a well-established kit and add 6 random bases immediately after the index. With a 12-cycle index read, one can then identify PCR duplicates based on those 6 random bases. For a more diverse UMI, one can add 8 random bases and extend the index read to 14 cycles. Obviously, this applies to Y adapters, and the library must be amplified with short P5 and P7 primers.
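
As a rough illustration of how such an index read could be handled downstream, the sketch below splits a 12-cycle index read into the 6-base sample index followed by the 6 random bases. The function name and the example sequence are hypothetical.

```python
def split_index_read(index_read, index_length=6, umi_length=6):
    """Split an index read into (sample index, random UMI bases).

    Assumes the random bases directly follow the sample index, as in the
    design described above. Use umi_length=8 for a 14-cycle index read.
    """
    expected = index_length + umi_length
    if len(index_read) != expected:
        raise ValueError(f"expected {expected} bases, got {len(index_read)}")
    return index_read[:index_length], index_read[index_length:]


sample_index, umi = split_index_read("ATCACGGATTCA")
print(sample_index, umi)  # ATCACG GATTCA
```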

Last edited by nucacidhunter; 07-28-2015 at 08:09 PM.
07-30-2015, 06:46 AM   #4
donquijotes
Junior Member
Location: Michigan
Join Date: Jul 2015
Posts: 7

Hi nucacidhunter,

Thank you for your input. What I've seen many people do out there is read the UMI as part of their DNA library insert and not as part of the index. I still don't know how Agilent does this with their HaloPlex HS.

I have seen a few tools that can find the UMI and add it to the read header (TagDust2 and MIGEC, for example), but I don't have a clue how to proceed from that point and get rid of PCR duplicates...

Any idea? I'll start a thread since I am rather clueless with the whole procedure.
08-11-2015, 12:01 PM   #5
mikesh
Member
Location: California
Join Date: Jul 2012
Posts: 29

Hi donquijotes,

Sorry for the late reply; I don't check this forum very often.

First, in our practice we integrate UMIs using RT-PCR template switching. We don't see severe synthesis bias in UMI sequences (see http://www.jimmunol.org/content/194/...ml?with-ds=yes figure B). Note that there is another good study covering possible biases in UMI-based sequencing (see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3562004/).

Indeed, the UMI usage distribution is log-normal. We observe it in all our datasets and attribute it to PCR amplification. Once you append the read mapping position to your UMI in the header, you can assemble consensus sequences and forget about raw reads, counting only UMIs.
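
A minimal sketch of the UMI+position grouping described above, using a simple per-column majority vote to build each consensus. This is not MIGEC's implementation; the function name, the tuple layout of the input, and the example reads are assumptions for illustration.

```python
from collections import Counter, defaultdict


def consensus_by_umi_position(reads):
    """Group reads by (UMI, mapping position) and return a majority-vote
    consensus sequence for each group.

    `reads` is an iterable of (umi, position, sequence) tuples (hypothetical
    layout). Each consensus is truncated to the shortest read in its group.
    """
    groups = defaultdict(list)
    for umi, position, sequence in reads:
        groups[(umi, position)].append(sequence)

    consensus = {}
    for key, sequences in groups.items():
        length = min(len(s) for s in sequences)
        # For each column, take the most frequent base across the group.
        consensus[key] = "".join(
            Counter(s[i] for s in sequences).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


reads = [
    ("ACGTA", 100, "GATTACA"),
    ("ACGTA", 100, "GATTACA"),
    ("ACGTA", 100, "GATTGCA"),  # one mismatch, outvoted in the consensus
    ("TTGCA", 100, "CCCCGGG"),
]
result = consensus_by_umi_position(reads)
print(len(result), result[("ACGTA", 100)])  # 2 GATTACA
```

The number of keys in the result is the UMI count, that is, the estimated number of distinct starting molecules.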

Unfortunately, it is not possible to assemble reads in MIGEC based on UMI+position, as it was designed for amplicon libraries.

With 10-12 bp, the diversity of UMIs would be about 10^6 to 1.7x10^7 (4^10 to 4^12). If you estimate it to be >> the number of starting molecules, you can simply run the "Checkout" and "Assemble" routines of MIGEC (see docs here) to get a list of assembled consensuses.
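
The diversity of an n-base random UMI is simply 4^n, which is quick to check for the lengths mentioned in this thread. The function name is made up for illustration.

```python
def umi_diversity(n_random_bases):
    """Number of distinct UMIs available with n random bases (A, C, G, T)."""
    return 4 ** n_random_bases


for n in (5, 10, 12):
    print(n, umi_diversity(n))
# 5 1024
# 10 1048576
# 12 16777216
```

Two 5-base UMIs, one at each end of the fragment, give the same diversity as a single 10-base UMI.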

Hope this helps,
Mike

Last edited by mikesh; 08-11-2015 at 12:04 PM.
11-23-2015, 02:30 AM   #6
sudders
Member
Location: Sheffield, UK
Join Date: Dec 2011
Posts: 32

I just thought people might like to know that we never did get to the bottom of the biased UMI usage, but we did find an even bigger problem: PCR and sequencing errors in UMIs.

We've created some tools for dealing with UMIs: one processes FASTQ files to move the UMI sequence from the read to the read name before mapping, and another implements a number of schemes for error-aware deduplication after mapping.

See https://github.com/CGATOxford/UMI-tools
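
To illustrate what error-aware deduplication can mean, here is a deliberately simplified sketch that merges UMIs observed at one mapping position when they differ by a single base from a more abundant UMI. It is a generic greedy scheme written for this example and should not be read as the algorithm implemented in UMI-tools; all names and counts are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umi_counts, max_distance=1):
    """Greedily merge UMIs at one mapping position.

    UMIs are visited from most to least abundant. Each UMI is added to the
    first representative within `max_distance` mismatches, or becomes a new
    representative. Returns {representative UMI: total read count}.
    """
    representatives = {}
    ordered = sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for umi, count in ordered:
        for rep in representatives:
            if hamming(umi, rep) <= max_distance:
                representatives[rep] += count
                break
        else:
            # No existing representative is close enough.
            representatives[umi] = count
    return representatives


# Made-up counts at a single position: two low-count UMIs look like
# one-base errors of ACGTA, while TTTTT is a distinct molecule.
counts = {"ACGTA": 50, "ACGTT": 2, "ACGAA": 1, "TTTTT": 10}
print(collapse_umis(counts))  # {'ACGTA': 53, 'TTTTT': 10}
```

With exact-match deduplication these four UMIs would be counted as four molecules; after collapsing, the estimate is two.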

Ian
---
All times are GMT -8.