SEQanswers


donquijotes 07-30-2015 07:01 AM

Help with UMI (unique molecular identifiers) data processing
 
I've been browsing different papers and publications and trying to figure out what's the best way to analyze data with UMIs.
So far I have used GATK for analysis a couple of times, but other than that I have mostly worked on alternative splicing analysis, so I'm rather new to CNV calling with UMIs.
What I would like to do is have the following design.

adaptor-UMI-DNAlibraryINSERT-UMI-adaptor

The UMIs will be 5 random bases on each side.

I get the whole UMI thinking and analysis, but what I haven't found yet is software to do such analysis. I've seen a few tools that find UMIs and put them in the header of the FASTQ record, but then what? How do you bin the reads and get rid of the true PCR duplicates? Does Picard have a function for it? If I have to write my own code then I'm out of luck lol.
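
The binning step asked about here can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes the reads are already aligned and that the two 5-base UMIs have been concatenated into one 10-base string per read. The `AlignedRead` class and the `dedup_by_umi` function are hypothetical names invented for the example. Reads that share chromosome, start position, strand and UMI are treated as PCR duplicates, and the one with the highest mapping quality is kept.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class AlignedRead:
    """Hypothetical, simplified stand-in for one aligned read."""
    name: str
    chrom: str
    start: int
    strand: str
    umi: str   # assumed: both 5-base UMIs concatenated
    mapq: int


def dedup_by_umi(reads: Iterable[AlignedRead]) -> List[AlignedRead]:
    """Keep one read per (chrom, start, strand, UMI) bin."""
    best: Dict[Tuple[str, int, str, str], AlignedRead] = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand, read.umi)
        kept = best.get(key)
        # Keep the read with the highest mapping quality in each bin.
        if kept is None or read.mapq > kept.mapq:
            best[key] = read
    return list(best.values())


reads = [
    AlignedRead("r1", "chr1", 100, "+", "ACGTATTGCA", 60),
    AlignedRead("r2", "chr1", 100, "+", "ACGTATTGCA", 30),  # same bin as r1
    AlignedRead("r3", "chr1", 100, "+", "GGGCCAATTC", 60),  # same position, other UMI
    AlignedRead("r4", "chr1", 250, "+", "ACGTATTGCA", 60),  # same UMI, other position
]
unique = dedup_by_umi(reads)
```

Exact matching like this does not tolerate sequencing errors in the UMI, and for paired-end data the mate position would normally be part of the key as well.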

I know Agilent supports UMIs with their HaloPlex HS kits and their SureCall software, which is, from what I've heard, mostly a nice GUI for GATK.

Any help and guidance would be much appreciated. Newbies have the right to learn too, right?

Thank you in advance

nucacidhunter 07-30-2015 05:24 PM

The product described on the web page below uses molecular indexing, and the index sequences are given in the product manual.
http://www.biooscientific.com/Next-G...x-qRNA-Seq-Kit

They describe the analysis steps in a link on the page below:
http://www.biooscientific.com/Next-G...x-qRNA-Seq-Kit

charlescoldroom 09-29-2015 06:40 AM

I would also like to know how to handle UMIs and remove duplicate reads based on them.

I am using modified primers to have amplicon pools.

Which tools are there to find UMIs and put them in the header of each FASTQ record? How could I then process the reads?
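
As a rough idea of what such a tool does, here is a minimal Python sketch that moves a UMI from the start of a read into the read name. It assumes the UMI is the first 5 bases of the read and that appending it to the read name with an underscore is acceptable to downstream tools. Both are assumptions for illustration, and `read_fastq` and `move_umi_to_header` are hypothetical helper names.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield 4-line FASTQ records (assumes unwrapped, well-formed input)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, sep, qual


def move_umi_to_header(record: Record, umi_len: int = 5) -> Record:
    """Cut the first umi_len bases off the read and append them to its name."""
    header, seq, sep, qual = record
    umi = seq[:umi_len]
    # Keep any comment after the first space (e.g. "1:N:0:1") intact.
    name, _, comment = header.partition(" ")
    new_header = f"{name}_{umi}" + (f" {comment}" if comment else "")
    # Trim the quality string by the same amount so lengths still match.
    return new_header, seq[umi_len:], sep, qual[umi_len:]


example = io.StringIO("@read1 1:N:0\nACGTAGGGTTT\n+\nIIIIIHHHHHH\n")
processed = [move_umi_to_header(rec) for rec in read_fastq(example)]
```

For a design with a UMI on each side of the insert, the same function could be applied to both mates of a pair and the two UMIs joined into one tag.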

I have looked around, but I could not find a good step-by-step explanation. Even papers just mention that they do the analysis without explaining how.

Thanks!

danwiththeplan 10-08-2015 06:08 PM

Molecular indexing
 
Hi, you could try the script mentioned here:

http://www.biooscientific.com/Portal...A-Analysis.pdf

It's currently not working for me, but I'm in communication with the maintainer so I'll repost if I get everything working.

luc 10-08-2015 06:15 PM

A very simple approach would be to do a general deduplication of the reads with BBTools (I have not used it for this purpose, but it should be better than our in-house script), which will likely require considerable memory. Then you should trim the 5 random bases.
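
The order of operations matters here: deduplicate while the random bases are still attached, then trim. A minimal Python sketch of that idea follows. It is not how BBTools works internally; it only does exact string matching, and `dedup_then_trim` is a hypothetical name. The set of seen sequences grows with the number of unique reads, which is where the memory goes.

```python
from typing import Iterable, List


def dedup_then_trim(seqs: Iterable[str], umi_len: int = 5) -> List[str]:
    """Drop exact duplicate reads (UMI included), then trim the UMI bases."""
    seen = set()
    trimmed = []
    for seq in seqs:
        if seq in seen:
            continue  # identical UMI and insert: treated as a PCR duplicate
        seen.add(seq)
        trimmed.append(seq[umi_len:])
    return trimmed


reads = ["AAAAACGT", "AAAAACGT", "CCCCCCGT"]
kept = dedup_then_trim(reads)
```

In the example the two surviving reads have the same insert but different random bases, so they are kept as separate molecules. Trimming first would have made all three reads look identical.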

charlescoldroom 10-12-2015 06:48 AM

Thanks guys, I will check out the suggestions!

IanSudbery 11-23-2015 02:21 AM

I know this post is a few months old now, but you might like to try our UMI-tools package, which offers a range of different algorithms for deduplicating UMI sequences.

https://github.com/CGATOxford/UMI-tools
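
To give a flavour of why more than one algorithm is useful: a sequencing error in a UMI makes a duplicate look like a new molecule. The Python sketch below shows one simple, generic way to tolerate that, by merging each UMI seen at a position into a more abundant UMI that differs by at most one base. It is an illustration only and is not taken from the UMI-tools code; `hamming` and `collapse_umis` are hypothetical names.

```python
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(counts: Dict[str, int], max_dist: int = 1) -> Dict[str, int]:
    """Greedily merge low-count UMIs into similar, more abundant ones."""
    collapsed: Dict[str, int] = {}
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in collapsed:
            if hamming(umi, rep) <= max_dist:
                collapsed[rep] += n  # likely an error copy of rep
                break
        else:
            collapsed[umi] = n  # no close match: a new molecule
    return collapsed


# UMI counts observed at a single mapping position.
observed = {"ACGTA": 10, "ACGTT": 1, "GGGCC": 4}
molecules = collapse_umis(observed)
```

Here `ACGTT` is absorbed into `ACGTA`, leaving two inferred molecules instead of three.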

danwiththeplan 11-23-2015 01:26 PM

Quote:

Originally Posted by IanSudbery (Post 185204)
I know this post is a few months old now, but you might like to try our UMI-tools package, which offers a range of different algorithms for deduplicating UMI sequences.

https://github.com/CGATOxford/UMI-tools

Hi, thanks for this contribution.

I'm reading the code, and this is what it looks like to me, but am I correct in saying that this script would correctly deduplicate splice-aware mappings? That is, are reads that span splice junctions handled correctly?

sudders 03-14-2016 03:22 AM

Quote:

Originally Posted by danwiththeplan (Post 185272)
Hi, thanks for this contribution.

I'm reading the code, and this is what it looks like to me, but am I correct in saying that this script would correctly deduplicate splice-aware mappings? That is, are reads that span splice junctions handled correctly?


You've probably worked this out already, but yes, it handles splice-aware mappings.

medalofhonour 05-10-2017 01:16 PM

This group recently published a paper with a pipeline for analyzing UMI datasets. The software can be found here:

https://github.com/mikessh/mageri

cement_head 10-30-2017 08:41 AM

If you are using CLC Genomics Workbench:

https://www.qiagenbioinformatics.com...cular-indexing

Strandlife 11-07-2017 01:40 AM

You should try Strand NGS for UMI protocols.
Strand NGS is the only software to provide comprehensive, end-to-end support for multiple Unique Molecular Identifier (UMI) protocols.

A few of its features:

1. Protocol diversity. Strand NGS supports analysis of data from these UMI protocols:
i. Qiagen GeneRead®
ii. Archer VariantPlex®
iii. Rubicon Thruplex®
iv. Bioo Scientific NextFlex®
v. A robust interface to specify custom UMIs

2. End-to-end or point-to-point. Users can go from reads to variants, can start at aligned BAMs containing the BC tag, or start/end at any reasonable point in the alignment/analysis workflow.

3. Workflow diversity. Strand NGS supports UMI protocols in DNA-, RNA- and small RNA-Seq workflows

4. Somatic- and UMI-ready visualizations. The genome browser visualizes consensus read lists. Each read contains UMI-related metadata, such as family size, UMI and mate UMI. A filter allows the easy exclusion of wild-type reads. This is useful at high sequencing depths and low allele frequencies, typical of data from somatic/tumor samples.

You can get a 20-day free trial by registering here with your organization email address:
http://www.strand-ngs.com/signup/freetrial

chen@haplox.com 12-01-2017 11:59 PM

You can use fastp to preprocess UMIs in FASTQ files.

