SEQanswers > Sequencing Technologies/Companies > Illumina/Solexa
Old 10-24-2013, 09:11 AM   #1
curious.genome
Member
 
Location: USA

Join Date: Oct 2013
Posts: 10
Duplicate removal without alignment to reference genome

Hi guys,

So our HiSeq data is showing a large number of duplicate sequences. I've come across tools like Picard MarkDuplicates and samtools rmdup, which remove duplicates; however, they seem to require alignment to a reference genome and use position information to perform the removal.

Is there some way of performing duplicate removal without alignment to a reference? (We don't have a reference!) A naive pairwise comparison of all sequences to each other would probably take too much time, and it wouldn't account for localized errors either, correct? Should I store all the sequences in a hashtable and then perform a constant-time lookup for each sequence? Or am I missing an easy way of doing this?
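To be concrete, here's a rough sketch of the hashtable idea I had in mind (just a sketch, not tested at scale; the file names are placeholders, and it assumes plain uncompressed FASTQ with 4 lines per record):

```python
# Rough sketch: keep only the first occurrence of each exact sequence.
# A set gives constant-time membership checks, so this is one pass over the file.
def dedup_fastq(in_path, out_path):
    seen = set()
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # header, seq, '+', quals
            if not record[0]:  # end of file
                break
            seq = record[1].strip()
            if seq not in seen:
                seen.add(seq)
                fout.writelines(record)
                kept += 1
    return kept  # number of unique reads written
```

Memory would be the obvious catch: the set holds every distinct sequence, so for a full HiSeq run you'd probably want to store hashes of the sequences rather than the sequences themselves.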


Thanks!
Old 10-24-2013, 09:13 AM   #2
vivek_
PhD Student
 
Location: Denmark

Join Date: Jul 2012
Posts: 164

You could try a tool like fastx_collapser (part of the FASTX-Toolkit). It collapses identical sequences into unique reads and gives the count of each sequence in the header.
Old 10-24-2013, 09:17 AM   #3
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162

I believe the FASTX-Toolkit can do that with its fastx_collapser command, but it will only identify 100% matches. Mothur also has a similar command, unique.seqs, but it again only looks at 100% identity.
Old 10-24-2013, 09:24 AM   #4
curious.genome
Member
 
Location: USA

Join Date: Oct 2013
Posts: 10

That was quick! Thanks Vivek, I'll try fastx_collapser.

From a quick Google search, it looks like this tool can only collapse completely identical reads. I still need to figure out how to deal with the situation where two reads differ by only 1-2 nucleotides due to sequencing error; I would want to collapse these 'almost-duplicates' as well.


Also, just out of curiosity: in principle, fastx_collapser could do the duplicate-removal job for everyone, right, regardless of whether they have access to a reference genome? Why were MarkDuplicates and rmdup created at all? They add a time-consuming alignment step as well.
Old 10-24-2013, 09:26 AM   #5
curious.genome
Member
 
Location: USA

Join Date: Oct 2013
Posts: 10

Quote:
Originally Posted by mcnelson.phd View Post
I believe the FASTX-Toolkit can do that with its fastx_collapser command, but it will only identify 100% matches. Mothur also has a similar command, unique.seqs, but it again only looks at 100% identity.
Thanks mcnelson, I missed your reply. I'll have a go at mothur too, but like you said, I need a fix for the sub-100% matches as well.
Old 10-24-2013, 09:46 AM   #6
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162

Quote:
Originally Posted by curious.genome View Post
I still need to figure out how to deal with the situation where two reads differ by only 1-2 nucleotides due to sequencing error; I would want to collapse these 'almost-duplicates' as well.


Also, just out of curiosity: in principle, fastx_collapser could do the duplicate-removal job for everyone, right, regardless of whether they have access to a reference genome? Why were MarkDuplicates and rmdup created at all? They add a time-consuming alignment step as well.
The answer to both of those questions gets at the root of why you want to remove duplicate reads. There are two types of duplicate reads in sequencing libraries: true biological duplicates, and duplicates due to PCR (or some other bias in library preparation). The reason some tools use a mapping step is to distinguish biological duplicates from PCR duplicates. By mapping to a reference, you can look at a coverage map and determine whether a sequence is represented multiple times in your genome/transcriptome/etc. (which would explain the duplicate reads) or whether some reads were simply PCR-amplified more efficiently.

On that basis, the question I'd pose to you is: why do you want to identify and remove reads that differ by 1-2 nt? Depending on what you plan to do with your data, you may not want to remove those reads (or duplicates either, for that matter). If you really want to go that route, though, you can treat your data like amplicon data and perform sequence clustering with an identity cutoff that reflects 1 or 2 nt changes over the length of your reads (e.g. 0.99 for a 1-base difference in 100 bp reads). The best option I can recommend for doing that is probably usearch, since it's faster on HiSeq-sized datasets than other methods, but there are a number of programs (including mothur) that can do that for you.
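To illustrate the clustering idea, here's a toy sketch of greedy near-duplicate collapsing with a Hamming-distance cutoff (an illustration only; it assumes equal-length reads, and real tools like usearch use much faster heuristics and indexing rather than this quadratic scan):

```python
def hamming(a, b):
    # Number of mismatching positions between two equal-length sequences.
    return sum(x != y for x, y in zip(a, b))

def collapse_near_duplicates(reads, max_mismatches=2):
    """Greedy clustering: each read joins the first representative
    within max_mismatches, otherwise it starts a new cluster.
    Returns a list of (representative_sequence, count) pairs."""
    representatives = []
    for seq in reads:
        for i, (rep, count) in enumerate(representatives):
            if len(rep) == len(seq) and hamming(rep, seq) <= max_mismatches:
                representatives[i] = (rep, count + 1)
                break
        else:
            representatives.append((seq, 1))
    return representatives
```

Note that the cluster representative is just whichever read was seen first, so a read with an error could become the representative; real clustering tools handle this (and indels) much more carefully.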
Old 10-24-2013, 11:13 AM   #7
curious.genome
Member
 
Location: USA

Join Date: Oct 2013
Posts: 10

Quote:
Originally Posted by mcnelson.phd View Post
The answer to both of those questions gets at the root of why you want to remove duplicate reads. There are two types of duplicate reads in sequencing libraries: true biological duplicates, and duplicates due to PCR (or some other bias in library preparation). The reason some tools use a mapping step is to distinguish biological duplicates from PCR duplicates. By mapping to a reference, you can look at a coverage map and determine whether a sequence is represented multiple times in your genome/transcriptome/etc. (which would explain the duplicate reads) or whether some reads were simply PCR-amplified more efficiently.

On that basis, the question I'd pose to you is: why do you want to identify and remove reads that differ by 1-2 nt? Depending on what you plan to do with your data, you may not want to remove those reads (or duplicates either, for that matter). If you really want to go that route, though, you can treat your data like amplicon data and perform sequence clustering with an identity cutoff that reflects 1 or 2 nt changes over the length of your reads (e.g. 0.99 for a 1-base difference in 100 bp reads). The best option I can recommend for doing that is probably usearch, since it's faster on HiSeq-sized datasets than other methods, but there are a number of programs (including mothur) that can do that for you.


Thanks mcnelson! For now, all I plan to do with the reads is perform an assembly. I'm not interested in variants etc.; I planned on removing the duplicates to help the assembler. Is this a bad idea?
Old 10-24-2013, 11:17 AM   #8
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162

I'd advise against removing duplicates before your first assembly attempt, because the k-mer abundance profile may get skewed, which can cause your assemblies to be wrong. Most assemblers for Illumina data are very tolerant of errors and can do an amazing job with very poor input data. After you get your initial assembly, it might be worth seeing whether removing duplicate and/or erroneous reads improves things, but what often happens is that you get a poorer assembly because you've reduced the volume of data you have to work with, and thus the assembler has a harder time correctly resolving the graph.
Old 10-24-2013, 10:41 PM   #9
curious.genome
Member
 
Location: USA

Join Date: Oct 2013
Posts: 10

Thanks for the sound advice, mcnelson. I'll go ahead with the assembly for now.