**vivek_** · 10-24-2013, 09:13 AM

You could try a tool like fastx_collapser. It collapses identical sequences into unique reads and gives the count of each sequence in the header.

**mcnelson.phd** · 10-24-2013, 09:17 AM

I believe the FASTX-Toolkit can do that with fastx_collapser, but it will only identify 100% matches. Mothur has a similar command, unique.seqs, but again it only looks at 100% identity.
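
For illustration, exact-duplicate collapsing of the kind described above can be sketched in a few lines of Python. This is a minimal sketch, not the actual implementation of fastx_collapser or unique.seqs; the function names and the `>rank-count` header layout are assumptions made for the example.

```python
from collections import Counter


def read_fasta(lines):
    """Yield sequences from FASTA-formatted lines (multi-line records allowed)."""
    seq = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq:
                yield "".join(seq)
            seq = []
        elif line:
            seq.append(line)
    if seq:
        yield "".join(seq)


def collapse_exact(sequences):
    """Return (sequence, count) pairs for 100%-identical reads, most abundant first."""
    return Counter(sequences).most_common()


def format_collapsed(collapsed):
    """Emit each unique sequence once, with its rank and count in the header.

    The '>rank-count' layout is a hypothetical choice for this sketch.
    """
    out = []
    for rank, (seq, count) in enumerate(collapsed, start=1):
        out.append(f">{rank}-{count}")
        out.append(seq)
    return "\n".join(out)


example = """>r1
ACGTACGT
>r2
ACGTACGT
>r3
ACGTACGA
>r4
ACGTACGT
""".splitlines()

collapsed = collapse_exact(read_fasta(example))
print(format_collapsed(collapsed))
```

Note that `ACGTACGA`, which differs from the other reads by a single base, stays as its own record: only 100% matches are merged.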

**curious.genome** · 10-24-2013, 09:24 AM

That was quick! Thanks Vivek, I'll try fastx_collapser.

From a quick Google search, it looks like this tool can only collapse completely identical reads. I might still want to figure out how to deal with a situation where two reads differ by only 1-2 nucleotides due to error. I would want to collapse these 'almost-duplicates' as well.

Also, just out of curiosity: in principle, fastx_collapser could do the duplicate removal job for everyone, right, regardless of whether they have access to a reference genome or not? Why were the tools MarkDuplicates and rmdup created anyway? They have an additional, time-consuming alignment step as well.

**curious.genome** · 10-24-2013, 09:26 AM

Thanks mcnelson, I missed your reply. I'll have a go at mothur too, but like you said, I need a fix for the sub-100% matches as well.

**mcnelson.phd** · 10-24-2013, 09:46 AM

Originally posted by curious.genome:

I might still want to figure out how to deal with a situation where two reads differ by only 1-2 nucleotides due to error. I would want to collapse these 'almost-duplicates' as well.

Also, just out of curiosity: in principle, fastx_collapser could do the duplicate removal job for everyone, right, regardless of whether they have access to a reference genome or not? Why were the tools MarkDuplicates and rmdup created anyway? They have an additional, time-consuming alignment step as well.

The answer to both questions gets at the root of why you would want to remove duplicate reads. There are two types of duplicate reads in sequencing libraries: true biological duplicates, and those that arise from PCR (or some other bias in library preparation). Some tools use a mapping step in order to tell the two apart. By mapping to a reference, you can look at a coverage map and determine whether a sequence is represented multiple times in your genome/transcriptome/etc., which would explain the duplicate reads, or whether some reads were simply amplified more efficiently during PCR.

On that basis, the question I'd pose to you is: why do you want to identify and remove reads that differ by 1-2 nt? Depending on what you plan to do with your data, you may not want to remove those reads (or exact duplicates, for that matter). If you really want to go that route, though, you can treat your data like amplicon data and perform sequence clustering with an identity cutoff that reflects 1 or 2 nt changes over the length of your reads (e.g. 0.99 for 1 base difference in 100 bp reads). The best option I can recommend is probably usearch, since it's faster on HiSeq-sized datasets than other methods, but a number of programs (including mothur) can do this for you.
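
To make the identity-cutoff arithmetic concrete, here is a rough Python sketch. It is not how usearch or mothur cluster sequences; it is a toy greedy scheme that assumes equal-length reads and substitution errors only (no indels), and all function names are hypothetical. It compares every read against every cluster seed, so it would not scale to HiSeq-sized data.

```python
from collections import Counter


def identity_cutoff(read_length, max_mismatches):
    """Identity threshold that tolerates max_mismatches over read_length bases."""
    return (read_length - max_mismatches) / read_length


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def collapse_near_duplicates(reads, max_mismatches=1):
    """Greedily merge reads into the most abundant sequence within max_mismatches.

    Assumption: only substitutions are considered, and reads of different
    lengths are never merged.
    """
    seeds = []  # each entry is [seed_sequence, total_count]
    for seq, count in Counter(reads).most_common():
        for seed in seeds:
            if len(seed[0]) == len(seq) and hamming(seed[0], seq) <= max_mismatches:
                seed[1] += count
                break
        else:
            # No existing seed is close enough, so this read starts a new cluster.
            seeds.append([seq, count])
    return [(seq, count) for seq, count in seeds]


reads = ["ACGTACGTAC"] * 3 + ["ACGTACGTAA", "TTTTACGTAC"]
clusters = collapse_near_duplicates(reads, max_mismatches=1)
print(identity_cutoff(100, 1))
print(clusters)
```

In this toy input, the read with one substitution is absorbed into the abundant sequence, while the read with three differences is kept as a separate cluster.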

**curious.genome** · 10-24-2013, 11:13 AM

Thanks mcnelson! For now, all I plan to do with the reads is perform assembly. I am not interested in variants, etc. I planned on removing the duplicates to aid the assembler. Is this a bad idea?

**mcnelson.phd** · 10-24-2013, 11:17 AM

I'd advise against removing duplicates the first time you do the assembly, because the k-mer abundance profile may get skewed, which can cause your assemblies to be wrong. Most assemblers for Illumina data are very tolerant of errors, and they can do an amazing job with very poor input data. After you get your initial assembly, it might be worth seeing whether removing duplicate and/or erroneous reads improves things, but what often happens is that you get a poorer assembly because you've reduced the volume of data you have to work with, and thus the assembler has a harder time trying to correctly resolve the graph.
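
The k-mer skew is easy to see on a toy example. The sketch below is only an illustration with made-up reads and hypothetical function names; it does not model any particular assembler. It counts k-mers before and after collapsing identical reads and compares the abundance histograms.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct k-mers with that abundance."""
    return Counter(counts.values())


# Made-up data: one read seen three times, plus one overlapping read.
reads = ["ACGTAC"] * 3 + ["CGTACG"]

before = kmer_counts(reads, k=4)
after = kmer_counts(set(reads), k=4)  # identical reads collapsed to one copy

print(sorted(before.items()))
print(sorted(after.items()))
print(abundance_histogram(before), abundance_histogram(after))
```

After collapsing, every k-mer has an abundance of 1 or 2, so the difference between the heavily covered k-mers and the singleton is largely flattened. If that extra coverage reflected a real feature of the genome rather than PCR, the assembler has lost that information.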

**curious.genome** · 10-24-2013, 10:41 PM

Thanks for the sound advice, mcnelson. I'll go ahead with the assembly for now.

Duplicate removal without alignment to reference genome
