12-05-2017, 05:58 AM   #81
silask
Junior Member

Location: Switzerland

Join Date: Oct 2017
Posts: 4

Thank you, GenoMax, for the quick answer.

OK, I see. The Process guide doesn't mention Clumpify. If I use Clumpify before trimming, I don't have the problem of single-end vs. paired-end duplicates. However, Clumpify still requires that the reads be exactly the same length, which is even stranger given that the nucleotides at the ends of the reads are likely to be trimmed away in the subsequent trimming step anyway.

On a test set with two paired-end raw reads that are normally detected as duplicates, I can prevent the reads from being marked as duplicates just by removing one nt from the end.
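For example, something like this, run in that order (file names and trim settings are placeholders, not my exact command):

Code:
# 1) Deduplicate first, while R1/R2 still have their full, equal lengths
clumpify.sh in=raw_R1.fq.gz in2=raw_R2.fq.gz out=dedup_R1.fq.gz out2=dedup_R2.fq.gz dedupe=t

# 2) Then adapter/quality trimming; adapters.fa is a placeholder path
bbduk.sh in=dedup_R1.fq.gz in2=dedup_R2.fq.gz out=trim_R1.fq.gz out2=trim_R2.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=r trimq=10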
12-05-2017, 06:11 AM   #82
GenoMax
Senior Member

Location: East Coast USA

Join Date: Feb 2008
Posts: 6,578


Quote:
On a test set with two paired-end raw reads that are normally detected as duplicates, I can prevent the reads from being marked as duplicates just by removing one nt from the end.
I am not sure exactly what you are referring to. Clumpify by default allows two substitutions (errors, if you will). If you want strict matching, use dupesubs=0. Can you include the command-line options you are using?
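For example, something along these lines (file names are placeholders):

Code:
# dupesubs=0 requires duplicates to match exactly, with no substitutions allowed
clumpify.sh in=raw_R1.fq.gz in2=raw_R2.fq.gz out=dedup_R1.fq.gz out2=dedup_R2.fq.gz dedupe=t dupesubs=0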

12-05-2017, 07:27 AM   #83
silask
Junior Member

Location: Switzerland

Join Date: Oct 2017
Posts: 4


Sorry. For example, I have two reads that are 250 and 251 nt long and otherwise identical.
Clumpify doesn't mark them as duplicates, even with dupesubs=2. I would say the reads are duplicates; what do you think?
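Concretely, a command along these lines shows it (test.fq is a placeholder, not my actual file):

Code:
# Two reads that are identical except for one extra 3' base are not flagged,
# even though up to two substitutions are tolerated
clumpify.sh in=test.fq out=clumped.fq dedupe=t dupesubs=2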
12-05-2017, 07:56 AM   #84
GenoMax
Senior Member

Location: East Coast USA

Join Date: Feb 2008
Posts: 6,578


Interesting point. I have always worked with data of uniform length. Based on what you have discovered, Clumpify does seem to have an underlying requirement/assumption that the reads are all of equal length.

Two options come to mind:

1. You could trim the extra base off the end of the 251 bp reads to make them 250 bp using bbduk.sh (see the sketch after the dedupe.sh help below).
2. You could try dedupe.sh, which can match subsequences:
Code:
dedupe.sh

Written by Brian Bushnell and Jonathan Rood
Last modified March 9, 2017

Description:  Accepts one or more files containing sets of sequences (reads or scaffolds).
Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.
Can also find overlapping sequences and group them into clusters.
Please read bbmap/docs/guides/DedupeGuide.txt for more information.

Usage:     dedupe.sh in=<file or stdin> out=<file or stdout>
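For option 1, a minimal sketch using bbduk.sh's force-trim flags (file names are placeholders; ftr is 0-based, so ftr=249 keeps the first 250 bases):

Code:
# Force all reads down to at most 250 bp
bbduk.sh in=raw_R1.fq.gz in2=raw_R2.fq.gz out=trim_R1.fq.gz out2=trim_R2.fq.gz ftr=249

# Alternatively, ftm=5 trims each read's length to a multiple of 5 (251 -> 250),
# a common fix for the extra base some Illumina runs produce
bbduk.sh in=raw_R1.fq.gz in2=raw_R2.fq.gz out=trim_R1.fq.gz out2=trim_R2.fq.gz ftm=5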