SEQanswers

Old 12-05-2017, 04:58 AM   #81
silask
Junior Member
 
Location: Switzerland

Join Date: Oct 2017
Posts: 7

Thank you, GenoMax, for the quick answer.

OK, I see. The preprocessing guide doesn't mention clumpify. If I use clumpify before trimming, I don't have the problem of single-end vs. paired-end duplicates. However, clumpify still requires the reads to be exactly the same length, which is all the more strange since the nucleotides at the ends of the reads will likely be trimmed away in the subsequent trimming step anyway.

On a test set with two paired-end raw reads that are normally detected as duplicates, I can prevent the reads from being marked as duplicates by removing just one nt from the end.
Old 12-05-2017, 05:11 AM   #82
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780

Quote:
On a test set with two paired-end raw reads that are normally detected as duplicates, I can prevent the reads from being marked as duplicates by removing just one nt from the end.
I am not sure exactly what you are referring to. Clumpify by default allows two substitutions (errors, if you will). If you want strict matching, use dupesubs=0. Can you include the command-line options you are using?
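For what it's worth, a strict-matching run might look like the sketch below. The file names are placeholders, not from this thread, and the command is echoed as a dry run so it can be inspected before executing:

```shell
# Strict duplicate matching: dupesubs=0 turns off clumpify's default
# allowance of two substitutions between reads considered duplicates.
# File names are placeholders; remove the echo to actually run it.
echo clumpify.sh in1=r1.fastq.gz in2=r2.fastq.gz \
    out1=deduped_1.fastq.gz out2=deduped_2.fastq.gz \
    dedupe dupesubs=0
```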

Last edited by GenoMax; 04-18-2018 at 03:24 AM.
Old 12-05-2017, 06:27 AM   #83
silask
Junior Member
 
Location: Switzerland

Join Date: Oct 2017
Posts: 7

Sorry. For example, I have two reads that are 250 and 251 nt long and otherwise identical.
Clumpify doesn't mark them as duplicates even with dupesubs=2. I would say the reads are duplicates; what do you think?
Old 12-05-2017, 06:56 AM   #84
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780

Interesting point. I have always worked with data of uniform length. Based on what you have discovered, clumpify does seem to have an underlying assumption that all reads are of equal length.

Two options come to mind:

1. You could trim the extra base off the end of the 251 bp reads to make them 250 bp using bbduk.sh.
2. You could try dedupe.sh, which can match subsequences.
Code:
dedupe.sh

Written by Brian Bushnell and Jonathan Rood
Last modified March 9, 2017

Description:  Accepts one or more files containing sets of sequences (reads or scaffolds).
Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.
Can also find overlapping sequences and group them into clusters.
Please read bbmap/docs/guides/DedupeGuide.txt for more information.

Usage:     dedupe.sh in=<file or stdin> out=<file or stdout>
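Option 1 above can be sketched with bbduk's forcetrimright (ftr) parameter. The file names below are placeholders, and the command is echoed as a dry run:

```shell
# ftr=249 (forcetrimright) keeps bases 0-249, i.e. at most 250 bp per
# read, so a 251 bp read loses exactly its last base while 250 bp
# reads are left unchanged. Remove the echo to actually run it.
echo bbduk.sh in1=reads_1.fq.gz in2=reads_2.fq.gz \
    out1=trimmed_1.fq.gz out2=trimmed_2.fq.gz ftr=249
```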
Old 07-21-2018, 12:46 AM   #85
kokyriakidis
Member
 
Location: Thessaloniki, Greece

Join Date: Jul 2018
Posts: 12

After running clumpify with in1/in2 and out1/out2 to remove duplicates, it seems to produce only one output file, breaking the pipeline downstream! Why does this happen?

./bbmap/clumpify.sh in1=./Preproccesing/${ERR}/${ERR}_1_1.fastq.gz in2=./Preproccesing/${ERR}/${ERR}_2_1.fastq.gz out1=./Preproccesing/${ERR}/${ERR}_1_optical.fastq.gz out2=./Preproccesing/${ERR}/${ERR}_2_optical.fastq.gz dedupe=true optical=true overwrite=true

------

Reset INTERLEAVED to false because paired input files were specified.
Set INTERLEAVED to false
Input is being processed as paired
Writing interleaved.
Made a comparator with k=31, seed=1, border=1, hashes=4
Time: 22.512 seconds.
Reads Processed: 13371k 593.99k reads/sec
Bases Processed: 1145m 50.88m bases/sec
Executing clump.KmerSort3 [in1=./Preproccesing/ERR522065/ERR522065_1_optical_clumpify_p1_temp%_10a607a7b7090ec6.fastq.gz, in2=, out=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, out2=, groups=11, ecco=f, addname=false, shortname=f, unpair=f, repair=false, namesort=false, ow=true]

------

java -Djava.library.path=/mnt/scratchdir/home/kyriakidk/KIWI/bbmap/jni/ -ea -Xmx33412m -Xms33412m -cp /mnt/scratchdir/home/kyriakidk/KIWI/bbmap/current/ jgi.BBDukF in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz
Executing jgi.BBDukF [in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz]
Version 38.11

No output stream specified. To write to stdout, please specify 'out=stdout.fq' or similar.
Exception in thread "main" java.lang.RuntimeException: Can't read file './Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz'

Last edited by kokyriakidis; 07-21-2018 at 12:51 AM.
Old 07-21-2018, 04:16 AM   #86
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780

It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
Old 07-21-2018, 04:21 AM   #87
kokyriakidis
Member
 
Location: Thessaloniki, Greece

Join Date: Jul 2018
Posts: 12

Quote:
Originally Posted by GenoMax
It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
Yes! All the files are in the same folder! Actually, neither clumpify with dedupe/optical nor filterbytile works, so I have to remove them in order to complete my pipeline...
Old 07-21-2018, 05:31 AM   #88
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780

Are you using the latest version of BBMap? Have you tried to run a test with actual file names instead of shell variables?
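One way to check the shell-variable expansion is to test the expanded paths directly before handing them to clumpify.sh. A small sketch, with ERR and the directory layout taken from the command posted earlier in the thread:

```shell
# Verify that the expanded input paths actually exist before running
# clumpify.sh; a "missing" line here would explain a downstream
# "Can't read file" error.
ERR="ERR522065"
for f in "./Preproccesing/${ERR}/${ERR}_1_1.fastq.gz" \
         "./Preproccesing/${ERR}/${ERR}_2_1.fastq.gz"; do
    if [ -e "$f" ]; then
        echo "found: $f"
    else
        echo "missing: $f"
    fi
done
```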
Old 07-23-2018, 08:19 AM   #89
kokyriakidis
Member
 
Location: Thessaloniki, Greece

Join Date: Jul 2018
Posts: 12

I am using the latest version of BBTools. I can't get it to work.
Old 07-23-2018, 10:15 AM   #90
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780

I am able to do something like

Code:
for i in `ls -1 *_1*.fastq | sed 's/_1.fastq//'`; do
    clumpify.sh -Xmx10g in1=${i}_1.fastq in2=${i}_2.fastq \
        out1=${i}_clu_1.fastq out2=${i}_clu_2.fastq
done
and have clumpify.sh produce two files. I am not sure why you are having trouble.
Old 07-23-2018, 11:02 AM   #91
kokyriakidis
Member
 
Location: Thessaloniki, Greece

Join Date: Jul 2018
Posts: 12

I am having trouble using clumpify with the optical + dedupe parameters to remove optical duplicates, e.g. clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical. Clumpify without these parameters works.
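For reference, an optical-duplicate run is usually written with an explicit dupedist. The value below is only illustrative (the right setting depends on the sequencer), and the command is echoed as a dry run. Note that optical mode needs Illumina-style read names carrying tile/x/y coordinates, which SRA-renamed reads often lack:

```shell
# optical=true only marks duplicates whose flowcell coordinates are
# within dupedist of each other; dupedist=40 is an illustrative value,
# not a recommendation. Remove the echo to actually run it.
echo clumpify.sh in=temp.fq.gz out=clumped.fq.gz \
    dedupe optical dupedist=40
```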

Tags
bbduk, bbmap, bbmerge, clumpify, compression, pigz, reformat, tadpole
