Unconfigured Ad

**Brian Bushnell** · 10-01-2014, 01:29 PM

Deduplication of reads is possible with dedupe, but it requires a lot of memory (~1kb per read) so it is only practical with HiSeq data if you have high-memory nodes. If you have a reference, deduplication based on mapping will use much less memory, though it will also be much slower. And whether or not deduplication is a good idea depends on the experiment; i.e., probably not for quantitative RNA-seq, and also probably not for unamplified libraries in general.

I suggest removing adapters before deduplication since adapter removal is less sensitive to errors, and identical reads will remain identical after adapter removal. And I suggest contaminant removal before quality trimming, if you do so with BBDuk (kmer-based) because it will be more sensitive. You can add phix/other contaminant removal into step 7 - BBDuk can do contaminant removal and quality trimming in one pass, and the contaminants are looked for prior to trimming each read.

So, I would reorder it like this:

1. Initial quality control
2. Quality and/or length-based reads discarding; trimming/discarding of N-containing reads
3. Adapter removal
4. phix/contaminant removal
5. [optional] Error-correction
6. [optional] Deduplication (depending on experiment)
7. Merging of PE reads
8. Quality trimming + phix/contaminant removal
9. Contamination check

As for the order of quality trimming and merge, they could be done either way. But, BBMerge internally performs quality-trimming for reads that don't merge prior to quality trimming, so for that particular program, quality-trimming after merging tends to be better.

Note that after merging, you will have 2 sets of reads (merged and unmerged) that must be processed independently, since not all of them will successfully merge.

**TauOvermind** · 10-01-2014, 02:05 PM

Brian, thank you very much for your comprehensive answers and also for your BBtools package, it is very good and allows to do almost anything with NGS data.

Thank you for your advice regarding deduplication, but I will probably try to deduplicate my reads anyway, as I work with a metagenomic dataset and, due to a limiting step down the analysis pipeline, have to minimise the number of reads.

May I ask you to clarify the meaning of

Code:

trimbyoverlap

flag in bbduk? What is the mechanism of trimming adapters when this flag is set to true, does bbduk rely completely on reads' sequences in this situation or still use reference file? If I use it, will I be able to merge my PE reads afterwards?

Just wondering, are you planning to publish a paper describing the toolkit any time soon? I think that it deserves to be better known to researchers, I found it just by chance, looking for alternatives to nesoni and trimmomatic.

**Brian Bushnell** · 10-01-2014, 02:21 PM

Originally posted by TauOvermind View Post

Brian, thank you very much for your comprehensive answers and also for your BBtools package, it is very good and allows to do almost anything with NGS data.

You're welcome!

May I ask you to clarify the meaning of

Code:

trimbyoverlap

flag in bbduk? What is the mechanism of trimming adapters when this flag is set to true, does bbduk rely completely on reads' sequences in this situation or still use reference file? If I use it, will I be able to merge my PE reads afterwards?

"trimbyoverlap" or "tbo" is intended for use alongside a reference, when trimming normal PE fragment libraries. Normally, if you specify something like "k=25 mink=12" then adapters will be trimmed starting whenever a 25-mer match is detected; at the very end of the read, it will look for as little as a 12-mer match. But what if you have a read ending in only 2bp of adapter sequence? That will be missed. The "tbo" flag attempts to overlap the two reads. If you have 2x100bp reads, and they overlap perfectly in an orientation indicating a 98bp insert size, that means the last 2 bp of each read is adapter sequence. So, in practice, "k=25 mink=12" will not trim adapter sequences shorter than 12bp, but "tbo" can trim down to 1bp of adapter, as long as the reads have a low error rate so the overlap can be detected. I usually pair this with the "tpe" flag ("trimpairsequally") because with fragment libraries, the adapter is expected to be in the same position for each read; so if it is detected in one read but not the other (because one read has errors) they will still both get trimmed to the same length. Thus, the settings I generally use for paired fragment libraries (assuming the reads are interleaved):

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=25 mink=12 hdist=1 tbo tpe

You should NOT use the "tbo" or "tpe" flags for libraries with other protocols, such as long-mate-pair libraries where maybe one read contains an adapter and the other doesn't; it's just for normal paired libraries, and greatly increases sensitivity.

You can still merge the reads afterward. BBDuk will not actually merge the reads, it just finds overlaps to calculate the insert size.

Just wondering, are you planning to publish a paper describing the toolkit any time soon? I think that it deserves to be better known to researchers, I found it just by chance, looking for alternatives to nesoni and trimmomatic.

I certainly plan to publish it and hopefully will be able to do so soon (particularly since I have a collaborator now); it's mainly a problem of free time.

**TauOvermind** · 10-01-2014, 02:31 PM

Thank you, Brian, everything is clear now. Good luck with the paper!

**TauOvermind** · 10-02-2014, 07:39 AM

EDIT: Sorry, it seems that problem disappeared after updating to BBtools v33.54.

Brian, may I ask for your help once again? I am getting these strange errors when I run dedupe.sh (I use BBtools v33.51):

Code:

dedupe.sh in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22936m -Xms22936m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
Executing jgi.Dedupe [in=1pW_interleaved.fastq, out=1pW_dd.fastq, outd=1pW_dupes.fastq, ac=f, interleaved=t, overwrite=t]

Set INTERLEAVED to true
Initial:
Memory: free=22687m, used=361m

Found 4 duplicates.
Finished exact matches.    Time: 0.357 seconds.
Memory: free=21124m, used=1924m

Input:                  	194776 reads 		48427883 bases.
Duplicates:             	4 reads (0.00%) 	836 bases (0.00%)     	39916 collisions.
Result:                 	194772 reads (100.00%) 	48427047 bases (100.00%)

Exception in thread "main" java.lang.AssertionError: 97386, 194772
	at jgi.Dedupe.addToArray(Dedupe.java:1331)
	at jgi.Dedupe.writeOutput(Dedupe.java:1342)
	at jgi.Dedupe.process2(Dedupe.java:531)
	at jgi.Dedupe.process(Dedupe.java:430)
	at jgi.Dedupe.main(Dedupe.java:57)

It seems that the script doesn't want to write the deduplicated interleaved PE .fastq file to the disk, but it works flawlessly when I use it to deduplicate just one of the PE files:

Code:

dedupe.sh in=1pW_R1_100000.fastq  ac=f interleaved=f overwrite=t
java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22931m -Xms22931m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_R1_100000.fastq ac=f interleaved=f overwrite=t
Executing jgi.Dedupe [in=1pW_R1_100000.fastq, ac=f, interleaved=f, overwrite=t]

Set INTERLEAVED to false
Initial:
Memory: free=22803m, used=240m

Found 32 duplicates.
Finished exact matches.    Time: 0.251 seconds.
Memory: free=21360m, used=1683m

Input:                  	100000 reads 		24971771 bases.
Duplicates:             	32 reads (0.03%) 	8918 bases (0.04%)     	0 collisions.
Result:                 	99968 reads (99.97%) 	24962853 bases (99.96%)

Time:   			0.262 seconds.
Reads Processed:        100k 	381.80k reads/sec
Bases Processed:      24971k 	95.34m bases/sec

or when I omit setting output file:

Code:

dedupe.sh in=1pW_interleaved.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22936m -Xms22936m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_interleaved.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
Executing jgi.Dedupe [in=1pW_interleaved.fastq, outd=1pW_dupes.fastq, ac=f, interleaved=t, overwrite=t]

Set INTERLEAVED to true
Initial:
Memory: free=22687m, used=361m

Found 4 duplicates.
Finished exact matches.    Time: 0.367 seconds.
Memory: free=21124m, used=1924m

Input:                  	194776 reads 		48427883 bases.
Duplicates:             	4 reads (0.00%) 	836 bases (0.00%)     	39916 collisions.
Result:                 	194772 reads (100.00%) 	48427047 bases (100.00%)

Time:   			0.376 seconds.
Reads Processed:        194k 	517.84k reads/sec
Bases Processed:      48427k 	128.75m bases/sec

**Brian Bushnell** · 10-02-2014, 07:57 AM

Sorry about that; there was an incorrect assertion checking for the size of an array that assumed unpaired reads. It's fixed in the latest version (33.54). Alternately, you can add the "-da" flag to disable assertions, but I don't recommend that.

**TauOvermind** · 10-02-2014, 08:01 AM

Thank you, Brian, I updated to the latest version and everything works perfectly now.

Topics	Statistics	Last Post
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 12 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 46 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 106 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 125 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM

Unconfigured Ad

Illumina PE reads optimal pre-processing sequence

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News