SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extent of reduction during pre-processing cerebralrust Bioinformatics 4 05-01-2013 04:17 AM
Discussion on pre-processing Illumina data lflc Bioinformatics 1 10-17-2012 03:51 AM
Pre processing of Illumina data tusharbiot Bioinformatics 5 04-05-2012 09:03 AM
when do you pre-process Illumina reads before analysis? PFS Bioinformatics 15 04-28-2011 03:06 PM

Reply
 
Thread Tools
Old 10-01-2014, 11:10 AM   #1
TauOvermind
Member
 
Location: UK

Join Date: Jul 2012
Posts: 14
Default Illumina PE reads optimal pre-processing sequence

Hello, everyone. I am trying to establish an optimal sequence of Illumina PE reads pre-processing steps and I have come up with the following protocol so far (the best tool for the step, in my opinion, is given in parentheses):
  1. Initial quality control (FastQC)
  2. Quality and/or length-based reads discarding; trimming/discarding of N-containing reads (bbduk.sh from BBMap/BBtools package)
  3. Deduplication (dedupe.sh from BBMap/BBtools package)
  4. Adapters removal (bbduk.sh from BBMap/BBtools package)
  5. [optional] Error-correction (ecc.sh from BBMap/BBtools package)
  6. Merging of PE reads (bbmerge.sh from BBMap/BBtools package)
  7. Hard and/or soft quality trimming (bbduk.sh from BBMap/BBtools package)
  8. Contamination check (Fastq Screen, maybe bbduk.sh from BBMap/BBtools package)

I am not completely sure about the best order of several steps; for instance, what is it better to do first, deduplication or adapters removal? Also, isn't it better to conduct quality trimming first and then merge PE reads? Thank you in advance.
TauOvermind is offline   Reply With Quote
Old 10-01-2014, 01:29 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Deduplication of reads is possible with dedupe, but it requires a lot of memory (~1kb per read) so it is only practical with HiSeq data if you have high-memory nodes. If you have a reference, deduplication based on mapping will use much less memory, though it will also be much slower. And whether or not deduplication is a good idea depends on the experiment; i.e., probably not for quantitative RNA-seq, and also probably not for unamplified libraries in general.

I suggest removing adapters before deduplication since adapter removal is less sensitive to errors, and identical reads will remain identical after adapter removal. And I suggest contaminant removal before quality trimming, if you do so with BBDuk (kmer-based) because it will be more sensitive. You can add phix/other contaminant removal into step 7 - BBDuk can do contaminant removal and quality trimming in one pass, and the contaminants are looked for prior to trimming each read.

So, I would reorder it like this:

1. Initial quality control
2. Quality and/or length-based reads discarding; trimming/discarding of N-containing reads
3. Adapter removal
4. phix/contaminant removal
5. [optional] Error-correction
6. [optional] Deduplication (depending on experiment)
7. Merging of PE reads
8. Quality trimming + phix/contaminant removal
9. Contamination check

As for the order of quality trimming and merge, they could be done either way. But, BBMerge internally performs quality-trimming for reads that don't merge prior to quality trimming, so for that particular program, quality-trimming after merging tends to be better.

Note that after merging, you will have 2 sets of reads (merged and unmerged) that must be processed independently, since not all of them will successfully merge.
Brian Bushnell is offline   Reply With Quote
Old 10-01-2014, 02:05 PM   #3
TauOvermind
Member
 
Location: UK

Join Date: Jul 2012
Posts: 14
Default

Brian, thank you very much for your comprehensive answers and also for your BBtools package, it is very good and allows to do almost anything with NGS data.

Thank you for your advice regarding deduplication, but I will probably try to deduplicate my reads anyway, as I work with a metagenomic dataset and, due to a limiting step down the analysis pipeline, have to minimise the number of reads.

May I ask you to clarify the meaning of
Code:
trimbyoverlap
flag in bbduk? What is the mechanism of trimming adapters when this flag is set to true, does bbduk rely completely on reads' sequences in this situation or still use reference file? If I use it, will I be able to merge my PE reads afterwards?

Just wondering, are you planning to publish a paper describing the toolkit any time soon? I think that it deserves to be better known to researchers, I found it just by chance, looking for alternatives to nesoni and trimmomatic.
TauOvermind is offline   Reply With Quote
Old 10-01-2014, 02:21 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by TauOvermind View Post
Brian, thank you very much for your comprehensive answers and also for your BBtools package, it is very good and allows to do almost anything with NGS data.
You're welcome!

Quote:
May I ask you to clarify the meaning of
Code:
trimbyoverlap
flag in bbduk? What is the mechanism of trimming adapters when this flag is set to true, does bbduk rely completely on reads' sequences in this situation or still use reference file? If I use it, will I be able to merge my PE reads afterwards?
"trimbyoverlap" or "tbo" is intended for use alongside a reference, when trimming normal PE fragment libraries. Normally, if you specify something like "k=25 mink=12" then adapters will be trimmed starting whenever a 25-mer match is detected; at the very end of the read, it will look for as little as a 12-mer match. But what if you have a read ending in only 2bp of adapter sequence? That will be missed. The "tbo" flag attempts to overlap the two reads. If you have 2x100bp reads, and they overlap perfectly in an orientation indicating a 98bp insert size, that means the last 2 bp of each read is adapter sequence. So, in practice, "k=25 mink=12" will not trim adapter sequences shorter than 12bp, but "tbo" can trim down to 1bp of adapter, as long as the reads have a low error rate so the overlap can be detected. I usually pair this with the "tpe" flag ("trimpairsequally") because with fragment libraries, the adapter is expected to be in the same position for each read; so if it is detected in one read but not the other (because one read has errors) they will still both get trimmed to the same length. Thus, the settings I generally use for paired fragment libraries (assuming the reads are interleaved):

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=25 mink=12 hdist=1 tbo tpe

You should NOT use the "tbo" or "tpe" flags for libraries with other protocols, such as long-mate-pair libraries where maybe one read contains an adapter and the other doesn't; it's just for normal paired libraries, and greatly increases sensitivity.

You can still merge the reads afterward. BBDuk will not actually merge the reads, it just finds overlaps to calculate the insert size.

Quote:
Just wondering, are you planning to publish a paper describing the toolkit any time soon? I think that it deserves to be better known to researchers, I found it just by chance, looking for alternatives to nesoni and trimmomatic.
I certainly plan to publish it and hopefully will be able to do so soon (particularly since I have a collaborator now); it's mainly a problem of free time.

Last edited by Brian Bushnell; 10-01-2014 at 02:23 PM.
Brian Bushnell is offline   Reply With Quote
Old 10-01-2014, 02:31 PM   #5
TauOvermind
Member
 
Location: UK

Join Date: Jul 2012
Posts: 14
Default

Thank you, Brian, everything is clear now. Good luck with the paper!
TauOvermind is offline   Reply With Quote
Old 10-02-2014, 07:39 AM   #6
TauOvermind
Member
 
Location: UK

Join Date: Jul 2012
Posts: 14
Default

EDIT: Sorry, it seems that problem disappeared after updating to BBtools v33.54.

Brian, may I ask for your help once again? I am getting these strange errors when I run dedupe.sh (I use BBtools v33.51):

Code:
dedupe.sh in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22936m -Xms22936m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
Executing jgi.Dedupe [in=1pW_interleaved.fastq, out=1pW_dd.fastq, outd=1pW_dupes.fastq, ac=f, interleaved=t, overwrite=t]

Set INTERLEAVED to true
Initial:
Memory: free=22687m, used=361m

Found 4 duplicates.
Finished exact matches.    Time: 0.357 seconds.
Memory: free=21124m, used=1924m

Input:                  	194776 reads 		48427883 bases.
Duplicates:             	4 reads (0.00%) 	836 bases (0.00%)     	39916 collisions.
Result:                 	194772 reads (100.00%) 	48427047 bases (100.00%)

Exception in thread "main" java.lang.AssertionError: 97386, 194772
	at jgi.Dedupe.addToArray(Dedupe.java:1331)
	at jgi.Dedupe.writeOutput(Dedupe.java:1342)
	at jgi.Dedupe.process2(Dedupe.java:531)
	at jgi.Dedupe.process(Dedupe.java:430)
	at jgi.Dedupe.main(Dedupe.java:57)

It seems that the script doesn't want to write the deduplicated interleaved PE .fastq file to the disk, but it works flawlessly when I use it to deduplicate just one of the PE files:

Code:
dedupe.sh in=1pW_R1_100000.fastq  ac=f interleaved=f overwrite=t
java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22931m -Xms22931m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_R1_100000.fastq ac=f interleaved=f overwrite=t
Executing jgi.Dedupe [in=1pW_R1_100000.fastq, ac=f, interleaved=f, overwrite=t]

Set INTERLEAVED to false
Initial:
Memory: free=22803m, used=240m

Found 32 duplicates.
Finished exact matches.    Time: 0.251 seconds.
Memory: free=21360m, used=1683m

Input:                  	100000 reads 		24971771 bases.
Duplicates:             	32 reads (0.03%) 	8918 bases (0.04%)     	0 collisions.
Result:                 	99968 reads (99.97%) 	24962853 bases (99.96%)

Time:   			0.262 seconds.
Reads Processed:        100k 	381.80k reads/sec
Bases Processed:      24971k 	95.34m bases/sec


or when I omit setting output file:



Code:
dedupe.sh in=1pW_interleaved.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22936m -Xms22936m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_interleaved.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
Executing jgi.Dedupe [in=1pW_interleaved.fastq, outd=1pW_dupes.fastq, ac=f, interleaved=t, overwrite=t]

Set INTERLEAVED to true
Initial:
Memory: free=22687m, used=361m

Found 4 duplicates.
Finished exact matches.    Time: 0.367 seconds.
Memory: free=21124m, used=1924m

Input:                  	194776 reads 		48427883 bases.
Duplicates:             	4 reads (0.00%) 	836 bases (0.00%)     	39916 collisions.
Result:                 	194772 reads (100.00%) 	48427047 bases (100.00%)

Time:   			0.376 seconds.
Reads Processed:        194k 	517.84k reads/sec
Bases Processed:      48427k 	128.75m bases/sec

Last edited by TauOvermind; 10-02-2014 at 07:58 AM. Reason: Problem disappeared after updating to BBtools v33.54
TauOvermind is offline   Reply With Quote
Old 10-02-2014, 07:57 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Sorry about that; there was an incorrect assertion checking for the size of an array that assumed unpaired reads. It's fixed in the latest version (33.54). Alternately, you can add the "-da" flag to disable assertions, but I don't recommend that.
Brian Bushnell is offline   Reply With Quote
Old 10-02-2014, 08:01 AM   #8
TauOvermind
Member
 
Location: UK

Join Date: Jul 2012
Posts: 14
Default

Thank you, Brian, I updated to the latest version and everything works perfectly now.
TauOvermind is offline   Reply With Quote
Reply

Tags
adapter, bbduk, duplicate, fastqc, illumina

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:00 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO