View Single Post
Old 04-24-2014, 04:01 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Now for some comparative performance tests. I generated some synthetic adapter-contaminated data, using this methodology:

First, I grabbed the first million reads from some bacterial project, 150bp HiSeq. This is to get real quality distributions.

Then I made a synthetic adapter file, called "gruseq.fa", by taking the truseq adapters (in /bbmap/resources/truseq.fa) and rotating the letters: A->T, C->A, G->C, T->G. Thus they should totally unlike biological sequences or real adapter sequences, yet computationally equivalent.

Then, I added adapters to the real reads using a special program I made (also in BBTools) called AddAdapters:

addadapters.sh in=reads.fq out=dirty.fq qout=33 ref=gruseq.fa right int=f

"int=f" makes the reads treated as single-ended, to simplify things. "right" means the adapters will be 3' type. So, they will be added at a random location from 0 to 149, and possibly run off the 3' end of the read, but the read length stays at 150. If the adapter ends before the end of the read, random bases are used to fill the rest. Approximately 50% of the reads get adapters, and 50% don't. After the adapter is added, each of the adapter bases is possibly changed to a new base, with a probability from the read's quality score for that base, to simulate sequencing error.

Next, I ran 3 different programs - Trimmomatic, Cutadapt, and BBDuk, and measured their performance. This was on a 16-core, dual-socket Sandy Bridge E node with 128GB RAM, reserved for exclusive use, and only interacting with local disk, so the rest of the cluster had no impact on the tests. Hyperthreading was enabled.

Quote:
time cutadapt -m 10 -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGATACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAACTGCGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGGTCCATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGCTAATTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATATCGCTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACAATTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAATCTGATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATAGGCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACTGATCTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGTCAGGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACCAGTATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAAGGCGTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATCGATTATTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATCGGAACGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGCGATCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAAACGAAACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGAACATATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGCTTTACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGCCAAGGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGGGACCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAACGTACGTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACTCGCCTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATAGCTGTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGGAAGGGTGAGACGTGCAACGAGGAGCAGGC dirty.fq > cutadapt.fq

real 8m43.129s
user 8m39.720s
sys 0m2.090s
(523.129 seconds)

Quote:
time java -Xmx8g -jar trimmomatic-0.32.jar SE -phred33 dirty.fq trimmomatic.fq ILLUMINACLIP:gruseq.fa:2:28:10 MINLEN:10

real 0m7.418s
user 1m50.996s
sys 0m3.061s
(7.418 seconds, 170.996 cpu-seconds)


Quote:
time bbduk.sh in=dirty.fq ref=gruseq.fa ktrim=r mink=12 hdist=1 out=bbduk.fq minlen=10

real 0m1.683s
user 0m9.926s
sys 0m0.813s
(1.683 seconds, 9.926 cpu-seconds)

So, Cutadapt is extremely slow, and BBDuk takes both the speed and efficiency wins by a large margin. Of course, accuracy is more important than speed, so I graded the results. addadapters.sh already replaced each read's original name with a synthetic name indicating its original length and the length that the read should be after trimming. For example, "@0_150_11" means that the read was originally 150bp, but should be 11bp after trimming, because an adapter was added at position 11 (0-based). This allows quantification of both the number of bases correctly and incorrectly removed. Ideally, "Adapters Remaining" and "Non-Adapter Removed" should both be zero after trimming. For reference, this is what the untrimmed file looks like:

Quote:
addadapters.sh in=dirty.fq grade

Total output: 1000000 reads 150000000 bases
Perfectly Correct (% of output): 499861 reads (49.986%) 74979150 bases (49.986%)
Incorrect (% of output): 500139 reads (50.014%) 75020850 bases (50.014%)

Adapters Remaining (% of adapters): 500139 reads (100.000%) 37776278 bases (25.184%)
Non-Adapter Removed (% of valid): 0 reads (0.000%) 0 bases (0.000%)
Roughly 50% of the reads have adapters and 25% of the bases are sequence that should be removed (either adapter or random bases after the adapter).

Quote:
addadapters.sh in=cutadapt.fq grade

Total output: 984925 reads 131869936 bases
Perfectly Correct (% of output): 690175 reads (70.074%) 87863452 bases (66.629%)
Incorrect (% of output): 294750 reads (29.926%) 44006484 bases (33.371%)

Adapters Remaining (% of adapters): 275167 reads (56.728%) 19789625 bases (15.007%)
Non-Adapter Removed (% of valid): 19583 reads (1.988%) 67473 bases (0.060%)

Quote:
addadapters.sh in=trimmomatic.fq grade

Total output: 981620 reads 131483263 bases
Perfectly Correct (% of output): 630177 reads (64.198%) 85281026 bases (64.861%)
Incorrect (% of output): 351443 reads (35.802%) 46202237 bases (35.139%)

Adapters Remaining (% of adapters): 285886 reads (59.342%) 19483613 bases (14.818%)
Non-Adapter Removed (% of valid): 65557 reads (6.678%) 131182 bases (0.117%)

Quote:
addadapters.sh in=bbduk.fq grade

Total output: 966786 reads 113303242 bases
Perfectly Correct (% of output): 901541 reads (93.251%) 103689866 bases (91.515%)
Incorrect (% of output): 65245 reads (6.749%) 9613376 bases (8.485%)

Adapters Remaining (% of adapters): 65243 reads (13.973%) 1229480 bases (1.085%)
Non-Adapter Removed (% of valid): 2 reads (0.000%) 27 bases (0.000%)
BBDuk performs quite well, vastly outperforming Cutadapt and Trimommatic on every metric. Trimmomatic and Cutadapt both do quite poorly, though of the two, Cutadapt has both a higher true positive rate and a much lower false-positive rate than Trimmomatic, so takes second place in accuracy. If anyone has any other adapter-trimming tools they commonly use, please reply and I'll be happy to test them with the same methodology. Also, if anyone has suggestions for better parameters, please reply; I have no experience with either tool so I'm basically using the defaults.

All of this is testable and repeatable - you can use your own data and your own adapters, or the attached "gruseq.fa.gz" file and this qual file to replicate my results. The exact numbers will depend somewhat on the quality values of the real data, and very slightly on the organism.

P.S. You can get slightly better accuracy with two passes using different values of k and hdist, like this:

Quote:

bbduk.sh -Xmx1g in=dirty.fq out=stdout.fq ref=gruseq.fa k=27 ktrim=r hdist=2 mink=16 | bbduk.sh -Xmx1g in=stdin.fq out=clean.fq ref=gruseq.fa k=23 ktrim=r hdist=0 ow mink=10

addadapters.sh in=bbduk.fq grade


Total output: 966944 reads 113088318 bases
Perfectly Correct (% of output): 911549 reads (94.271%) 104827222 bases (92.695%)
Incorrect (% of output): 55395 reads (5.729%) 8261096 bases (7.305%)

Adapters Remaining (% of adapters): 55393 reads (11.870%) 1006866 bases (0.890%)
Non-Adapter Removed (% of valid): 2 reads (0.000%) 20 bases (0.000%)
Attached Images
File Type: png AdapterTrimSpeed.png (32.8 KB, 67 views)
Attached Files
File Type: gz gruseq.fa.gz (269 Bytes, 31 views)

Last edited by Brian Bushnell; 11-19-2015 at 09:59 AM.
Brian Bushnell is offline   Reply With Quote