Old 09-21-2015, 03:23 PM   #10
Brian Bushnell

Originally Posted by mcarson
Brian, a couple final questions. I know the "normal" JGI pipeline uses PANDAseq to merge reads so I assume BBMap is used to adapter trim and quality filter. However it seems like your bbmerge is equivalent or better especially using the qtrim2 option. Any benefits to doing one over the other (you're not biased right...)?
JGI's iTagger pipeline currently uses FLASH by default for 16S, and PANDAseq for ITS. I'm not sure why it is set up that way, but those choices were made when the pipeline was written, which was before BBMerge existed. There is a plan to switch to BBMerge when the pipeline is updated, but I don't know when that might happen.

BBMerge exceeds the accuracy of all other mergers that I've tested, which includes all that I am aware of with the exception of PANDAseq. Unfortunately, PANDAseq has very specific requirements for read header names and it will refuse to process data with headers that do not look like they are from Illumina. So, I was unable to benchmark it on synthetic data with known answers stored in the read headers, which allows the results to be verified. I contacted the author, but he does not intend to change that requirement.

My test results from a few months ago:

The different points for each program indicate the true and false positive merge rates at different confidence thresholds; upper-left corner of the graph is optimal. The dotted line indicates the highest possible correct merge rate due to overlap information alone, as only 84% of the reads actually overlapped. The points with black centers indicate default settings for the program.
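To make the two axes concrete, here is a minimal sketch of how such rates can be computed from synthetic pairs whose true insert size is known. The function name, the tuple layout, and the exact rate definitions (both expressed as a fraction of all pairs) are my assumptions for illustration, not necessarily how the benchmark above was scored.

```python
def merge_rates(pairs):
    """Return (true_positive_rate, false_positive_rate) as fractions of all pairs.

    pairs: list of (true_insert, reported_insert), where reported_insert is
    None when the merger declined to merge the pair.
    Assumption: a merge counts as correct only if the reported insert size
    equals the known true insert size.
    """
    total = len(pairs)
    correct = sum(1 for truth, got in pairs if got is not None and got == truth)
    wrong = sum(1 for truth, got in pairs if got is not None and got != truth)
    return correct / total, wrong / total


# Hypothetical toy data: two correct merges, one wrong merge, one unmerged pair.
example = [(299, 299), (299, 299), (299, 251), (450, None)]
tp_rate, fp_rate = merge_rates(example)
print(tp_rate, fp_rate)
```

Each confidence threshold for a program would give one such (false positive, true positive) pair, which is one point on the graph.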

As for whether I'm biased, well... it's hard to judge in yourself. But I don't recommend my tools in cases where I know of a better alternative.

HESmith also mentioned that the phiX should be gone since these were already demultiplexed. I'm assuming that's true or should I also do a contaminant filter?
That's correct; all of the phiX should be gone. In practice, though, sometimes the reads break and chimerically rejoin; sometimes adapters ligate to the wrong thing; and sometimes clusters are too close together, causing barcode misassignment. These can yield phiX in the demultiplexed output (as well as other cross-contamination and assorted junk). The level should be very low, but I always find some if there are enough reads.

Finally we normally get paired end reads without linkers and typically without primers. I'm assuming I should trim off the linkers as well as primers prior to bbmerge since it doesn't appear that there is an option for that (or does this not really affect alignment at all?).
Trimming adapters prior to merging will very slightly improve accuracy, as there are slightly fewer incorrect alternative answers from which to select the optimal overlap.

I think there is a sequence trim option in bbduk that I should be able to use for that. I'd assume it takes the standard nucleotide lettering system?
Yes, you can specify your own adapter sequence as a fasta file, or on the command line, e.g. "literal=ACGTTGCA...".
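As a rough sketch, the two ways of supplying the sequence could be wrapped like this. The file names and the 20-base sequence are made up for illustration, and the choice of "ktrim=l" and the "k" value are assumptions that would need tuning for real primers; the script only builds and prints the command, it does not run BBDuk.

```python
import shlex


def bbduk_trim_command(in1, in2, out1, out2, literal=None, ref=None, k=19):
    """Build (but do not run) a bbduk.sh command for trimming a known sequence.

    Exactly one of literal (sequence given on the command line) or ref
    (path to a fasta file) must be supplied.
    Assumption: left-trimming with ktrim=l suits primers at the start of reads.
    """
    if (literal is None) == (ref is None):
        raise ValueError("give exactly one of literal or ref")
    if literal is not None and k > len(literal):
        raise ValueError("k cannot be longer than the literal sequence")
    source = f"literal={literal}" if literal is not None else f"ref={ref}"
    return [
        "bbduk.sh",
        f"in1={in1}", f"in2={in2}",
        f"out1={out1}", f"out2={out2}",
        source, "ktrim=l", f"k={k}",
    ]


# Hypothetical inputs: file names and sequence are placeholders, not real data.
cmd = bbduk_trim_command(
    "reads_1.fq", "reads_2.fq", "trimmed_1.fq", "trimmed_2.fq",
    literal="ACGTTGCAACGTTGCAACGT",
)
print(shlex.join(cmd))
```

Passing ref="primers.fa" instead of literal gives the fasta-file variant of the same command.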

Pairs: 157324
Joined: 136389 86.693%
Ambiguous: 11862 7.540%
No Solution: 9073 5.767%
Too Short: 0 0.000%

Avg Insert: 299.1
Standard Deviation: 6.9
Mode: 299

Insert range: 51 - 478
90th percentile: 299
75th percentile: 299
50th percentile: 299
25th percentile: 299
10th percentile: 299
Looks like your insert sizes are generally 299bp. With that knowledge, you could increase the merge rate by, for example, adding the flags "mininsert=50 minoverlap=50" to reduce the search space, and thus reduce the number of ambiguous pairs. Or even "mininsert=100 minoverlap=100".
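A toy illustration of why a larger minimum overlap cuts down ambiguity: short overlaps can match by chance, so excluding them leaves fewer competing answers. This is a deliberately simplified sketch using exact suffix/prefix matching on made-up reads, not BBMerge's actual algorithm, and it assumes read 2 has already been reverse-complemented.

```python
def candidate_overlaps(r1, r2, min_overlap):
    """List overlap lengths where the end of r1 exactly matches the start of r2.

    Assumption: r2 is already reverse-complemented, and only exact matches
    count (a real merger scores mismatches and base qualities instead).
    """
    longest = min(len(r1), len(r2))
    return [n for n in range(min_overlap, longest + 1) if r1[-n:] == r2[:n]]


# Hypothetical reads with a repeated "CAT" motif at the junction.
read1 = "GATTACAGGCATCAT"
read2 = "CATCATTTGACCA"

print(candidate_overlaps(read1, read2, 1))  # two competing overlaps: ambiguous
print(candidate_overlaps(read1, read2, 4))  # the short spurious overlap is excluded
```

With a minimum overlap of 1 there are two candidates (3 and 6 bases); raising the minimum to 4 leaves only the 6-base overlap.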

Thanks again for all the help on this, I really appreciate it.

You're welcome!
Attached Images
File Type: png BBMerge_May_2015.png (47.0 KB, 29 views)