Old 02-01-2017, 02:51 PM   #55
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,697

So, I would recommend a command like this:

Code: in1=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq prefilter=1 extend2=50 k=62 rem adapter=default
This operates in 3 phases.

1) Reads are processed and kmers are counted approximately, for the prefilter flag.
2) Reads are processed again and kmers are counted exactly, ignoring kmers that only occur once (to save memory).
3) Reads are processed again, and merging occurs:
3a) For each pair, merging is attempted.
3b) If a good overlap is not discovered, each read is extended by up to 50 bp on the right end only, and merging is attempted again. If the reads still don't overlap, the extension is undone and the pair is left as it was. If they do overlap after extension, they are merged.
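To make phases 1 and 2 concrete, here is a rough Python sketch of the two-pass idea: a cheap approximate count first, then exact counts only for kmers that might occur more than once. This is not the actual implementation; the function names, the bucket count, and the use of Python's built-in hash are all assumptions made for illustration.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def count_with_prefilter(reads, k, num_buckets=1 << 16):
    """Toy two-pass kmer counter (hypothetical helper, not the real tool).

    reads must be re-iterable (e.g. a list), because it is scanned twice.
    """
    # Pass 1: approximate counts. Each kmer is hashed into a small counter
    # that saturates at 2. Collisions can inflate a count but never lower it.
    approx = bytearray(num_buckets)
    for read in reads:
        for km in kmers(read, k):
            slot = hash(km) % num_buckets
            if approx[slot] < 2:
                approx[slot] += 1

    # Pass 2: exact counts, stored only for kmers whose approximate count
    # says they may occur more than once. This is where memory is saved.
    exact = Counter()
    for read in reads:
        for km in kmers(read, k):
            if approx[hash(km) % num_buckets] >= 2:
                exact[km] += 1

    # A hash collision can let a true singleton through; drop those here.
    return {km: c for km, c in exact.items() if c > 1}
```

For example, count_with_prefilter(["ACGTACGT", "ACGTTTTT"], 4) keeps only ACGT and TTTT, since every other 4-mer occurs once.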

This means that with the flag "extend2=50" you could get up to ~100 bp in the middle of a merged read (up to 50 bp from each read's extension) that is created from the kmers in other reads. This is not really different from normal assembly; assuming you will assemble this data at some point, this process is going to occur eventually. It's true that this process could result in the formation of chimeric sequence, but even with a complex metagenome, Tadpole has a very low rate of chimeric sequence generation. It will basically only happen if you have two strains of a microbe, one of which is over 20x as abundant as the other (you can adjust that '20' with the 'branchmult' flag); in that case, the middle of a pair of reads from the less-abundant strain might get filled with sequence from the more-abundant strain. But in practice this is very rare.
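The 20x rule can be pictured as a decision made at each extension step. The sketch below is a simplified illustration, not Tadpole's code: the function name and the input format (a dict of candidate next bases and the counts of the kmers they would form) are assumptions, and only the default of 20 comes from the description above.

```python
def choose_extension_base(next_counts, branchmult=20):
    """Return the base to extend with, or None if extension should stop.

    next_counts maps each candidate base to the count of the kmer that
    appending it would produce (hypothetical input format).
    """
    ranked = sorted(next_counts.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # dead end: no kmer supports any extension
    best_base, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    if second > 0 and best <= branchmult * second:
        return None  # ambiguous branch: stop rather than guess
    return best_base  # one path dominates by more than branchmult
```

With counts of 100 and 10 the extension stops at the branch. With counts of 300 and 10 it follows the abundant path, which is exactly the case where a read from the rare strain could pick up sequence from the common one.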

With a median insert of 292, it seems unlikely to me that 25% of the reads would be too far apart to overlap (>~490bp insert). Can you post the insert size distribution (you can get it with the flag ihist=ihist.txt)? It's possible there's another reason for the low merging rate, such as a high error rate, in which case error-correction would be more effective than extension. Of course, error-correction also poses the risk of creation of chimeric sequence, but again, at a very low rate.
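If it helps, here is a small Python sketch for summarizing the histogram and comparing it with the longest insert that could overlap. Several things here are assumptions: that the ihist file has '#' header lines followed by rows of insert size and count, that the reads are 250 bp, and that the minimum overlap is about 10 bp (those two values reproduce the ~490 figure). The function names are made up for illustration. Also keep in mind that the histogram can only contain pairs whose insert size was actually measured, so it shows the shape of the distribution rather than a direct count of pairs that failed to merge.

```python
def max_mergeable_insert(read_len, min_overlap, extend=0):
    """Longest insert for which two (optionally extended) reads still overlap."""
    return 2 * (read_len + extend) - min_overlap

def parse_ihist(lines):
    """Parse rows of 'insert_size count', skipping blank and '#' header lines."""
    hist = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        hist.append((int(fields[0]), int(fields[1])))
    return sorted(hist)

def quantile(hist, q):
    """Smallest insert size at which the cumulative count reaches fraction q."""
    total = sum(count for _, count in hist)
    if total == 0:
        return None
    target = q * total
    running = 0
    for size, count in hist:
        running += count
        if running >= target:
            return size
    return hist[-1][0]
```

Usage would be something like hist = parse_ihist(open("ihist.txt")), then comparing quantile(hist, 0.5) and quantile(hist, 0.95) against max_mergeable_insert(250, 10), which is 490, or max_mergeable_insert(250, 10, extend=50), which is 590.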

Last edited by Brian Bushnell; 02-01-2017 at 02:54 PM.