Old 01-17-2017, 12:58 PM   #49

Originally Posted by Brian Bushnell
Do you know what your read length and approximate depth are? Tadpole's default kmer length is 31, but with sufficient depth and read length, you will get a better assembly with a longer kmer.
Hi Brian,

Reads are 2x150 bp, and I generally have >100x coverage. I tried increasing to k=60, but that splits the assembly into many more contigs.
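For a rough sense of how kmer length interacts with these numbers, here is a minimal sketch of the commonly used relationship between base coverage and kmer coverage, C * (L - k + 1) / L. The function name `kmer_coverage` is made up for illustration, and this is a general approximation, not a description of how Tadpole computes depth internally; it also ignores sequencing errors.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Approximate kmer coverage from base coverage: C * (L - k + 1) / L.

    Assumption: error-free reads of uniform length; this is a generic
    rule of thumb, not Tadpole's own calculation.
    """
    if k > read_length:
        # A read shorter than k contributes no kmers at all.
        return 0.0
    return base_coverage * (read_length - k + 1) / read_length


# Numbers from this post: 150 bp reads at roughly 100x coverage.
for k in (31, 60):
    print(f"k={k}: ~{kmer_coverage(100, 150, k):.1f}x kmer coverage")
```

Under these assumptions, going from k=31 to k=60 lowers the effective kmer coverage but still leaves it well above typical minimum-depth thresholds, so depth alone probably does not explain the extra contigs.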

My goal here is to extract a consensus sequence for annotating ORFs. Before this, I mapped the reads to a reference using BBMap, and now I'm running into problems with Tadpole generating multiple contigs. For instance, one of my samples has reads mapped across the entire reference in BBMap. If I extract the reads that mapped to the reference and assemble them de novo in Tadpole, the output is two contigs (whose lengths sum to the length of the reference).

Upon inspecting the BBMap output, I find one ambiguity in the reads at the position where Tadpole has split the assembly into two contigs. About 50% of the reads have an "A" at that nucleotide position, while the other half have a "G". My guess is that this 'SNP' was introduced during my PCR amplification (prior to sequencing) or during library prep PCR. It doesn't suggest the presence of two viral genomes, because everything else is too homogeneous. Since there is ample overlap on both sides of this position, I'd rather assemble a single contig and call an ambiguous base there: R.
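To make the idea of calling "R" concrete, here is a minimal sketch of an IUPAC consensus call for a single pileup column. It is not a Tadpole or BBMap feature; the function name `consensus_base` and the `min_fraction` cutoff of 0.25 are assumptions chosen for illustration only.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases present.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
VALID = set("ACGT")


def consensus_base(bases, min_fraction=0.25):
    """Return the IUPAC code for one column of read bases.

    A base is kept only if it makes up at least `min_fraction` of the
    valid (A/C/G/T) calls; the cutoff is a hypothetical default.
    """
    counts = Counter(b.upper() for b in bases if b.upper() in VALID)
    total = sum(counts.values())
    if total == 0:
        return "N"  # no usable calls at this position
    kept = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
    return IUPAC.get(kept, "N")


# The situation described above: about half the reads show A, half show G.
print(consensus_base("A" * 50 + "G" * 50))  # R
```

A low-frequency variant, for example 2 G calls among 98 A calls, falls below the cutoff and the column is called as plain "A".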

Any idea how to accomplish this, and do you agree with my reasoning? I tried adding "shave=f" as a flag, but still no luck. By the way, what does "f" stand for?

JVGen