03-10-2015, 12:48 PM   #16
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

At default settings, BBNorm will correct using kmers that occur a minimum of 22 times (echighthresh flag). Whether 50x is sufficient to do a good job depends on things like the ploidy of the organism, how heterozygous it is, and the uniformity of your coverage. 50x of fairly uniform coverage is plenty for a haploid organism, or a diploid organism like human with a low het rate, but not for a wild tetraploid plant. You can always reduce the setting, of course, but I would not typically put it below ~16x. You can't get good error-correction with 3x depth no matter what you do. Bear in mind that the longer the kmer is compared to read length, the lower the kmer depth will be compared to read depth.
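To make the last point concrete, here is a minimal sketch of the usual approximation relating kmer depth to read depth: a read of length L contains L - k + 1 kmers, so kmer depth is roughly read depth times (L - k + 1) / L. The function name, the 100 bp read length, and the kmer lengths below are assumptions chosen for illustration, not BBNorm settings, and the estimate ignores sequencing errors and uneven coverage.

```python
def kmer_depth(read_depth: float, read_length: int, k: int) -> float:
    """Approximate kmer depth for uniform-length, error-free reads.

    Each read of length L contributes L - k + 1 kmers, so kmer depth
    is the read depth scaled by (L - k + 1) / L.
    """
    if not 0 < k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    kmers_per_read = read_length - k + 1
    return read_depth * kmers_per_read / read_length


if __name__ == "__main__":
    # Hypothetical example: 50x read depth with 100 bp reads.
    for k in (21, 31, 63):
        print(f"k={k}: ~{kmer_depth(50, 100, k):.1f}x kmer depth")
```

Under these assumed numbers, 50x read depth at k=31 works out to about 35x kmer depth, which is still above the 22x threshold mentioned above, while a much longer kmer on the same reads would fall below it.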

To deal with multiple different data sources, you can run BBNorm with the "extra" flag to use additional data to build kmer frequency tables but not as output, like this:

in=miseq.fq out=miseq_ecc.fq extra=gaII_1.fq,gaII_2.fq,gaII_3.fq

That would take extra processing time, since all the data would have to be reprocessed for every output file you generate. Alternatively, you can add a prefix to the read names in each file, like this:

miseq.fq addprefix prefix=miseq
gaII_1.fq addprefix prefix=gaII_1

Then cat all the files together, and error-correct them:

in=combined.fq out=ecc.fq ordered int=f

Then demultiplex:

in=ecc.fq out=demuxed_%.fq names=miseq,gaII_1 int=f

(note the % symbol; it will be replaced by a name)
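The prefix, combine, and demultiplex idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how BBTools implements it: the function names are hypothetical, records are held in memory as 4-tuples, and the "_" separator between the prefix and the original read name is an assumption (the actual renamed header format may differ).

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, plus line, qualities)
Record = Tuple[str, str, str, str]


def add_prefix(records: Iterable[Record], prefix: str) -> Iterator[Record]:
    """Tag each read name with its source, e.g. '@r1' -> '@miseq_r1'."""
    for header, seq, plus, qual in records:
        # header[1:] drops the leading '@'; '_' is an assumed separator.
        yield ("@" + prefix + "_" + header[1:], seq, plus, qual)


def demux_by_prefix(records: Iterable[Record],
                    names: List[str]) -> Dict[str, List[Record]]:
    """Split records back out by name prefix, preserving input order."""
    bins: Dict[str, List[Record]] = {name: [] for name in names}
    for rec in records:
        read_name = rec[0][1:]
        for name in names:
            if read_name.startswith(name + "_"):
                bins[name].append(rec)
                break
        # Records matching none of the names are silently dropped here.
    return bins


if __name__ == "__main__":
    # Hypothetical toy reads standing in for miseq.fq and gaII_1.fq.
    miseq = [("@r1", "ACGT", "+", "IIII"), ("@r2", "GGCC", "+", "IIII")]
    gaii = [("@r1", "TTAA", "+", "IIII")]

    # Equivalent of renaming each file, then cat-ing them together.
    combined = list(add_prefix(miseq, "miseq")) + list(add_prefix(gaii, "gaII_1"))

    # Error correction of the combined reads would happen at this point.
    for name, recs in demux_by_prefix(combined, ["miseq", "gaII_1"]).items():
        print(name, [r[0] for r in recs])
```

Because the records are only relabeled and then binned in the order they arrive, each source comes back out in its original order, which is what keeps interleaved pairs together.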

That will keep the read order the same. So, if all the reads are initially either single-ended or interleaved (i.e. one file per library), pairs will be kept together, and you can de-interleave them afterward if you want. You can convert between 2-file and interleaved with