View Single Post
Old 07-12-2014, 03:07 PM   #10
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707


You can run Reformat to verify that interleaving is valid, like this: in=reads.fq out=null vint

It will print a message indicating whether or not the interleaving is correct, based on the read names. So you can run it on the file from before normalization, and the file after, to check. Another program can fix files with broken interleaving: in=reads.fq out=fixed.fq fint

This is very fast and requires very little memory, but will only work on files that were interleaved, then some of the reads were thrown away. If the reads are arbitrarily disordered, you can run this: in=reads.fq out=fixed.fq repair

This requires a lot of memory, though, as it may potentially store up to half of the reads in memory.

If BBNorm broke pairing, it is much better to redo normalization with the "int=t" flag to force reads to be treated as pairs than to try to fix the corrupted output. If you run it with "int=t" it will completely ignore the read names, and not give any sort of error if they don't match, since the names could have been changed or they could be in some format that is not recognized. If you run with "int=f" then similarly the input will be forced to be used single-ended. And if you run without the flag, it will autodetect but unfortunately it doesn't currently tell you whether the data was processed single-ended or not; you can run this program: in=reads.fq

...which will print whether or not a file is detected as interleaved (and various other things). If says a file contains single-ended ASCII-33 fastq reads, then all of my programs will treat that file as such (with the exception of and with the 'fint' flag, because those are designed for corrupted files).

I'm not really sure what you mean by running iterative steps with BBNorm - it runs two passes by default, transparently, as that yields the best assemblies in my testing. You can run multiple iterations yourself if you want (using the output as input), but there's no reason to do that unless the coverage comes out above your target. I have not found that multiple iterations of error-correction are particularly useful, either; one seems to be fine. So if you use this command line: in=reads.fq out=normalized.fq int=t ecc=t hist=khist.txt histout=khist2.txt ow

...then it will do three passes:

1a) Kmer counting of input kmers for frequency histogram
1b) Error correction
1c) Normalization biased against reads with errors

2) Unbiased normalization

3) Kmer counting for output frequency histogram

...which should give you optimal results. The "ow" flag is optional and tells it to overwrite output files if they already exist; the "ecc" flag is optional and usually makes things better, but not always; and of course the "int" flag is optional but I always include it if I know whether or not the reads are paired, since autodetection is reliant on reads having Illumina standard names.

I hope that helps!
Brian Bushnell is offline   Reply With Quote