SEQanswers

Old 06-25-2014, 08:13 AM   #1
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 31
Question Illumina paired end assembly; in silico normalization

Dear all,
I want to assemble a fungal genome of approx. 35 Mbp. My data are paired-end reads of 101 bp in length with two different insert sizes (300 bp and 600 bp).
I decided to use Velvet for the assembly and searched for the best k-mer with VelvetOptimiser, using a small subset of my NGS data and taking N50 and longest contig as the criteria for the best assembly.
Of course I have already read that assembly statistics stagnate once a specific level of coverage is reached, since one cannot assemble more than the "real" genome in the end. BUT: what I observed is that the statistics drop when the coverage is increased (using the full data set).
Inspecting the results of the full-set and subset assemblies shows that long contigs from the subset assembly (median coverage 25x) are split into smaller contigs in the full-set assembly (median coverage 250x).
The genome was "over-sequenced" to begin with, so the calculated expected coverage is about 1000x.
Do you have any suggestions why that happens? And do you think in silico normalization might get rid of it? Using DigiNorm, for example, I would lose the pairing information and even the quality information, if I'm right, because the output is always FASTA?
I would be very happy if we could start a discussion about this. Has anyone of you observed this phenomenon before?
Old 06-25-2014, 09:54 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

uloeber,

You can either subsample or normalize; sometimes one gives a better assembly than the other. BBNorm will not lose pairing or quality information.

That package also contains Reformat, a tool that can do subsampling:

reformat.sh in=reads.fq.gz out=sampled.fq.gz samplerate=0.04

...will subsample to 4%.

With BBNorm, to normalize to 35x coverage:

bbnorm.sh in=reads.fq.gz out=normalized.fq.gz target=35 -Xmx29g

You may or may not need to set the -Xmx flag, depending on your environment. If you do, then set it to about 85% of the machine's physical memory.
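A back-of-the-envelope way to apply that 85% rule (the 32 GB figure below is purely illustrative, not from this thread):

```shell
# Illustrative only: derive an -Xmx value as ~85% of physical RAM,
# assuming a hypothetical machine with 32 GB (expressed in kB):
mem_kb=$(( 32 * 1024 * 1024 ))
xmx_g=$(( mem_kb * 85 / 100 / 1024 / 1024 ))
echo "-Xmx${xmx_g}g"   # prints -Xmx27g (integer truncation of 27.2)
```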

In my experience, normalization normally yields a better assembly than subsampling, particularly if you have variable coverage. The disadvantages of normalization compared to subsampling are that it is slower, uses more memory, and destroys information about repeat content, but I consider these disadvantages unimportant if the resultant assembly is better.

Also, I find that Velvet, with default parameters, yields the best assembly at around 35-40x coverage.
Old 06-26-2014, 12:39 AM   #3
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 31
Default

Thank you for the reply! I'll try again with BBNorm.
Old 07-05-2014, 11:35 AM   #4
sdmoore
Member
 
Location: Florida

Join Date: Jun 2014
Posts: 10
Default BBnorm target depth setting

Hi Thread/Brian,

I ran BBnorm on a pre-BBduk'd pair of reads. I plan to use the normalized reads for assemblers (Velvet or A5, well, A4 in this case).

./bbnorm.sh in1=R1_bbduk20.fq in2=R2_bbduk20.fq out1=R1_bbduk20_norm35.fq out2=R2_bbduck20_norm35.fq target=35

The opening dialog states that the target depth is 140 on pass 1, then 35 on pass 2.

Why the two passes?

What are the "error reads/pairs/types"?

Do these normalizations typically help a BBmap (or other) run as well?

Thanks.
Old 07-06-2014, 09:20 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

sdmoore,

There are two passes because, if no precautions are taken, normalization can selectively enrich the dataset for reads with errors, since reads with errors appear to have rare kmers. To combat this, the first pass normalizes down to a higher-than-requested level; specifically, reads that appear error-free are normalized to the high level (140 in this case), and reads that appear to contain errors are normalized to the low level (35 in this case).

So, after the first pass all reads will still have minimum depth 35 (so nothing was lost), but the new dataset will be selectively enriched for error-free reads. The second pass normalizes the remaining reads to the final target regardless of whether errors are suspected.

It's not really possible to do this in a single pass because if (for example) half your reads contain errors, and error-free reads are randomly discarded at the target rate but error-containing reads are discarded at a higher rate, you will ultimately achieve only half of the desired final coverage.
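To see the arithmetic with toy numbers (mine, not from the thread):

```shell
# Toy arithmetic only, not BBNorm's actual algorithm: 70x input depth,
# 35x target, half the reads error-free. A single pass that keeps
# error-free reads at rate target/depth, while discarding nearly all
# error reads, undershoots the target by about half:
depth=70; target=35
errfree_depth=$(( depth / 2 ))                  # 35x of the input is error-free
kept_cov=$(( errfree_depth * target / depth ))  # coverage surviving the pass
echo "${kept_cov}x"                             # prints 17x, about half of 35x
```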

You can set "passes=1" when running normalization, and look at the kmer frequency histogram afterward with "histout=x.txt". The histogram from a run with 2-pass normalization will have far fewer error kmers, which is obvious from a quick visual examination of the graph.
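A comparison run might look like this (a sketch using only the flags described above; the file names are placeholders):

```shell
# Sketch: single-pass vs. default two-pass normalization, writing the
# post-normalization kmer histogram of each for comparison.
bbnorm.sh in=reads.fq out=norm_1pass.fq target=35 passes=1 histout=khist_1pass.txt
bbnorm.sh in=reads.fq out=norm_2pass.fq target=35 histout=khist_2pass.txt
# khist_2pass.txt should show far fewer low-count (error) kmers.
```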

Normalization is not useful for mapping in most circumstances. The only advantage it would convey is a decrease in runtime by reducing the size of the input, which is useful if you are using a huge reference (like nt) or slow algorithm (like BLAST), but I don't use it for that. Error-correction (done by the same program as normalization, with ecc.sh instead of bbnorm.sh) may be useful before mapping, though - it will increase mapping rates, though always with a possibility that the ultimate output may be slightly altered. So, I would not error-correct data before mapping, either, except when using a mapping algorithm that is very intolerant of errors.
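If you do want error correction alone before mapping, the invocation is a sketch along these lines (same in=/out= style as the other commands in this thread; file names are placeholders):

```shell
# Sketch: error-correct reads without normalizing, via the ecc.sh
# wrapper mentioned above, before handing them to a mapper.
ecc.sh in=reads.fq out=corrected.fq
```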

I designed BBNorm as a preprocessing step for assembly. Normalizing and/or error-correcting data with highly uneven coverage - single cell amplified data and metagenomes (and possibly transcriptomes, though I have not tried it) - can yield a much better assembly, particularly with assemblers that are designed for isolates with a fairly flat distribution, but even on assemblers designed for single-cell or metagenomic assembly. Also, normalization can allow assembly of datasets that would otherwise run out of memory or take too long. Depending on the circumstance, it often yields better assemblies with isolates too, but not always.

I'm sure there are other places where BBNorm can be useful, but I don't recommend it as a general-purpose preprocessing step prior to any analysis - just for cases where you can achieve better results by reducing the data volume, flattening the coverage, or reducing the error rate.

Oh... and as for the error reads/pairs/types, that shows a summary of the input data quality, and fractions of reads or pairs that appear to contain errors. The "types" indicate which heuristic classified the read as appearing to contain an error; that's really there for testing and I may remove it.
Old 07-07-2014, 06:39 AM   #6
sdmoore
Member
 
Location: Florida

Join Date: Jun 2014
Posts: 10
Default BBduk goose.

Thanks Brian,

That's much clearer and makes sense.

I noticed that FastQC complained more about kmers after the normalization, and I can see why that could happen.
Old 07-10-2014, 03:37 AM   #7
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 31
Default

Hi Brian,
I used the following command for normalization of my oversequenced genome:
Code:
 ./bbnorm.sh in=species_L004_paired.fastq out=normalized_L004_80x.fastq target=80
Maybe this is the same question as mentioned before, but I would expect that errors, represented by low-frequency kmers, carry more weight after normalization. Is that the case?
By the way, even after the first run of your tool my assembly statistics improved. You did a great job! Thank you!
Also, did I miss giving the pairing argument?
Old 07-10-2014, 09:20 AM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

uloeber,

1) BBNorm was designed to prioritize discarding of reads that appear to contain errors, so typically, there are fewer low-frequency kmers after normalization. When you normalize, you can set "hist=khist_input.txt" and "histout=khist_output.txt". The first file will get the frequency histogram of kmers before normalization, and the second one after normalization, so you can see how the process changed the distribution.

2) I'm glad it was helpful!

3) If the input is interleaved, the program will autodetect that, as long as the reads follow the standard Illumina naming patterns, so it should be fine. I will update it to print a message indicating whether it is processing the data as paired or not. You can force a file to be interpreted as interleaved with the "interleaved=t" flag.
Old 07-12-2014, 07:55 AM   #9
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 31
Default

Hi Brian,
I just do not know if it was successful with respect to the pairing... I got the following exception using DigiNorm:
Code:
** ERROR: Error: Improperly interleaved pairs
That's why I was wondering if maybe there are some problems in the interleaved mode. Would your software give a hint if it is not possible to run in interleaved mode?
Can you also tell me what the best practice is for running two or more iterative steps with BBNorm?
Old 07-12-2014, 04:07 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

uloeber,

You can run Reformat to verify that interleaving is valid, like this:

reformat.sh in=reads.fq out=null vint

It will print a message indicating whether or not the interleaving is correct, based on the read names. So you can run it on the file from before normalization, and the file after, to check. Another program can fix files with broken interleaving:

bbsplitpairs.sh in=reads.fq out=fixed.fq fint

This is very fast and requires very little memory, but will only work on files that were interleaved, then some of the reads were thrown away. If the reads are arbitrarily disordered, you can run this:

repair.sh in=reads.fq out=fixed.fq repair

This requires a lot of memory, though, as it may potentially store up to half of the reads in memory.

If BBNorm broke pairing, it is much better to redo normalization with the "int=t" flag, forcing reads to be treated as pairs, than to try to fix the corrupted output. With "int=t" it will completely ignore the read names and will not give any error if they don't match, since the names could have been changed or could be in an unrecognized format. If you run with "int=f", the input will similarly be forced to single-ended mode. And if you run without the flag, it will autodetect, but unfortunately it doesn't currently tell you whether the data was processed as single-ended or not; you can run this program:

testformat.sh in=reads.fq

...which will print whether or not a file is detected as interleaved (and various other things). If testformat.sh says a file contains single-ended ASCII-33 fastq reads, then all of my programs will treat that file as such (with the exception of repair.sh and bbsplitpairs.sh with the 'fint' flag, because those are designed for corrupted files).

I'm not really sure what you mean by running iterative steps with BBNorm - it runs two passes by default, transparently, as that yields the best assemblies in my testing. You can run multiple iterations yourself if you want (using the output as input), but there's no reason to do that unless the coverage comes out above your target. I have not found that multiple iterations of error-correction are particularly useful, either; one seems to be fine. So if you use this command line:

bbnorm.sh in=reads.fq out=normalized.fq int=t ecc=t hist=khist.txt histout=khist2.txt ow

...then it will do three passes:

1a) Kmer counting of input kmers for frequency histogram
1b) Error correction
1c) Normalization biased against reads with errors

2) Unbiased normalization

3) Kmer counting for output frequency histogram

...which should give you optimal results. The "ow" flag is optional and tells it to overwrite output files if they already exist; the "ecc" flag is optional and usually makes things better, but not always; and of course the "int" flag is optional but I always include it if I know whether or not the reads are paired, since autodetection is reliant on reads having Illumina standard names.

I hope that helps!