04-30-2014, 04:08 PM   #2
Brian Bushnell

Quote:
Originally Posted by roliwilhelm
Hello,

My question has two parts. Starting with the most important question:

1) I've read that short-read assemblers designed for metagenomic data make use of read abundance in the assembly process. However, I like the idea of digital normalization (using khmer) as a tool for bringing out reads from genomes which are less abundant in the mix. Is it a wise idea to perform digital normalization and then use assemblers geared for metagenomic data, like MetaVelvet, IDBA_UD or RAY-meta?
I think normalization would be more useful when assembling metagenomic data with a normal assembler. I've found it to improve metagenome assemblies with Soap and Velvet, for example, but have not tried it with metagenome assemblers.
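
For a concrete picture, here is a minimal sketch of that normalize-then-assemble workflow using BBNorm followed by Velvet (the target depth, k-mer size, and single-end input are just placeholder values to adjust for your data):

bbnorm.sh in=reads.fq out=normalized.fq target=100   # normalize to ~100x (placeholder target)
velveth velvet_out 31 -fastq -short normalized.fq    # k=31 and single-end input are only examples
velvetg velvet_out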

Quote:
2) I have already attempted to partition my metagenomic data using khmer to improve assembly, but found it to be very computationally intensive and slow. The authors of khmer seem to suggest that one viable method would be to perform digital normalization, then partitioning, and then "re-inflate" your reads to pre-normalized abundances (see the khmer documentation HERE). I am keen to try that, but the script ("sweep-reads3.py") no longer comes prepackaged with the new khmer release. I did find it on their GitHub account, HERE, but wonder why it is no longer packaged with the release. Before investing time and energy, I was wondering if anyone has thoughts on this?

Thanks in advance
BBNorm is much faster than khmer, has less bias toward error reads, and also supports partitioning: rather than normalizing, you can split the data into low-coverage, medium-coverage, and high-coverage bins with custom cutoffs. That would make a lot more sense computationally.

For example:
bbnorm.sh passes=1 in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq lowbindepth=50 highbindepth=200
That will divide the data into coverage 1-50, 51-199, and 200+.

To normalize to depth 50, the command would be:
bbnorm.sh in=reads.fq out=normalized.fq target=50

Also, there is an alternative to normalization that works like this (a command-line sketch follows the list):

Downsample to 1% depth.
Assemble.
Map to assembly and keep reads that don't map.
Downsample unmapped reads to 10% depth.
Assemble.
Combine assemblies (with a tool like Dedupe to prevent redundant contigs).
Map to combined assembly.
...
etc.
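
In BBTools terms, one round of that loop might look roughly like this (a sketch only: the assembler steps are placeholders, and the sample rates simply mirror the 1% and 10% above):

reformat.sh in=reads.fq out=sub1.fq samplerate=0.01        # downsample to ~1%
# (assemble sub1.fq with whatever assembler you like to produce asm1.fa)
bbmap.sh ref=asm1.fa in=reads.fq outu=unmapped.fq nodisk   # keep only reads that don't map
reformat.sh in=unmapped.fq out=sub2.fq samplerate=0.1      # downsample unmapped reads to ~10%
# (assemble sub2.fq to produce asm2.fa)
dedupe.sh in=asm1.fa,asm2.fa out=combined.fa               # merge assemblies, removing redundant contigs

After that, you would map everything to combined.fa and repeat with the new unmapped fraction.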

It's not clear that either is universally better; both approaches have advantages and disadvantages. You can downsample with reformat, also in the BBTools package, like this:
reformat.sh in=reads.fq out=sampled.fq samplerate=0.01
