  • roliwilhelm
    Member
    • Jun 2012
    • 38

    "Re-inflating" Partitioned Metagenomic Data - KHMER

    Hello,

    My question has two parts. Starting with the most important question:

    1) I've read that short-read assemblers designed for metagenomic data make use of read abundance in the assembly process. However, I like the idea of digital normalization (using khmer) as a tool for bringing out reads from genomes which are less abundant in the mix. Is it a wise idea to perform digital normalization and then use assemblers geared for metagenomic data, like MetaVelvet, IDBA_UD or RAY-meta?

    2) I have already attempted to partition my metagenomic data using khmer to improve assembly, but found it very computationally intensive and slow. The authors of khmer seem to suggest that one viable method would be to perform digital normalization, then partitioning, and then "re-inflate" the reads to their pre-normalization abundances (see the khmer documentation, HERE). I am keen to try that, but the script ("sweep-reads3.py") no longer comes prepackaged with the new khmer release. I did find it on their GitHub account, HERE, but wonder why it is no longer packaged with the release. Before investing time and energy, I was wondering if anyone has thoughts on this?

    Thanks in advance
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Originally posted by roliwilhelm
    Hello,

    My question has two parts. Starting with the most important question:

    1) I've read that short-read assemblers designed for metagenomic data make use of read abundance in the assembly process. However, I like the idea of digital normalization (using khmer) as a tool for bringing out reads from genomes which are less abundant in the mix. Is it a wise idea to perform digital normalization and then use assemblers geared for metagenomic data, like MetaVelvet, IDBA_UD or RAY-meta?
    I think normalization would be more useful when assembling metagenomic data with a standard (non-metagenomic) assembler. I've found it to improve metagenome assemblies with Soap and Velvet, for example, but have not tried it with metagenome assemblers.

    2) I have already attempted to partition my metagenomic data using khmer to improve assembly, but found it very computationally intensive and slow. The authors of khmer seem to suggest that one viable method would be to perform digital normalization, then partitioning, and then "re-inflate" the reads to their pre-normalization abundances (see the khmer documentation, HERE). I am keen to try that, but the script ("sweep-reads3.py") no longer comes prepackaged with the new khmer release. I did find it on their GitHub account, HERE, but wonder why it is no longer packaged with the release. Before investing time and energy, I was wondering if anyone has thoughts on this?

    Thanks in advance
    BBNorm is much faster than khmer, has less bias toward error reads, and it also supports partitioning - rather than normalizing, you can split data into a low coverage bin, medium coverage bin, and high-coverage bin, with custom cutoffs. That would make a lot more sense, computationally.

    For example:
    bbnorm.sh passes=1 in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq lowbindepth=50 highbindepth=200
    That will divide the data into coverage 1-50, 51-199, and 200+.
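
To make the coverage-binning idea concrete, here is a minimal Python sketch. It is not BBNorm's implementation. It assumes a read's depth is estimated as the median count of its k-mers (the estimate digital normalization uses; BBNorm's internals may differ), and the function names and the tiny k are made up for illustration. The cutoffs follow the 1-50, 51-199, 200+ split described above.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_depth(read, counts, k):
    """Estimate a read's depth as the median count of its k-mers (assumption)."""
    values = [counts[km] for km in kmers(read, k)]
    return median(values) if values else 0


def depth_bin(depth, low=50, high=200):
    """Map a depth to a bin: low is <= low, high is >= high, mid is in between."""
    if depth <= low:
        return "low"
    if depth < high:
        return "mid"
    return "high"


def split_by_depth(reads, k=5, low=50, high=200):
    """Split reads into low/mid/high bins by estimated depth."""
    counts = count_kmers(reads, k)
    bins = {"low": [], "mid": [], "high": []}
    for read in reads:
        bins[depth_bin(read_depth(read, counts, k), low, high)].append(read)
    return bins
```

Note that this only looks at depth: two unrelated genomes sequenced to similar coverage would land in the same bin.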

    To normalize to depth 50, the command would be:
    bbnorm.sh in=reads.fq out=normalized.fq target=50

    Also, there is an alternative to normalization that works like this:

    Downsample to 1% depth.
    Assemble.
    Map to assembly and keep reads that don't map.
    Downsample unmapped reads to 10% depth.
    Assemble.
    Combine assemblies (with a tool like Dedupe to prevent redundant contigs).
    Map to combined assembly.
    ...
    etc.

    It's not clear that either is universally better; both approaches have advantages and disadvantages. You can downsample with reformat, also in the BBTools package, like this:
    reformat.sh in=reads.fq out=sampled.fq samplerate=0.01
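
The iterative downsample/assemble/map loop above can be sketched in Python as follows. This is only an outline under stated assumptions: `assemble`, `unmapped`, and `dedupe` are hypothetical callables standing in for whatever assembler, mapper, and deduplication tool you actually run, and `downsample` is a plain per-read random sample, not a reimplementation of reformat.

```python
import random


def downsample(reads, rate, seed=0):
    """Keep each read independently with probability `rate`."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < rate]


def iterative_assembly(reads, rates, assemble, unmapped, dedupe, seed=0):
    """Assemble in rounds, sampling the still-unexplained reads more deeply each time.

    reads    -- list of reads
    rates    -- sampling rates per round, e.g. [0.01, 0.1, 1.0]
    assemble -- hypothetical callable: list of reads -> list of contigs
    unmapped -- hypothetical callable: (reads, contigs) -> reads that do not map
    dedupe   -- hypothetical callable: list of contigs -> non-redundant contigs
    """
    contigs = []
    remaining = list(reads)
    for round_index, rate in enumerate(rates):
        if not remaining:
            break
        # Downsample only the reads that no contig explains yet.
        subset = downsample(remaining, rate, seed + round_index)
        # Assemble the subset and merge with contigs from earlier rounds.
        contigs = dedupe(contigs + assemble(subset))
        # Map against the combined assembly and keep what still does not map.
        remaining = unmapped(remaining, contigs)
    return contigs, remaining
```

The idea is that high-coverage organisms are assembled from the first, sparse sample and their reads then drop out, so later rounds spend their effort on the rarer organisms.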
    Last edited by Brian Bushnell; 04-30-2014, 03:28 PM.


    • cjfields
      Junior Member
      • Sep 2009
      • 6

      #3
      Worth mentioning that the partitioning that BBNorm appears to use is coverage-based. khmer uses a (simplified) de Bruijn graph-based partitioning (separating into disconnected partitions of the graph). That's a very important distinction between the two.
      Last edited by cjfields; 05-05-2014, 12:02 PM.


      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Originally posted by cjfields
        Worth mentioning that the partitioning that BBNorm appears to use is coverage-based. khmer uses a (simplified) de Bruijn graph-based partitioning (separating into disconnected partitions of the graph). That's a very important distinction between the two.
        Thanks for pointing that out; I was unaware that khmer's partitioning was NOT coverage-based. Yes, BBNorm's partitioning is purely coverage-based and will not be useful except in situations where you have multiple organisms (or organelles, plasmids, etc.) with very different coverage, though that's typically the case in metagenomes.

        That said, I've found partitioning by connectivity (overlap, de Bruijn graph, etc.) in metagenomes to be problematic with short (~100 bp) reads; the situation can easily devolve into a single cluster because everything ends up connected through a single highly conserved element, like a 16S subsequence. With longer (~250 bp) reads it seems to work better.
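
A toy Python sketch of connectivity-based partitioning shows how that collapse happens. This is not khmer's algorithm (khmer works on a compact probabilistic graph); it simply assumes that two reads belong to the same partition if they share at least one exact k-mer, ignores reverse complements, and uses a union-find structure. The function names and sequences are invented for illustration.

```python
def kmer_set(seq, k):
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def partition_by_shared_kmers(reads, k):
    """Group read indices into partitions connected by shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmer_set(read, k):
            if km in first_seen:
                # Shared k-mer: merge this read's partition with the earlier one.
                parent[find(idx)] = find(first_seen[km])
            else:
                first_seen[km] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Two "genomes" with no k-mers in common form two partitions.
separate = ["AACCGGTT", "GGTTAACC", "ATATGCGC", "GCGCATAT"]
print(partition_by_shared_kmers(separate, 4))

# One read from each genome also carries the same conserved element
# ("TCTAGA"), which links everything into a single partition.
linked = separate + ["AACCTCTAGA", "GCGCTCTAGA"]
print(partition_by_shared_kmers(linked, 4))
```

With short reads there is little unique sequence on either side of a conserved element, so bridges like this are common; longer reads are more likely to extend into organism-specific sequence.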


        • crusoe
          Programmer & Bioinformatician
          • Oct 2012
          • 10

          #5
          The current and up-to-date method for metagenomic assembly using khmer is kept at http://khmer-protocols.readthedocs.org


          • titusbrown
            Junior Member
            • Aug 2013
            • 8

            #6
            Reinflation isn't necessary.

            Hi all, we now have direct evidence that digital normalization works fine with both SPAdes and IDBA on at least one mock community data set -- neither assembler seems to do poorly on it. I haven't written it up for a blog post yet but I'm happy to send the numbers your way if you're interested.

            --titus


            • cjfields
              Junior Member
              • Sep 2009
              • 6

              #7
              Originally posted by titusbrown
              Hi all, we now have direct evidence that digital normalization works fine with both SPAdes and IDBA on at least one mock community data set -- neither assembler seems to do poorly on it. I haven't written it up for a blog post yet but I'm happy to send the numbers your way if you're interested.

              --titus
              That's awesome, thanks for checking this out Titus!

              -chris
