  • kmer size and coverage cutoff for digital normalization using the khmer suite

    Hi,

    I want to use digital normalization on a set of single-cell sequencing data as well as metagenomic data from low-complexity communities. I'm probably missing some really obvious point, but I'm just not sure how to apply the recommended diginorm cutoffs to my relatively long MiSeq reads.

    Both our single-cell and our low-complexity metagenomic sequencing data were produced on a MiSeq, yielding several million paired-end reads of ~250-300 bp each.

    The general recommendations in the khmer documentation state that you should normalize to a coverage of 1x to 5x using three-pass normalization and a kmer size of 20.

    My question is: are those recommendations really suited for modern "long read" Illumina data? If I reduce the coverage of all kmers of length 20 to 5x or less, won't that reduce the coverage of larger kmers far too much?

    Without diginorm, the optimal kmer size using e.g. MetaVelvet is mostly around k=81-101 for my datasets. How can there be enough kmer coverage left at that size for de Bruijn graph based assemblies if the kmers of length 20 are already reduced to less than 5x coverage?
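
For reference, the usual back-of-the-envelope relation between base coverage and kmer coverage in raw shotgun data is C_k = C * (L - k + 1) / L, where L is the read length. The sketch below just evaluates that formula. It assumes uniform read length and error-free reads, the function name is made up for illustration, and it describes unnormalized data only, not what diginorm leaves behind.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for a given base coverage.

    Uses C_k = C * (L - k + 1) / L, which assumes all reads have
    the same length L and contain no errors.
    """
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return base_coverage * (read_length - k + 1) / read_length


if __name__ == "__main__":
    # Illustrative numbers only: 100x base coverage, 250 bp reads.
    for k in (20, 81, 101):
        print(k, kmer_coverage(100, 250, k))
```

With 250 bp reads, a read contributes 231 kmers at k=20 and 170 at k=81, so going from k=20 to k=81 costs roughly a quarter of the kmer coverage rather than most of it.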

    My version of khmer doesn't seem to support kmers larger than 31, so apparently larger kmer sizes are simply not needed for diginorm. I just do not understand why...

  • #2
    diginorm k-mer size/coverage doesn't directly correlate with assembly parameters

    Hi jov14,

    the short answer is that because khmer/diginorm retains or rejects entire reads, the k-mer size and coverage of that process are only weakly connected with what the assembler sees and does. That having been said, we are working on increasing k size and doing things like memory efficient error correction instead, which would give you more choices.

    A slightly longer answer: what diginorm is actually doing is aligning the reads to the De Bruijn graph, and while the alignment process depends on k, the alignment itself is not so sensitive to k. Then, diginorm looks at the coverage of the alignment in the graph and decides whether to accept or reject the read. This changes the coverage from random/whole genome shotgun to systematic/smooth, which has many (often good) effects on the resulting assembly. But it also tweaks the coverage distribution - while a coverage of 5 would be disastrous for whole genome shotgun (because you'd miss ~5% of bases!) the variance on the diginormed data is much lower, so you get a reduced set of reads that still contain all the information of the original set of reads.
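
A toy sketch of the accept/reject step may make this more concrete. It follows the published description of digital normalization, where a read is kept only if the median count of its k-mers is still below the cutoff C. It uses an exact in-memory counter instead of khmer's probabilistic counting structure, and the function names are invented for illustration, so treat it as a model of the idea and not as khmer's actual code.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer of seq as a list."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=20, cutoff=20):
    """Keep a read only while the median count of its k-mers is below cutoff.

    Whole reads are kept or dropped; k-mers of kept reads are counted,
    k-mers of dropped reads are not.
    """
    counts = Counter()  # exact counts; an assumption made for simplicity
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


if __name__ == "__main__":
    # Ten copies of one read plus one novel read, tiny k and cutoff.
    reads = ["ACGTTGCAAG"] * 10 + ["TTTTTTTTTT"]
    print(len(diginorm(reads, k=4, cutoff=3)))
```

Because the decision is made per read and based on the median, a read that brings mostly new k-mers is kept even if some of its k-mers are already abundant.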

    I hope that helps!



    • #3
      Oh, sorry, to answer your original question:

      I would suggest running a single pass with C=20/k=20, and only doing further error trimming etc. if you are running into out-of-memory problems. We've found C=20/k=20 works pretty well for most sequence data.
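
As a rough illustration of that single pass, the snippet below only assembles a command line for khmer's normalize-by-median.py script and prints it. The -k, -C and -p options are taken from the khmer documentation as I understand it and may differ between versions, so check them against --help on your install. The input file name and the helper function are hypothetical.

```python
import shlex


def normalize_command(inputs, ksize=20, cutoff=20, paired=False):
    """Build (but do not run) a normalize-by-median.py command line.

    Option names are assumed from the khmer docs; verify with --help.
    """
    cmd = ["normalize-by-median.py", "-k", str(ksize), "-C", str(cutoff)]
    if paired:
        # Paired mode expects interleaved paired-end input.
        cmd.append("-p")
    cmd.extend(inputs)
    return cmd


if __name__ == "__main__":
    # "reads.interleaved.fq" is a placeholder file name.
    print(shlex.join(normalize_command(["reads.interleaved.fq"], paired=True)))
```

Memory-related options are left out on purpose, since the right values depend on the data set and the khmer version.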



      • #4
        Thanks for your answer and suggestion!
        After I posted this "problem" and had some more time to think, it came back to me:
        Since, as you say, diginorm only starts to exclude reads if ALL kmers in a read already have counts higher than the cutoff, and reads are always kept if even one new kmer is present in the read, of course the final coverage for each individual kmer will be much higher than the cutoff. I simply forgot that, and my problem is really nonexistent.

        Actually, I already used three-pass normalization on previous data (where I had read lengths of 100 bp), using C=20 in the first pass and C=5 in the third (I must have picked that up in one of your tutorials somewhere).
        I usually then do two assemblies, one with first-pass-normalized data and one with third-pass-normalized data, and then just pick the assembly that looks best (at least for single-cell data, both are usually way better than with non-normalized data).

        However, would you say that for longer reads, higher kmer values would bring some advantages (I would expect that at least the identification of unique kmers for the kmer-trimming/error-correction step would be more specific), or would you say the values are better left as they are?



        • #5
          You can probably get slightly better performance on nasty large repetitive genomes with larger k-mers, for sure! I balance that in my lab against the point that we feel very comfortable with k=20/C=20 for transcriptomes and metagenomes based on our personal experience.

          Report back if you play around - I'd love to hear more!
