  • Unexpected kmer occurrence distribution across 10 bacterial genomes

    I'm trying to identify unique regions across a large collection of bacterial genomes. My first step is to use tallymer (from the genometools package) to identify unique kmers across the entire database. To familiarize myself with the tools, I started with a small set of 10 genomes. After building the suffix index, I ran tallymer occratio to get the distribution of unique and non-unique kmers per mer size, but the distribution is roughly the opposite of what I expected:

    # distribution of unique mers
    10 520
    11 1490
    12 2403
    13 2887
    14 3077
    15 3151
    16 3178
    17 3191
    18 3197
    19 3200
    20 3203
    21 3205
    22 3207

    # distribution of non unique mers (counting each non unique mer only once)
    10 689042
    11 1306536
    12 1748019
    13 1944353
    14 2012896
    15 2034879
    16 2041859
    17 2044257
    18 2045175
    19 2045601
    20 2045875
    21 2046092
    22 2046265

    Naively, I had expected to find more instances of the smaller kmers in my test set than of the larger mer sizes. That expectation came from thinking that as the mer size approaches the genome size, the number of possible instances goes down (down to just 1 'mer' whose size is the length of the genome).

    Can anyone comment on what I am seeing, based on their own experience? My assumption rests on that one very flimsy thought (a mer the length of the genome can only occur once), but I want to make sure my results are not unexpected before moving ahead. Note that I have no reason to believe there was any problem with the execution of tallymer (or of suffixerator before the tallymer occratio command): the jobs finished without warnings or errors and produced the expected outputs.
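
    For anyone who wants to sanity-check the expectation above on toy data, here is a minimal sketch in plain Python. It is not tallymer and makes several assumptions: "unique" is taken to mean a distinct kmer that occurs exactly once across all sequences, "non-unique" a distinct kmer that occurs more than once (counted once), and reverse complements are ignored. The function names are made up for illustration. The second function gives the simple upper bound on distinct kmers: there can be no more than 4^k of them, and no more than the number of kmer start positions.

```python
from collections import Counter
import random


def kmer_occurrence_summary(sequences, k):
    """Count distinct k-mers occurring exactly once and more than once.

    Assumption: occurrences are pooled across all sequences and only the
    forward strand is considered. This is an illustration, not tallymer.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    unique = sum(1 for c in counts.values() if c == 1)
    non_unique = sum(1 for c in counts.values() if c > 1)
    return unique, non_unique


def max_distinct_kmers(lengths, k):
    """Upper bound on distinct k-mers: min(4^k, total start positions)."""
    positions = sum(max(length - k + 1, 0) for length in lengths)
    return min(4 ** k, positions)


# Toy demo on random sequences (hypothetical data, not real genomes).
random.seed(0)
toy = ["".join(random.choice("ACGT") for _ in range(2000)) for _ in range(3)]
for k in range(2, 13):
    unique, non_unique = kmer_occurrence_summary(toy, k)
    bound = max_distinct_kmers([len(s) for s in toy], k)
    print(k, unique, non_unique, bound)
```

    For small k the 4^k term is the limit, and once 4^k exceeds the number of positions the bound falls only slowly as k grows (one position per sequence per step). Real genomes are not random, so this only illustrates the counting, not what tallymer should report.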

  • #2
    I have now looked at the distribution of kmers of size 10-30 across the full set of bacterial genomes (~4600), and I'm seeing 6-8 billion kmers for each size in the range I was considering for qPCR primer length (18-22 bp). I decided to dump the actual kmers using tallymer (from genometools) and explore the data further, but the final step, tallymer search, is going to take an unreasonable amount of time to dump 6-8 billion kmers: after watching it run for a day, I estimate it would take several years to dump any single kmer size between 18 and 22.

    Can anyone suggest a strategy for identifying qPCR primers that uniquely identify bacterial genomes in gut genome samples? The methods I've looked at so far (a home-brew tallymer approach and RUCS) seem like they would work well for a smaller collection of genomes, but they don't scale to the number of genomes I want to use as a background.
