
  • High amount of low frequency / unique k-mers in Illumina reads

    Dear all,

    I would like to hear your opinions on what proportion of low-frequency k-mers in Illumina reads is normal.

    I am asking because I am having a hard time finding a good assembly strategy for two 100 Mb invertebrate genomes I just received. Most of what worked for my first genome, from a similar species, does not work now. I get very different results from the different assemblers (Masurca, Dipspades, Platanus), and sometimes they crash.

    The differences between the datasets (all genomes around 100 Mb):

    - old Illumina dataset: 80x coverage, 150bp PE reads, 450bp insert
    - new Illumina datasets: 160x coverage, 125bp PE reads, 450bp insert

    The main difference seems to be the number of low-frequency k-mers in the reads. To give you an idea: after trimming one sample with platanus_trim, the 32-mer histogram from Platanus shows 400 million k-mers that occur only once. The hammer correction module of Dipspades also tells me that 80% of k-mers are singletons. A Platanus run with the old (trimmed) dataset showed only 400k singly occurring 32-mers.
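
For readers who want to check such numbers independently of the assembler logs, here is a minimal sketch of how a k-mer frequency histogram and the singleton fraction can be computed. It is an illustration only, not what Platanus or the hammer module do internally: the function names are hypothetical, it counts k-mers as they appear on the read (no reverse-complement merging), and it holds all counts in memory, so it only suits a small subsample of reads.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer in an iterable of read sequences (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing uncalled bases.
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def frequency_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers that have it."""
    return Counter(counts.values())


def singleton_fraction(counts):
    """Fraction of distinct k-mers that occur exactly once."""
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


if __name__ == "__main__":
    # Toy reads; in practice these would come from a subsample of the FASTQ file.
    toy_reads = ["ACGTACGT", "ACGTACGA"]
    counts = kmer_counts(toy_reads, k=4)
    print(sorted(frequency_histogram(counts).items()))
    print(singleton_fraction(counts))
```

For whole datasets, a dedicated k-mer counter is the practical choice; the sketch is only meant to make the definition of "singleton" explicit.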

    So I am going back and forth between trimming (trimmomatic, platanus), correction (hammer) and normalization (bbnorm). Masurca, however, has its own built-in pipeline for correction and trimming, so I just feed it the reads as I received them. But while Masurca gave me the best assembly last time, with the new datasets it gives me by far the worst.

    Are there reasons other than sequencing errors or metagenomic contamination for such a large number of low-frequency k-mers? At least in my experience, I don't think that contamination of the genomic DNA during isolation is responsible here.
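
As a rough sanity check on the sequencing-error explanation, the sketch below turns the coverage figures from the post into read counts and into a crude upper bound on the number of erroneous k-mers. The per-base error rate is an assumption chosen purely for illustration (it is not given in the thread), the function names are hypothetical, and the bound assumes every error spoils k distinct k-mers, which overestimates for errors near read ends or close to each other.

```python
def read_count(genome_size, coverage, read_length):
    """Approximate number of reads needed to reach a given coverage."""
    return round(genome_size * coverage / read_length)


def max_error_kmers(genome_size, coverage, error_rate, k):
    """Crude upper bound on erroneous k-mers.

    Each base error can affect at most k overlapping k-mers, so the bound
    is (total sequenced bases) * (per-base error rate) * k.
    """
    total_bases = genome_size * coverage
    expected_errors = total_bases * error_rate
    return expected_errors * k


if __name__ == "__main__":
    genome_size = 100_000_000      # ~100 Mb, from the post
    assumed_error_rate = 0.001     # ASSUMPTION: 0.1% per base, illustrative only

    print(read_count(genome_size, 160, 125))   # new dataset
    print(read_count(genome_size, 80, 150))    # old dataset
    print(max_error_kmers(genome_size, 160, assumed_error_rate, 32))
```

Under these assumptions the new dataset corresponds to about 128 million reads. How the bound compares with the observed 400 million singletons depends entirely on the assumed error rate, so the number is a plausibility check rather than a diagnosis.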

    Any suggestions for a better assembly?

    Thank you!

  • #2
    Start with a clean inbred sample and min. 2x250 reads...

    Originally posted by balaena
    Dear all,

    The differences between the datasets (all genomes around 100 Mb):

    - old Illumina dataset: 80x coverage, 150bp PE reads, 450bp insert
    - new Illumina datasets: 160x coverage, 125bp PE reads, 450bp insert

    Any suggestions for a better assembly?
    Thank you!
    0. Low heterozygosity (inbred) specimen (if possible).

    (If using Illumina platform)

    1. PCR-Free library,
    2. a good quality 2x300 or 2x250 Miseq/Hiseq run,
    3. Flash or panda (to merge the overlapping read pairs),
    4. Subsample and try to build a repeat library.
    5. Remove the reads from repetitive regions, and try assembling the unique ones.
    6. You can try CLC-Bio or DNAstar's Ngen (if you have access to them) and see what you get.
    7. Do not forget to add your repeat sequences back to your final contig set.
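
The subsampling in step 4 can be sketched as follows. This is a minimal illustration, assuming uncompressed 4-line-per-record FASTQ files with the two mates in the same order in both files; `read_fastq` and `subsample_pairs` are hypothetical helper names, and building the repeat library itself is not covered here.

```python
import io
import random
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def subsample_pairs(r1_records, r2_records, fraction, seed=0):
    """Keep each read pair with probability `fraction`, mates kept together."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    for rec1, rec2 in zip(r1_records, r2_records):
        if rng.random() < fraction:
            yield rec1, rec2


if __name__ == "__main__":
    # Toy in-memory files; real data would be opened with open("reads_1.fastq").
    r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nTTTT\n+\nIIII\n")
    r2 = io.StringIO("@a/2\nACGA\n+\nIIII\n@b/2\nGGGG\n+\nIIII\n")
    for mate1, mate2 in subsample_pairs(read_fastq(r1), read_fastq(r2), 0.5):
        print(mate1[0], mate2[0])
```

Sampling pairs rather than single reads keeps the mate information intact, which matters if the subsample is later assembled or mapped.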

    PS: A Nextera mate-pair library can be quite a helpful addition if you need longer scaffolds.

    PPS: The above dataset can be used for PacBio/Nanopore read correction sometime in the future.
