  • Extract k-mers from a FASTA file

    Hi everyone,

    I'm looking for a C library that would let me:

    - Extract the first x k-mers of a FASTA file (into a buffer)
    - Process these k-mers
    - Extract the next x k-mers of the FASTA file (into the buffer)
    - Process these k-mers
    - And so on, until the end of the FASTA file.

    Basically, the goal is to read a FASTA file in chunks of x k-mers. I don't want to load a complete sequence into memory and extract the k-mers afterwards, especially if the sequence is very long.
    I could write this code myself, but I'm pretty sure it has already been implemented somewhere, so if you know where I can find it, I would be grateful for a pointer.

    Thanks a lot for your help.

    Best regards.

  • #2
    Or you can use wordcount in EMBOSS.



    • #3
      First of all, thank you for your answer.

      The problem with wordcount in EMBOSS, as with Jellyfish or DSK, is that these tools extract all unique words of size k together with their counts.
      In my case, I don't need the counts, and I don't want to read only the unique k-mers. If a k-mer appears y times in my FASTA file, I want to read it y times, at each position where it appears (the order is important).

      Furthermore, I want to use a buffer because extracting the k-mers should never use more memory than the size of this buffer. Most k-mer counting tools let me set how much memory (RAM or disk) they may use, but they don't let me process the extracted k-mers whenever the memory I allocated is full.
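
For illustration, here is a minimal sketch in C of the chunked, order-preserving reading described above. It is not taken from any existing library: the names `kmer_stream`, `kmer_stream_init`, `kmer_stream_next_chunk` and the limit `KMER_MAX_K` are hypothetical. It also assumes plain uncompressed FASTA, restarts the k-mer window at every header line, and treats any character other than A, C, G, T as a break in the sequence.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define KMER_MAX_K 63 /* arbitrary upper bound chosen for this sketch */

/* Hypothetical streaming state: only the last k bases are kept in memory. */
typedef struct {
    FILE  *fp;
    size_t k;
    size_t filled;        /* number of valid bases currently in window */
    int    at_line_start; /* next character starts a new line */
    int    in_header;     /* currently inside a '>' header line */
    char   window[KMER_MAX_K];
} kmer_stream;

void kmer_stream_init(kmer_stream *ks, FILE *fp, size_t k)
{
    assert(k > 0 && k <= KMER_MAX_K);
    ks->fp = fp;
    ks->k = k;
    ks->filled = 0;
    ks->at_line_start = 1;
    ks->in_header = 0;
}

/*
 * Writes up to max_kmers k-mers into buf, in file order, duplicates included.
 * Each k-mer occupies a slot of k + 1 bytes (k bases plus '\0'), so buf must
 * hold at least max_kmers * (k + 1) bytes.
 * Returns the number of k-mers written; 0 means the end of the file.
 */
size_t kmer_stream_next_chunk(kmer_stream *ks, char *buf, size_t max_kmers)
{
    size_t n = 0;
    int c;

    /* n is tested first, so no character is consumed once the buffer is full. */
    while (n < max_kmers && (c = fgetc(ks->fp)) != EOF) {
        if (c == '\n' || c == '\r') {          /* newlines are skipped */
            ks->at_line_start = 1;
            ks->in_header = 0;
            continue;
        }
        if (ks->at_line_start && c == '>') {   /* new record: restart window */
            ks->in_header = 1;
            ks->filled = 0;
        }
        ks->at_line_start = 0;
        if (ks->in_header)
            continue;

        c = toupper(c);
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
            ks->filled = 0;                    /* e.g. N: no k-mer spans it */
            continue;
        }
        if (ks->filled == ks->k) {             /* slide the window by one base */
            memmove(ks->window, ks->window + 1, ks->k - 1);
            ks->filled--;
        }
        ks->window[ks->filled++] = (char)c;
        if (ks->filled == ks->k) {
            char *slot = buf + n * (ks->k + 1);
            memcpy(slot, ks->window, ks->k);
            slot[ks->k] = '\0';
            n++;
        }
    }
    return n;
}
```

A caller would allocate x * (k + 1) bytes once, then call `kmer_stream_next_chunk` in a loop until it returns 0, processing each chunk in between. Memory use stays bounded by that buffer plus the k-byte window, whatever the sequence length.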



      • #4
        I have created an R package which can do a variety of k-mer counts on FASTA and FASTQ files. The C library is open, so you can insert specialized actions at specific points. The only drawback is that the library starts to get slow when k>10.

        You can download it from R-forge:
        install.packages("seqTools", repos="http://R-Forge.R-project.org")

        Additionally, I have a manuscript in preparation because I found batch effects in fastq files by clustering k-mer counts:
        Batch effects, artificial sources of variation due to experimental design, are a widespread phenomenon in high throughput data. Therefore, mechanisms for detection of batch effects are needed requiring comparison of multiple samples. We apply hierarchical clustering (HC) on DNA k-mer counts of multiple RNA-seq derived Fastq files. Ideally, HC generated trees reflect experimental treatment groups and thus may indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. In order to provide a simple applicable tool we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. DNA k-mer counts were analysed on 61 Fastq files containing RNA-seq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced on 8 different Illumina Flowcells. Results: Pairwise comparison of all Flowcells with hierarchical clustering revealed strong Flowcell based tree separation in 6 (21 %) and detectable Flowcell based clustering in 17 (60.7 %) of 28 Flowcell comparisons. In our samples, batch effects were also present in reads mapped to the human genome. Filtering reads for high quality (Phred >30) did not remove the batch effects. Conclusions: Hierarchical clustering of DNA k-mer counts provides a quality criterion and an unspecific diagnostic tool for RNA-seq experiments.


        The C code may be a bit difficult to understand because I keep two sequence arrays (to handle compressed files and to skip newlines). It works sequentially, so memory consumption mainly depends on k. Just contact me if you have questions. Any feedback would be great.

        Wolfgang
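
To illustrate why memory use in sequential k-mer counting depends mainly on k rather than on file size, here is a minimal sketch in C. It is not the seqTools code: `base_code` and `count_kmers` are hypothetical names, and the sketch assumes plain uncompressed FASTA, a 2-bit encoding of A, C, G, T, and a dense table of 4^k counters.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Map a base to 2 bits; return -1 for anything else (N, gaps, ...). */
int base_code(int c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/*
 * Reads a FASTA stream one character at a time and counts every k-mer.
 * Returns a calloc'ed table of 4^k counters indexed by the 2-bit encoded
 * k-mer (first base in the highest bits), or NULL if allocation fails.
 * The caller frees the table. The table is the only sizeable allocation,
 * so memory grows with k and not with the length of the input.
 */
uint64_t *count_kmers(FILE *fp, unsigned k)
{
    assert(k >= 1 && k <= 15); /* arbitrary cap for this sketch */

    size_t n_slots = (size_t)1 << (2 * k);   /* 4^k */
    uint64_t mask = (uint64_t)n_slots - 1;
    uint64_t *counts = calloc(n_slots, sizeof *counts);
    if (counts == NULL)
        return NULL;

    uint64_t index = 0;   /* rolling 2-bit encoding of the last k bases */
    unsigned filled = 0;  /* valid bases since the last break, capped at k */
    int at_line_start = 1, in_header = 0;
    int c;

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n' || c == '\r') {        /* newlines are skipped */
            at_line_start = 1;
            in_header = 0;
            continue;
        }
        if (at_line_start && c == '>') {     /* new record: restart window */
            in_header = 1;
            filled = 0;
        }
        at_line_start = 0;
        if (in_header)
            continue;

        int code = base_code(c);
        if (code < 0) {                      /* no k-mer spans this base */
            filled = 0;
            continue;
        }
        index = ((index << 2) | (uint64_t)code) & mask;
        if (filled < k)
            filled++;
        if (filled == k)
            counts[index]++;
    }
    return counts;
}
```

With 8-byte counters, the table in this sketch needs 4^k * 8 bytes, which is about 8 MB for k = 10 and about 128 MB for k = 12.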
