
  • Extract k-mers from a FASTA file

    Hi everyone,

    I'm looking for a C library that would let me:

    - Extract the first x k-mers of a FASTA file into a buffer
    - Process these k-mers
    - Extract the next x k-mers of the FASTA file into the buffer
    - Process these k-mers
    - And so on, until the end of the FASTA file.

    Basically, the goal is to read a FASTA file in chunks of x k-mers. I don't want to load a complete sequence into memory and extract the k-mers afterwards, especially if the sequence is very long.
    I could write this code myself, but I'm pretty sure it has already been implemented somewhere, so if you know where I can find it, I would be grateful for a pointer.

    Thank you very much for your help.

    Best regards.

  • #2
    Or you can use wordcount in EMBOSS.



    • #3
      First of all, thank you for your answer.

      The thing with wordcount in EMBOSS, as with Jellyfish or DSK, is that it extracts all unique words of size k along with their counts.
      In my case, I don't need the counts, and I don't want to read only the unique k-mers. If a k-mer appears y times in my FASTA file, I want to read it y times, at each position where it appears (the order is important).

      Furthermore, I want to use a buffer because extracting the k-mers should never use more memory than the size of this buffer. Most k-mer counting tools let me set how much memory (RAM or disk) they may use, but they don't let me process the extracted k-mers whenever the allocated memory is full.
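The requirements above (every k-mer in file order, duplicates included, memory bounded by a fixed buffer) can be sketched in plain C as follows. This is only an illustration, not code from any of the tools mentioned in this thread: the names `kmer_reader`, `kmer_reader_init`, `kmer_reader_free` and `kmer_next_chunk` are hypothetical, and the sketch assumes an uncompressed FASTA file with `\n` or `\r\n` line endings, where k-mers should not span two records.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reader state: only the current k-base window is kept. */
typedef struct {
    FILE  *fp;
    size_t k;             /* k-mer length */
    size_t filled;        /* bases currently held in window */
    int    in_header;     /* 1 while skipping a '>' header line */
    int    at_line_start; /* 1 if the next character starts a line */
    char  *window;        /* the last k bases read */
} kmer_reader;

/* Returns 0 on success, -1 if the window cannot be allocated. */
int kmer_reader_init(kmer_reader *r, FILE *fp, size_t k)
{
    assert(r != NULL && fp != NULL && k > 0);
    r->fp = fp;
    r->k = k;
    r->filled = 0;
    r->in_header = 0;
    r->at_line_start = 1;
    r->window = malloc(k);
    return r->window ? 0 : -1;
}

void kmer_reader_free(kmer_reader *r)
{
    free(r->window);
    r->window = NULL;
}

/* Writes up to max_kmers NUL-terminated k-mers into buf (k + 1 bytes each),
   in file order, duplicates included. Returns the number written; with
   max_kmers > 0, a return value of 0 means the file is exhausted. */
size_t kmer_next_chunk(kmer_reader *r, char *buf, size_t max_kmers)
{
    size_t n = 0;
    int c;

    while (n < max_kmers && (c = getc(r->fp)) != EOF) {
        if (r->in_header) {               /* skip the rest of a header line */
            if (c == '\n') { r->in_header = 0; r->at_line_start = 1; }
            continue;
        }
        if (c == '\n') { r->at_line_start = 1; continue; }
        if (c == '\r') continue;
        if (c == '>' && r->at_line_start) {
            /* new record: assumed here that k-mers must not span sequences */
            r->in_header = 1;
            r->filled = 0;
            r->at_line_start = 0;
            continue;
        }
        r->at_line_start = 0;
        if (r->filled < r->k) {
            r->window[r->filled++] = (char)c;
        } else {                          /* slide the window by one base */
            memmove(r->window, r->window + 1, r->k - 1);
            r->window[r->k - 1] = (char)c;
        }
        if (r->filled == r->k) {          /* a full k-mer is available */
            char *slot = buf + n * (r->k + 1);
            memcpy(slot, r->window, r->k);
            slot[r->k] = '\0';
            n++;
        }
    }
    return n;
}
```

A caller would allocate `buf` with `x * (k + 1)` bytes and loop on `kmer_next_chunk(&r, buf, x)` until it returns 0, processing the returned k-mers after each call. Memory use is then the buffer plus the k-byte window, whatever the sequence length. The `memmove` costs O(k) per base; a ring buffer would avoid that for large k.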



      • #4
        I have created an R package which can do a variety of k-mer counts on FASTA and FASTQ files. The C library is open, so you can insert specialized actions at specific points. The only drawback is that the library starts to get slow when k>10.

        You can download it from R-forge:
        install.packages("seqTools", repos="http://R-Forge.R-project.org")

        Additionally, I have a manuscript in preparation because I found batch effects in fastq files by clustering k-mer counts:
        Batch effects, artificial sources of variation due to experimental design, are a widespread phenomenon in high throughput data. Therefore, mechanisms for detection of batch effects are needed requiring comparison of multiple samples. We apply hierarchical clustering (HC) on DNA k-mer counts of multiple RNA-seq derived Fastq files. Ideally, HC generated trees reflect experimental treatment groups and thus may indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. In order to provide a simple applicable tool we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. DNA k-mer counts were analysed on 61 Fastq files containing RNA-seq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced on 8 different Illumina Flowcells. Results: Pairwise comparison of all Flowcells with hierarchical clustering revealed strong Flowcell based tree separation in 6 (21 %) and detectable Flowcell based clustering in 17 (60.7 %) of 28 Flowcell comparisons. In our samples, batch effects were also present in reads mapped to the human genome. Filtering reads for high quality (Phred >30) did not remove the batch effects. Conclusions: Hierarchical clustering of DNA k-mer counts provides a quality criterion and an unspecific diagnostic tool for RNA-seq experiments.


        The C code may be a bit difficult to understand because I keep two sequence arrays (needed for working on compressed files and for skipping newlines). It works sequentially, so memory consumption mainly depends on k. Just contact me if you have questions. Any feedback would be great.
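To illustrate why the memory of a sequential k-mer counter can depend on k rather than on file size, here is a generic sketch. It is not the seqTools code, and the names `base_code` and `count_kmers` are hypothetical. It assumes a 2-bit encoding of A, C, G and T and a count table with 4^k entries, which is one common way to do this; any other character (for example N) restarts the window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a base to 2 bits; returns -1 for anything else (e.g. N). */
int base_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/* Adds the k-mer counts of seq[0..len) to counts, which must hold 4^k
   entries and be zeroed by the caller before the first call.
   k is limited to 1..15 so that an index fits in 32 bits. */
void count_kmers(const char *seq, size_t len, unsigned k, uint32_t *counts)
{
    assert(k >= 1 && k <= 15);
    const uint32_t mask = ((uint32_t)1 << (2 * k)) - 1;
    uint32_t index = 0;   /* 2-bit encoding of the last k bases */
    unsigned valid = 0;   /* consecutive valid bases seen, capped at k */

    for (size_t i = 0; i < len; i++) {
        int code = base_code(seq[i]);
        if (code < 0) {   /* non-ACGT character: restart the window */
            valid = 0;
            index = 0;
            continue;
        }
        index = ((index << 2) | (uint32_t)code) & mask;
        if (valid < k)
            valid++;
        if (valid == k)
            counts[index]++;
    }
}
```

With this layout the table has 4^k counters however long the input is: about a million entries for k = 10, and four times more for each further increment of k.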

        Wolfgang
