
**TiborNagy** · 06-05-2014, 04:39 AM

Or you can use wordcount in Emboss.

**Wizard50** · 06-05-2014, 05:10 AM

First of all, thank you for your answer.

The issue with wordcount in EMBOSS, as with Jellyfish or DSK, is that it extracts all unique words of size k together with their counts.
In my case, I don't need the counts and I don't want to read only the unique k-mers. If k-mer x appears y times in my FASTA file, I want to read it y times, at each position where it appears (the order is important).

Furthermore, I want to use a buffer, because extracting the k-mers should never use more memory than the size of this buffer. Most k-mer counting tools let me set how much memory (RAM or disk) they may use, but they don't let me process the extracted k-mers whenever the allocated memory is full.
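
To make the requirement concrete, here is a minimal Python sketch of in-order extraction with a bounded buffer. It is not taken from any of the tools mentioned in this thread. The names `iter_kmers`, `process_in_batches`, `buffer_size` and `handle` are hypothetical, and it assumes a plain (uncompressed) FASTA input in which k-mers must not span two records.

```python
from typing import Callable, Iterable, Iterator, List


def iter_kmers(lines: Iterable[str], k: int) -> Iterator[str]:
    """Yield every k-mer in file order, duplicates included."""
    window = ""  # last k-1 bases, carried across line breaks
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            window = ""  # new record: k-mers do not span two sequences
            continue
        seq = window + line.upper()
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k]
        # Keep only the tail needed to build k-mers that cross the line break.
        window = seq[-(k - 1):] if k > 1 else ""


def process_in_batches(
    kmers: Iterable[str],
    buffer_size: int,
    handle: Callable[[List[str]], None],
) -> None:
    """Collect at most buffer_size k-mers, hand them to handle, then reuse the buffer."""
    buffer: List[str] = []
    for kmer in kmers:
        buffer.append(kmer)
        if len(buffer) == buffer_size:
            handle(buffer)
            buffer.clear()  # handle must copy anything it wants to keep
    if buffer:
        handle(buffer)  # flush the final, partially filled buffer
```

With a real file this would be called as `process_in_batches(iter_kmers(open("reads.fa"), 21), 1_000_000, my_handler)`, where the file name and `my_handler` are placeholders. Because the file is read line by line and the buffer is cleared after each call, memory use is bounded by the buffer plus one input line.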

**wokai001** · 06-06-2014, 04:20 AM

I have created an R package which can do a variety of k-mer counts on fasta and fastq. The C library is open, so you can insert specialized actions at specific points. The only drawback is that the library starts to get slow when k>10.

You can download it from R-forge:
install.packages("seqTools", repos="http://R-Forge.R-project.org")

Additionally, I have a manuscript in preparation because I found batch effects in fastq files by clustering k-mer counts:

Hierarchical clustering of DNA k-mer counts in RNA-seq fastq files reveals batch effects

http://arxiv.org/abs/1405.0114

Batch effects, artificial sources of variation due to experimental design, are a widespread phenomenon in high throughput data. Therefore, mechanisms for detection of batch effects are needed requiring comparison of multiple samples. We apply hierarchical clustering (HC) on DNA k-mer counts of multiple RNA-seq derived Fastq files. Ideally, HC generated trees reflect experimental treatment groups and thus may indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. In order to provide a simple applicable tool we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. DNA k-mer counts were analysed on 61 Fastq files containing RNA-seq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced on 8 different Illumina Flowcells. Results: Pairwise comparison of all Flowcells with hierarchical clustering revealed strong Flowcell based tree separation in 6 (21 %) and detectable Flowcell based clustering in 17 (60.7 %) of 28 Flowcell comparisons. In our samples, batch effects were also present in reads mapped to the human genome. Filtering reads for high quality (Phred >30) did not remove the batch effects. Conclusions: Hierarchical clustering of DNA k-mer counts provides a quality criterion and an unspecific diagnostic tool for RNA-seq experiments.

The C code may be a bit difficult to understand because I keep two sequence arrays (to handle compressed files and to skip newlines). It works sequentially, so memory consumption mainly depends on k. Just contact me if you have questions. Any feedback would be great.
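
As a rough illustration of why memory in a sequential counter depends mainly on k, here is a Python sketch. It is not the seqTools C code. It assumes a count table with one slot per possible k-mer (4^k entries) and a 2-bit encoding of A, C, G, T; the name `count_kmers` and the choice to restart the window at any other character are assumptions made for the example.

```python
from array import array
from typing import Iterable

# Assumed 2-bit encoding of the four DNA bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(lines: Iterable[str], k: int) -> array:
    """Count k-mers sequentially; the table size is 4**k, independent of input size."""
    counts = array("Q", [0]) * (4 ** k)  # one unsigned 64-bit counter per k-mer
    mask = 4 ** k - 1  # keeps only the lowest 2*k bits of the rolling index
    index = 0  # 2-bit-packed value of the current window
    valid = 0  # number of consecutive valid bases seen so far
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            index, valid = 0, 0  # new record: restart the window
            continue
        for base in line.upper():
            code = CODE.get(base)
            if code is None:
                index, valid = 0, 0  # e.g. N: restart the window
                continue
            index = ((index << 2) | code) & mask
            valid += 1
            if valid >= k:
                counts[index] += 1
    return counts
```

Under these assumptions the table holds 4^k counters whatever the file size, which is about 8 MB of counters for k = 10 and about 128 MB for k = 12.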

Wolfgang


Extract k-mers from a FASTA file
