
**GenoMax** · 09-04-2014, 03:45 AM

If you know the pattern you want to look for (e.g. SPAM in the example above), then fuzznuc from EMBOSS may be an option.

If you are looking for patterns de novo then a k-mer search (of the length you want) may work.
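As a rough illustration of the k-mer idea, here is a minimal Python sketch that counts every k-mer in a set of sequences so that over-represented ones stand out. The function name `count_kmers` and the toy read are assumptions made for the example, and a dedicated k-mer counter would be a better fit for real read sets.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every length-k substring across the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy input (hypothetical read) just to show the call.
counts = count_kmers(["GGGGGACGTGGGGGGGACGT"], k=4)
print(counts.most_common(3))
```

This keeps all counts in memory, so it is only meant for small inputs or for checking an idea.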

Having to allow for mismatches would make this job a lot more difficult.
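To show why mismatches add work, here is a naive sketch that checks every window of the sequence against the pattern and keeps those within a given number of substitutions. The name `hamming_matches` is made up for this example. It costs roughly sequence length times pattern length, which is why dedicated tools use k-mer indexing instead.

```python
def hamming_matches(sequence, pattern, max_mismatches):
    """Return 0-based start positions where the pattern matches the
    sequence with at most max_mismatches substitutions (no indels)."""
    hits = []
    m = len(pattern)
    for start in range(len(sequence) - m + 1):
        mismatches = 0
        for a, b in zip(sequence[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # Loop finished without break: window is within the limit.
            hits.append(start)
    return hits

print(hamming_matches("AAGTTTACGA", "ACGT", 1))
```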

Then there may be the possibility of using "grep" in some creative ways (if you know the pattern you want to search for).
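For an exact pattern, something grep-like is easy to sketch in Python as well. The helper below (the name `find_exact` is an assumption for this example) returns the start position of every exact occurrence, including overlapping ones, using a regex lookahead.

```python
import re

def find_exact(sequence, pattern):
    """Return 0-based start positions of every exact match, overlaps included."""
    # A zero-width lookahead matches at each start position without consuming
    # the pattern, so overlapping occurrences are all reported.
    regex = re.compile("(?=" + re.escape(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]

print(find_exact("GGGGGACGTGGGGGGGACGT", "ACGT"))
```

This only searches the forward strand; the reverse complement of the pattern would need a second call.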

**Brian Bushnell** · 09-04-2014, 08:34 AM

BBDuk can find all instances of a string (up to 31 bp) allowing a set number of mismatches. It will not return a matrix of positions, but it can replace all instances with some symbol that is then easy to find with a different tool that is not capable of handling mismatches. For example:

Code:

bbduk.sh -Xmx1g in=reads.fa out=masked.fa literal=ACGT k=4 ktrim=x hdist=1 rcomp=f

For the input file:

Code:

>1
GGGGGACGTGGGGGGGACGT

the output would be

Code:

>1
GGGGGxxxxGGGGGGGxxxx

You can use the "hdist" flag to specify the maximum Hamming distance (number of mismatches) allowed, and the "rcomp" flag to determine whether only forward or both forward and reverse-complement sequences will be replaced. It's fast and multithreaded.

Edit - the functionality I suggested seems to be broken; I will investigate.
Edit2 - Fixed now as of v33.40b.
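Since the masked output is meant to be scanned by a second tool, here is one possible sketch of that second step in Python: it reads FASTA text and reports where each run of the mask symbol starts and ends. The function name `masked_runs` and the choice of 0-based, end-exclusive coordinates are assumptions for the example, not part of BBDuk.

```python
import re

def masked_runs(fasta_text, symbol="x"):
    """Return (record_name, start, end) for each run of the mask symbol.
    Coordinates are 0-based and end-exclusive."""
    run = re.compile(re.escape(symbol) + "+")
    records = []  # each entry is [name, sequence]
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            # Join wrapped sequence lines so runs spanning lines stay intact.
            records[-1][1] += line
    return [(name, m.start(), m.end())
            for name, seq in records
            for m in run.finditer(seq)]

masked = ">1\nGGGGGxxxxGGGGGGGxxxx\n"
for name, start, end in masked_runs(masked):
    print(name, start, end)
```

Note that adjacent or overlapping matches merge into a single run, so a run longer than the pattern can hide more than one hit.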

**maubp** · 09-04-2014, 05:01 PM

Another k-mer based approach would be mirabait from the MIRA v4 assembler. This assumes your pattern SPAM is short enough that you can use it directly as a k-mer to search for.

**Rammaria** · 09-09-2014, 03:24 AM

Many thanks to you all for your great solutions!

I know the pattern exactly (it must be strictly 'SPAM' in my example), so fuzznuc and bbduk seem to be right for me.

**Rammaria** · 09-09-2014, 03:56 AM

Brian, how can I cite bbduk and other bbtools if I use them?

**Brian Bushnell** · 09-09-2014, 07:58 AM

They're not yet published, so you can just cite my name and the Sourceforge website (https://sourceforge.net/projects/bbmap/).

**Rammaria** · 09-09-2014, 08:51 AM

Ok, thank you!

Fast tool for finding a subsequences?