**malachig** · 11-17-2010, 02:19 PM

This should be fun. A real classic bioinformatics task for beginners.

There are some good books out there for learning how to solve these problems.
Beginning Perl for Bioinformatics
Bioinformatics Programming in Python: A Practical Course for Beginners

For working environments you could try:
DNA Linux

This kind of task is also an excellent starting point for learning simple scripting tasks on your own. In other words, you could use this as an excuse to learn some Python, Perl, Regex, Awk, etc.

There are also packages and libraries that have already solved many of these basic bioinformatics tasks. To name just a few: BioPerl, Biopython, EMBOSS.

**malachig** · 11-17-2010, 03:00 PM

In case you feel that my previous post was dodging your question, attached is an example Perl script that you could use as a starting point. It uses a regex to identify occurrences of one string (an RE sequence) within another string (a chromosome).

In this example, if you want to get all the EcoRI sites on chromosome 22, you would run this from a Linux prompt:
./findRestrictionSites.pl --genome_version=hg19 --chromosome=22 --re_site=GAATTC

The output will be one site per line in the format: chr:start-end
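
As a rough illustration of the regex approach described above (this is not the attached Perl script), here is a minimal Python sketch that finds every occurrence of a recognition sequence in a chromosome string and prints one site per line as chr:start-end. The function name `find_sites`, the toy sequence, and the 1-based inclusive coordinates are assumptions made for the example; the attached script may use a different coordinate convention.

```python
import re

def find_sites(chrom_name, chrom_seq, re_site):
    """Return 'chr:start-end' strings for every occurrence of re_site.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    # A zero-width lookahead lets overlapping occurrences be reported too.
    pattern = re.compile("(?=" + re.escape(re_site.upper()) + ")")
    sites = []
    for m in pattern.finditer(chrom_seq.upper()):
        start = m.start() + 1
        end = m.start() + len(re_site)
        sites.append(f"{chrom_name}:{start}-{end}")
    return sites

# Toy sequence standing in for a real chromosome (hypothetical data).
toy_chr = "ttGAATTCaaccGAATTCg"
for site in find_sites("chr22", toy_chr, "GAATTC"):
    print(site)
```

The sequence is upper-cased before matching because reference FASTA files often use lower case for soft-masked repeats, and those bases should still be searched.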

There is also a list of online RE analysis tools here.

Attached Files

findRestrictionSites.pl (1.6 KB, 513 views)

**obig** · 11-17-2010, 03:04 PM

If you prefer to use R/Bioconductor, you might investigate the BSgenome and Biostrings packages. Here's a document walking you through the process:

http://www.bioconductor.org/packages/2.3/bioc/vignettes/BSgenome/inst/doc/GenomeSearching.pdf

**lunacab** · 11-18-2010, 05:37 PM

Thanks a lot! Very very useful!

**ParthavJailwala** · 08-19-2011, 11:59 AM

I have used Biostrings and BSgenome to find restriction sites in the mouse genome, and it works great. The only caveat is that you have to use 'matchPattern()' on a per-chromosome basis, and then append all the output files if a single per-genome file is desired.
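
The same per-chromosome pattern (search each chromosome separately, then combine the results) can be sketched outside R. The following is a Python analogue, not the Biostrings/BSgenome API: `read_fasta` and `genome_sites` are hypothetical helpers, the toy FASTA text stands in for a real genome file, and coordinates are assumed to be 1-based and inclusive.

```python
import io
import re

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the chromosome name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def genome_sites(handle, re_site):
    """Search each chromosome separately and collect all hits in one list."""
    pattern = re.compile("(?=" + re.escape(re_site.upper()) + ")")
    rows = []
    for chrom, seq in read_fasta(handle):
        for m in pattern.finditer(seq.upper()):
            rows.append((chrom, m.start() + 1, m.start() + len(re_site)))
    return rows

# Toy two-chromosome "genome" standing in for a real FASTA file (hypothetical data).
toy_genome = io.StringIO(">chr1\nAAGAATTC\nGG\n>chr2 description\nGAATTCGAATTC\n")
for chrom, start, end in genome_sites(toy_genome, "GAATTC"):
    print(f"{chrom}\t{start}\t{end}")
```

Because each record is read and searched in turn, only one chromosome is held in memory at a time, and the combined list plays the role of the single per-genome file.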

**Vandelnokk** · 01-22-2016, 04:03 AM

HiCUP

Hi,

Check out the HiCUP Digester script, which is part of the HiCUP pipeline:

http://www.bioinformatics.babraham.ac.uk/projects/hicup/scripts_description/#Digester

Best

**craigdj** · 02-23-2016, 07:20 AM

Hi lunacab,

Would you be willing to share your data regarding the restriction site coordinates in the human genome? It would be incredibly helpful!

**dariober** · 02-23-2016, 07:38 AM

Originally posted by malachig

It uses regex to identify occurrences of one string (an RE sequence) within another string (a chromosome).

Just as a comment, if I'm not mistaken your script reverse-complements the regular expression, which is something that cannot be done. I'd rather reverse-complement the reference sequence, even if it is more "expensive".

**blancha** · 02-23-2016, 07:43 AM

EMBOSS is an old program, but it works remarkably well for this type of task.
Don't be fooled by the dated website.
It is a very efficient program.

EMBOSS: restrict

http://emboss.sourceforge.net/apps/cvs/emboss/apps/restrict.html

**malachig** · 07-09-2017, 03:36 PM

Originally posted by dariober

Just as a comment, if I'm not mistaken your script reverse-complements the regular expression, which is something that cannot be done. I'd rather reverse-complement the reference sequence, even if it is more "expensive".

I'm not sure I follow. It doesn't reverse-complement the "regular expression"; it reverse-complements the restriction enzyme sequence (a string) that is used in the regular expression. We can either (A) search for our string of interest in the reference sequence and in its reverse complement, or (B) search for our string of interest and its reverse complement in the reference sequence.

These two approaches should be equivalent. The script uses option B.
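
To make the equivalence concrete, here is a small Python sketch (again, not the attached Perl script) that implements both options for a plain-string site and compares the hits. The function names, the toy reference, and the 0-based half-open coordinates are assumptions for illustration. Note that a palindromic site such as GAATTC is its own reverse complement, so both strands yield the same interval; collecting hits in a set avoids reporting it twice.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse-complement a plain DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def forward_hits(ref, site):
    """0-based, half-open (start, end) intervals where site occurs in ref."""
    pat = re.compile("(?=" + re.escape(site) + ")")
    return {(m.start(), m.start() + len(site)) for m in pat.finditer(ref)}

def option_a(ref, site):
    """(A) Search for the site in the reference and in its reverse complement."""
    n = len(ref)
    plus = forward_hits(ref, site)
    # Map hits on the reverse-complemented reference back to forward coordinates.
    minus = {(n - end, n - start) for start, end in forward_hits(revcomp(ref), site)}
    return plus | minus

def option_b(ref, site):
    """(B) Search for the site and its reverse complement in the reference."""
    return forward_hits(ref, site) | forward_hits(ref, revcomp(site))

# Toy reference with one non-palindromic site on each strand and one EcoRI site.
ref = "ACGGTCTCAAGAGACCTTGAATTCA"
for site in ("GGTCTC", "GAATTC"):
    print(site, option_a(ref, site) == option_b(ref, site))
```

Option B only reverse-complements the short site string, so it avoids building a second copy of each chromosome, which matters for genome-sized inputs.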

how to compute all restriction enzyme sites in the human genome?