  • lunacab
    Junior Member
    • Oct 2010
    • 2

    how to compute all restriction enzyme sites in the human genome?

    Dear colleagues,
    I have a very simple question to ask but I am struggling with it...
    I have a restriction enzyme with a 6-nucleotide recognition site and I want to find ALL sites within the human genome (hg19, for instance) where the recognition sequence matches.
    I tried using BLAST, but my query sequence seems to be too short, so it never returns any hits.
    Any recommendations on how to compute that?
    thanks a lot in advance
  • malachig
    Senior Member
    • Aug 2010
    • 117

    #2
    This should be fun. A real classic bioinformatics task for beginners.

    There are some good books out there for learning how to solve these problems.
    Beginning Perl for Bioinformatics
    Bioinformatics Programming in Python: A Practical Course for Beginners

    For working environments you could try:
    DNA Linux

    This kind of task is also an excellent starting point for learning simple scripting tasks on your own. In other words, you could use this as an excuse to learn some Python, Perl, Regex, Awk, etc.

    There are also packages/libraries of code that have already solved many of these basic bioinformatics tasks. To name just a few: BioPerl, Biopython, EMBOSS, etc.


    • malachig
      Senior Member
      • Aug 2010
      • 117

      #3
      In case you feel that my previous post was dodging your question ... attached is an example Perl script that you could use as a starting point. It uses regex to identify occurrences of one string (an RE sequence) within another string (a chromosome).

      In this example, if you want to get all the EcoRI sites on chromosome 22, you would run this (from a Linux prompt):
      ./findRestrictionSites.pl --genome_version=hg19 --chromosome=22 --re_site=GAATTC

      The output will be one site per line in the format: chr:start-end

      There is also a list of online RE analysis tools here.
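For readers who prefer Python, the same regex idea can be sketched roughly as follows. This is not the attached Perl script, just a minimal illustration: the function name `find_sites` and the toy sequence are made up here, and the 1-based, inclusive coordinates are an assumption about the chr:start-end format. The search is case-insensitive because the UCSC hg19 FASTA files use lowercase letters for repeat-masked regions.

```python
import re

def find_sites(chrom_name, sequence, re_site):
    """Return 'chr:start-end' strings for every occurrence of re_site.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    # A lookahead matches at every position, so overlapping sites are kept.
    pattern = re.compile("(?=" + re.escape(re_site) + ")", re.IGNORECASE)
    sites = []
    for match in pattern.finditer(sequence):
        start = match.start() + 1
        end = match.start() + len(re_site)
        sites.append(f"{chrom_name}:{start}-{end}")
    return sites

# Toy sequence standing in for a real chromosome (hypothetical data).
toy = "ttGAATTCaaagaattcGG"
for site in find_sites("chr22", toy, "GAATTC"):
    print(site)
# prints:
# chr22:3-8
# chr22:12-17
```

For a real chromosome you would read the sequence from the hg19 FASTA file first and pass it in as one string with the line breaks removed.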

      • obig
        Member
        • Nov 2010
        • 12

        #4
        If you prefer to use R/Bioconductor, you might investigate the BSgenome and Biostrings packages. Here's a document walking you through the process:


        • lunacab
          Junior Member
          • Oct 2010
          • 2

          #5
          Thanks a lot! Very very useful!


          • ParthavJailwala
            Member
            • Oct 2009
            • 27

            #6
            I have used Biostrings and BSgenome to find restriction sites in the mouse genome... it works great. The only caveat is that you have to use 'matchPattern()' on a per-chromosome basis and then concatenate the output files if you want a single file for the whole genome.
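The per-chromosome pattern described here is not specific to R. As a rough Python sketch of the same workflow (search each chromosome in turn, collect everything into one genome-wide list), assuming a plain multi-FASTA input; `read_fasta`, `genome_sites`, and the toy genome are hypothetical names and data made up for illustration:

```python
import io
import re

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA handle, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the chromosome name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def genome_sites(handle, re_site):
    """Search each chromosome separately and pool the hits as (chrom, start, end)."""
    pattern = re.compile("(?=" + re.escape(re_site) + ")", re.IGNORECASE)
    hits = []
    for chrom, seq in read_fasta(handle):
        for m in pattern.finditer(seq):
            # 1-based, inclusive coordinates (an assumption for this sketch).
            hits.append((chrom, m.start() + 1, m.start() + len(re_site)))
    return hits

# Two-record toy genome standing in for a real FASTA file.
toy_genome = io.StringIO(">chr1\nAAGAATTCAA\n>chr2 extra\nGAATTCGAATTC\n")
for chrom, start, end in genome_sites(toy_genome, "GAATTC"):
    print(f"{chrom}\t{start}\t{end}")
```

Because each record is joined into one string before searching, sites that span a line break in the FASTA file are still found. For a real genome, replace the `io.StringIO` object with an opened file.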


            • Vandelnokk
              Junior Member
              • Oct 2012
              • 2

              #7
              HiCUP

              Hi,

              Check out the HiCUP digester in its pipeline:


              Best


              • craigdj
                Junior Member
                • Feb 2016
                • 1

                #8
                Hi lunacab,

                Would you be willing to share your data regarding the restriction site coordinates in the human genome? It would be incredibly helpful!


                • dariober
                  Senior Member
                  • May 2010
                  • 311

                  #9
                  Originally posted by malachig
                  It uses regex to identify occurrences of one string (an RE sequence) within another string (a chromosome).
                  Just as a comment, if I'm not mistaken your script reverse-complements the regular expression, which is something that cannot be done. I'd rather reverse-complement the reference sequence, even if it is more "expensive".


                  • blancha
                    Senior Member
                    • May 2013
                    • 367

                    #10
                    EMBOSS is an old program, but it works remarkably well for this type of task.
                    Don't be fooled by the dated website.
                    It is a very efficient program.


                    • malachig
                      Senior Member
                      • Aug 2010
                      • 117

                      #11
                      Originally posted by dariober
                      Just as a comment, if I'm not mistaken your script reverse-complements the regular expression, which is something that cannot be done. I'd rather reverse-complement the reference sequence, even if it is more "expensive".
                      I'm not sure I follow. It doesn't reverse-complement the "regular expression"; it reverse-complements the restriction enzyme sequence (a string) that is used in the regular expression. We can either (A) search for our string of interest in the reference sequence and in its reverse complement, or (B) search for our string of interest and its reverse complement in the reference sequence.

                      These two approaches should be equivalent. The script uses option B.
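A small Python check of that equivalence, for anyone who wants to convince themselves. This is a sketch under simple assumptions (a plain A/C/G/T recognition sequence with no degenerate bases, 0-based forward-strand start positions), not the Perl script itself; the function names and toy sequences are made up for illustration:

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a plain A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]

def forward_starts(seq, site):
    """0-based start positions of site in seq, overlapping hits included."""
    return [m.start() for m in re.finditer("(?=" + re.escape(site) + ")", seq)]

def option_a(ref, site):
    """(A) Search for the site in the reference and in its reverse complement."""
    n, k = len(ref), len(site)
    plus = forward_starts(ref, site)
    # A hit at position p on the reverse-complemented reference corresponds
    # to forward-strand start n - p - k.
    minus = [n - p - k for p in forward_starts(revcomp(ref), site)]
    return sorted(set(plus) | set(minus))

def option_b(ref, site):
    """(B) Search for the site and its reverse complement in the reference."""
    hits = set(forward_starts(ref, site)) | set(forward_starts(ref, revcomp(site)))
    return sorted(hits)

# Non-palindromic site (GGTCTC) present once on each strand of a toy reference.
ref = "AAGGTCTCTTGAGACCAA"
print(option_a(ref, "GGTCTC"))  # [2, 10]
print(option_b(ref, "GGTCTC"))  # [2, 10]

# Palindromic site (GAATTC): the second search finds the same position again,
# so the set union is what prevents reporting it twice.
print(option_b("TTGAATTCTT", "GAATTC"))  # [2]
```

Option B only reverse-complements a 6-letter string, while option A reverse-complements the whole chromosome, so B is the cheaper of the two.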
