  • cliff
    Member
    • Oct 2009
    • 41

    Where to find coordinates for promoters, splice sites, and splice regulatory sites?

    Hi, All

    I am looking for the coordinates of promoters, splice sites, and splice regulatory sites for a whole-genome sequencing project. I want to know the SNP distribution in those regions. Does anyone here have any experience with that?

    Thanks
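Once region coordinates are in hand, the SNP distribution asked about here can be tallied with a simple interval lookup. The following is only a minimal sketch: the function names are hypothetical, regions are assumed to be 0-based half-open (BED-style) intervals, and SNP positions are assumed to be 0-based as well.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals per chromosome."""
    merged = defaultdict(list)
    for chrom, start, end in sorted(intervals):
        runs = merged[chrom]
        if runs and start <= runs[-1][1]:
            # Extends the previous run, so grow it instead of adding a new one.
            runs[-1][1] = max(runs[-1][1], end)
        else:
            runs.append([start, end])
    return merged


def count_snps_in_regions(snps, intervals):
    """Count SNPs (chrom, pos) that fall inside at least one region."""
    merged = merge_intervals(intervals)
    hits = 0
    for chrom, pos in snps:
        runs = merged.get(chrom, [])
        # Index of the last merged run whose start is <= pos.
        i = bisect_right(runs, [pos, float("inf")]) - 1
        if i >= 0 and runs[i][0] <= pos < runs[i][1]:
            hits += 1
    return hits


regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
snps = [("chr1", 100), ("chr1", 299), ("chr1", 300), ("chr2", 10), ("chr3", 5)]
print(count_snps_in_regions(snps, regions))
```

Merging first means a SNP lying in two overlapping promoter windows is counted once. Running the same count separately per region type (promoter, splice site, and so on) gives the distribution.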
  • svl
    Member
    • Sep 2009
    • 43

    #2
    One option (but perhaps not the most complete) is to use the Ensembl SnpEffectPredictor script:
    ftp://ftp.ensembl.org/pub/misc-scrip...predictor_1.0/

    There is also an online version (mirror site, because their main site is down):


    Both will annotate your variants (e.g. chr:1 123456 A/T) with their effect (e.g. NON_SYNONYMOUS, but also SPLICE_SITE and ESSENTIAL_SPLICE_SITE).

    This is the list with types it will return:


    /svl

    • cliff
      Member
      • Oct 2009
      • 41

      #3
      Thanks! Do you know of any script or online tool that can handle data downloaded from UCSC? I am concerned about possible conflicts in format or coordinates between Ensembl and UCSC.
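For what it is worth, two of the usual mismatches between the sources are mechanical: Ensembl names chromosomes "1", "X", "MT" where UCSC uses "chr1", "chrX", "chrM", and BED files from UCSC are 0-based half-open while Ensembl coordinates are 1-based inclusive. A small sketch of the conversion follows. The function names are hypothetical, both data sets are assumed to be on the same genome assembly, and unplaced scaffolds (which are named differently in each resource) are not handled.

```python
def ensembl_to_ucsc_chrom(name: str) -> str:
    """Map an Ensembl-style primary chromosome name to the UCSC style."""
    if name.startswith("chr"):
        return name  # already UCSC style
    if name == "MT":
        return "chrM"  # the mitochondrial genome is the one irregular name
    return "chr" + name


def one_based_to_bed(chrom: str, start: int, end: int):
    """Convert a 1-based inclusive interval to a 0-based half-open BED interval."""
    return ensembl_to_ucsc_chrom(chrom), start - 1, end


# A single-base variant at Ensembl position 1:123456
print(one_based_to_bed("1", 123456, 123456))  # ('chr1', 123455, 123456)
```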

      • jdanderson
        Member
        • Sep 2010
        • 45

        #4
        Hello Cliff,

        I don't know if this will help, but you can go to the UCSC Genome Browser, under the Tables header, and play with the settings to get a semi-customizable list of coordinates (e.g. TSS or exon coordinates), then send them to Galaxy or download them in a few different formats.

        Hope this helps.

        Regards,
        Johnathon

        • cliff
          Member
          • Oct 2009
          • 41

          #5
          Hi, Johnathon

          Thanks for your response! I know we can get exon coordinates from the UCSC Table Browser. Do you know where we can get coordinates for promoters and splice sites?

          -C

          • jdanderson
            Member
            • Sep 2010
            • 45

            #6
            Hello Cliff,

            Well, I figured the exon boundaries would be the de facto splice sites (a script could parse the data for you). As for the promoter, that seems like a tough call. Even if by promoter you mean the core (~-35 bp) and/or the proximal promoter (~-250 to -300 bp), not all genes are well characterized in this fashion, to my knowledge (some have a TATA box, some CpG islands, depending on the type of gene). If by promoter you mean to include enhancer regions (as is sometimes the case in common usage), this is even less well characterized, and enhancers can lie as far as ~100 kb (-100,000 bp) upstream (and transcription factor prediction programs aren't much help in my experience). If it's of any help, you can also find the TSS, which may give some indication of where the polymerase binds. Also, many genes have alternative promoters and TSSs that need to be taken into account.

            Sorry if all of this is old news, just trying to throw some ideas out there. Wish I could be of more help. It sounds like you have an ambitious project in mind. I would be interested in hearing the results, especially on the regulatory side.

            Regards,
            Johnathon

            • malachig
              Senior Member
              • Aug 2010
              • 117

              #7
              Those are three pretty big questions. Promoters, splice sites, and splice regulatory elements.

              Promoters. I agree with jdanderson that it depends on what you mean by promoters. The 'regulation' tracks available in the UCSC genome browser contain many relevant data sets. As mentioned, one strategy is simply to use the transcription start sites themselves as an indicator of where promoters likely reside. A second option is to use preexisting experimental data, such as the results of RNA Pol II binding assays or epigenetic profiling by ChIP-Seq. For example, various histone modifications (methylation, acetylation) are associated with transcription initiation, and these have been profiled for various tissues. Third, bioinformatic prediction of promoter elements is a huge field in itself. Have you considered cisRED ("databases of genome-wide regulatory module and element predictions")? Fourth, if you want to download a list of high-quality annotated regulatory elements and their coordinates, I would recommend ORegAnno.

              Splice sites. Again, a huge area of research. There is a wide array of gene discovery and splice site prediction tools that will examine a sequence of genomic DNA and tell you the coordinates of possible splice sites. As others have mentioned, it is probably a lot easier to use the exon-exon connections currently present in known transcript models (which are largely based on full-length cDNA sequencing followed by gapped alignment to a reference genome). For example, to get a comprehensive list of splice sites you could use the UCSC Table Browser. Download in BED format the gene tables for UCSC genes, CCDS, Ensembl, RefSeq, MGC, and Vega. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.
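As a rough illustration of that scripting task, the sketch below pulls the introns out of transcript records and pools them across gene sets. It assumes the downloads are 12-column BED (blockSizes and blockStarts in columns 11 and 12, block starts relative to chromStart); the function names and the in-memory example records are hypothetical stand-ins for the real files.

```python
def introns_from_bed12(line):
    """Return (chrom, intron_start, intron_end, strand) for one BED12 transcript."""
    f = line.rstrip("\n").split("\t")
    chrom, tx_start, strand = f[0], int(f[1]), f[5]
    # UCSC writes a trailing comma in both block columns, hence the `if x`.
    sizes = [int(x) for x in f[10].split(",") if x]
    starts = [int(x) for x in f[11].split(",") if x]
    exons = [(tx_start + s, tx_start + s + z) for s, z in zip(starts, sizes)]
    # An intron is the gap between each pair of consecutive exons.
    return [(chrom, left[1], right[0], strand) for left, right in zip(exons, exons[1:])]


def nonredundant_introns(sources):
    """Pool introns from several iterables of BED12 lines, keeping each one once."""
    seen = set()
    for lines in sources:
        for line in lines:
            if not line.strip() or line.startswith(("track", "browser", "#")):
                continue  # skip blank lines and header lines
            seen.update(introns_from_bed12(line))
    return sorted(seen)


# Two tiny stand-in "gene tables"; with real data, pass open file handles instead.
refseq = ["chr1\t100\t500\ttxA\t0\t+\t100\t500\t0\t3\t50,100,100,\t0,150,300,\n"]
ensembl = [
    "chr1\t100\t500\ttxA_copy\t0\t+\t100\t500\t0\t3\t50,100,100,\t0,150,300,\n",
    "chr1\t100\t500\ttxB\t0\t+\t100\t500\t0\t2\t50,100,\t0,300,\n",
]
for intron in nonredundant_introns([refseq, ensembl]):
    print(*intron, sep="\t")
```

Because the set is keyed on coordinates and strand, the same junction annotated in several gene sets is reported once, and the output grows with the number of distinct introns rather than with the number of overlapping transcript pairs.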

              Splice regulatory elements. This is arguably the most challenging of the three, and an area of very active research. Simply put, the regulatory elements that influence splicing beyond the splice sites themselves, i.e. exonic splicing silencers and enhancers (ESSs, ESEs) and intronic splicing silencers and enhancers (ISSs, ISEs), are not well known. The recent advent of RNA-seq technology will arguably allow us to start performing the experiments needed to characterize these sequences. To learn more about these elements and how they are defined, I would recommend 'Mechanisms of alternative pre-messenger RNA splicing' by Douglas Black. Some labs with recent publications on discovering the splicing regulatory code are those of Christopher Burge, Robert Darnell, and Benjamin Blencowe.

              • steven
                Senior Member
                • Aug 2009
                • 269

                #8
                A quick and dirty way, all from the UCSC Tables, group="Genes and gene prediction tracks", output format="BED".

                Promoter (kind of): select "Upstream by" = 500 or 1 kb or whatever you want. Of course, consider the limitations described in the posts above.
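If the gene coordinates are already downloaded, the same kind of upstream window can be computed locally. This is a minimal sketch, not a description of what the Table Browser does internally: the function name is hypothetical, input is assumed to be a 0-based half-open BED interval with a strand, and the window is not clipped at the chromosome end.

```python
def upstream_window(chrom, start, end, strand, size=500):
    """Return the BED interval covering `size` bases upstream of the TSS."""
    if strand == "+":
        # TSS is the left edge; upstream means smaller coordinates.
        return chrom, max(0, start - size), start
    if strand == "-":
        # TSS is the right edge; upstream means larger coordinates.
        return chrom, end, end + size
    raise ValueError(f"unknown strand: {strand!r}")


print(upstream_window("chr1", 1000, 2000, "+"))  # ('chr1', 500, 1000)
print(upstream_window("chr1", 1000, 2000, "-"))  # ('chr1', 2000, 2500)
```

The strand check is the part that is easy to get wrong by hand: for minus-strand genes the promoter sits to the right of the interval, not the left.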

                Splice sites: select "Introns" and extract their extremities once downloaded (or send it to Galaxy from the previous screen to do it online). I find it easier to get splice sites from introns than from exons, since there is no need to filter out TSSs and polyA sites.
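Extracting the extremities of each intron could look like the sketch below. The function name is hypothetical, and the default width of 2 bases (the dinucleotide at each end of the intron) is an assumption; widen it if you also want the flanking positions.

```python
def splice_sites(chrom, start, end, strand, width=2):
    """Return donor and acceptor intervals for one intron in BED coordinates."""
    left = (chrom, start, start + width)   # first bases of the intron on the + strand
    right = (chrom, end - width, end)      # last bases of the intron on the + strand
    if strand == "+":
        return {"donor": left, "acceptor": right}
    if strand == "-":
        # On the minus strand the intron is read right to left.
        return {"donor": right, "acceptor": left}
    raise ValueError(f"unknown strand: {strand!r}")


print(splice_sites("chr1", 150, 250, "+"))
print(splice_sites("chr1", 150, 250, "-"))
```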

                • jdanderson
                  Member
                  • Sep 2010
                  • 45

                  #9
                  Hello All,

                  Wow, great last couple of posts. I was especially intrigued by the mention of the two databases (which I was not familiar with); very interesting. It sounds like there is a lot of interesting work being done by the people in here.

                  As somewhat of a side note, you could look at promoter-proximal introns, which can help regulate expression rates. I don't think many of these motifs are well characterized, although there is an open-source algorithm (IMEter) to search for these motifs (somewhat well validated) if you are interested in a particular set of genes/transcripts. See:

                  Rose, A.B., Elfersi, T., Parra, G., and Korf, I. (2008). Promoter-proximal introns in Arabidopsis thaliana are enriched in dispersed signals that elevate gene expression. Plant Cell.

                  Farquharson, K.L. (2008). The IMEter Predicts an Intron's Ability to Boost Gene Expression. Plant Cell.

                  Cheers,
                  Johnathon

                  • bioinfosm
                    Senior Member
                    • Jan 2008
                    • 483

                    #10
                    @malachig, thanks for the very useful post and resources!
                    --
                    bioinfosm

                    • cliff
                      Member
                      • Oct 2009
                      • 41

                      #11
                      Wow... I just noticed the latest responses. Thanks very much for your suggestions and comments, especially to svl, Johnathon, malachig, and steven!!!

                      You guys are awesome!

                      • sindrle
                        Senior Member
                        • Aug 2013
                        • 266

                        #12
                        Originally posted by malachig:
                        Download in BED format the gene table for UCSC genes, CCDS, Ensembl, Refseq, MGC, and Vega. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.
                        So, how to do that? :P

                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          Originally posted by sindrle:
                          So, how to do that? :P
                          Have a look at the Table Browser tutorial: http://genome.ucsc.edu/goldenPath/he...ablesHelp.html. In the end you will want to select BED as the output format.

                          You can get the UCSC, CCDS, RefSeq, Ensembl, VEGA, MGC genes by choosing the right tables to query against.

                          That can be followed by BEDTools intersectBed (or another appropriate option): http://bedtools.readthedocs.org/en/l...intersect.html

                          • sindrle
                            Senior Member
                            • Aug 2013
                            • 266

                            #14
                            OK, I downloaded everything you mentioned and ran this:

                            bedtools intersect -wo -bed -a file1 -b file2 > out1

                            But in the end the output file is 20 GB...

                            Tried this instead:

                            unionBedGraphs - file1 -file2 etc

                            But it gave this error:

                            Assertion failed: (!queue.empty()), function ConsumeNextCoordinate, file unionBedGraphs.cpp, line 99.
                            /usr/bin/unionBedGraphs: line 2: 21166 Abort trap: 6 ${0%/*}/bedtools unionbedg "$@"
