
  • Where to find coordinates for promoters, splice sites, and splice regulatory sites?

    Hi, All

    I am looking for the coordinates of promoters, splice sites, and splice regulatory sites for a whole-genome sequencing project. I want to know the SNP distribution in those regions. Does anyone here have any experience with that?

    Thanks

  • #2
    One option (but perhaps not the most complete) is to use the Ensembl SnpEffectPredictor script:
    ftp://ftp.ensembl.org/pub/misc-scrip...predictor_1.0/

    There is also an online version (mirror site, because their main site is down):


    Both will annotate your variants (e.g. chr:1 123456 A/T) with their effect (e.g. NON_SYNONYMOUS, but also SPLICE_SITE and ESSENTIAL_SPLICE_SITE).

    This is the list with types it will return:


    /svl



    • #3
      Thanks! Do you know of any script or online tool that can handle data downloaded from UCSC? I am concerned about possible conflicts in format or coordinates between Ensembl and UCSC.



      • #4
        Hello Cliff,

        I don't know if this will help, but you can go to the UCSC Genome Browser, open the Table Browser (under the Tables header), and play with the settings to get a semi-customizable list of coordinates, e.g. TSSs or exon coordinates, and have them sent to Galaxy or downloaded in a few different formats.

        Hope this helps.

        Regards,
        Johnathon



        • #5
          Hi, Johnathon

          Thanks for your response! I know we can get exon coordinates from the UCSC Table Browser. Do you know where we can get coordinates for promoters and splice sites?

          -C



          • #6
            Hello Cliff,

            Well, I figured the exon boundaries would be the de facto splice sites (a script could parse the data for you). As for the promoter, that seems like a tough call. Even if by promoter you mean the core (~-35 bp) and/or the proximal promoter (~-250 to -300 bp), not all genes are well characterized in this fashion, to my knowledge (some have a TATA box, some have CpG islands, depending on the type of gene). If by promoter you mean to include enhancer regions (as is sometimes the case in common usage), these are even less well characterized and can lie up to ~100 kb (100,000 bp) away (and transcription factor prediction programs aren't much help in my experience). If it's of any help, you can also find the TSS, which may give some indication of where the polymerase binds. Also, many genes have alternative promoters and TSSs that need to be taken into account.

            Sorry if all of this is old news, just trying to throw some ideas out there. Wish I could be of more help. It sounds like you have an ambitious project in mind. I would be interested in hearing the results, especially on the regulatory side.

            Regards,
            Johnathon



            • #7
              Those are three pretty big questions. Promoters, splice sites, and splice regulatory elements.

              Promoters. I agree with jdanderson that it depends what you mean by promoters. The 'regulation' tracks available in the UCSC genome browser contain many relevant data sets. As mentioned, one strategy is simply to use transcription start sites themselves as an indicator of where promoters likely reside. A second option is to use preexisting experimental data such as the results of RNA-PolII binding assays or epigenetic profiling by ChIP-Seq. For example, various histone modifications (methylation, acetylation) are associated with transcript initiation and these have been profiled for various tissues. Third, bioinformatic prediction of promoter elements is a huge field in itself. Have you considered cisRED: "databases of genome-wide regulatory module and element predictions"? Fourth, if you want to download a list of high quality annotated regulatory elements and their coordinates I would recommend ORegAnno.

              Splice sites. Again, a huge area of research. There is a wide array of gene discovery and splice site prediction tools that will examine a sequence of genomic DNA and tell you the coordinates of possible splice sites. As others have mentioned, it is probably a lot easier to use the exon-exon connections currently present in known transcript models (which are largely based on full-length cDNA sequencing followed by gapped alignment to a reference genome). For example, to get a comprehensive list of splice sites you could use the UCSC Table Browser. Download in BED format the gene tables for UCSC genes, CCDS, Ensembl, RefSeq, MGC, and Vega. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.
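A minimal sketch of that scripting task, assuming the downloads are standard 12-column BED files (one transcript per line, with exon sizes and starts in columns 11 and 12). The function names `junctions_from_bed12` and `merge_sources` and the example records are made up for illustration; this is one possible approach, not a tested pipeline.

```python
from typing import Iterable, List, Set, Tuple

# (chrom, intron_start, intron_end, strand); 0-based, half-open, like BED
Junction = Tuple[str, int, int, str]


def junctions_from_bed12(lines: Iterable[str]) -> Set[Junction]:
    """Collect intron coordinates implied by the exon blocks of BED12 records."""
    junctions: Set[Junction] = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skips blank lines, track/browser headers, short records
        chrom, tx_start, strand = fields[0], int(fields[1]), fields[5]
        # blockSizes / blockStarts are comma-separated, often with a trailing comma
        sizes = [int(x) for x in fields[10].split(",") if x]
        starts = [int(x) for x in fields[11].split(",") if x]
        exons = [(tx_start + s, tx_start + s + n) for s, n in zip(starts, sizes)]
        # each pair of consecutive exons defines one intron
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            junctions.add((chrom, left_end, right_start, strand))
    return junctions


def merge_sources(sources: Iterable[Iterable[str]]) -> List[Junction]:
    """Union the junctions from several gene tables; the set removes duplicates."""
    merged: Set[Junction] = set()
    for lines in sources:
        merged |= junctions_from_bed12(lines)
    return sorted(merged)


if __name__ == "__main__":
    # Hypothetical records standing in for two downloaded gene tables
    table_a = ["chr1\t100\t500\ttxA\t0\t+\t100\t500\t0\t3\t50,50,100,\t0,150,300,\n"]
    table_b = ["chr1\t100\t300\ttxB\t0\t+\t100\t300\t0\t2\t50,50,\t0,150,\n"]
    for junction in merge_sources([table_a, table_b]):
        print(*junction, sep="\t")
```

Each junction is an intron interval, so its two ends are the donor and acceptor sites (which end is which depends on the strand). Using a set keyed on coordinates and strand is what makes the result non-redundant across tables.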

              Splice regulatory elements. This is arguably the most challenging of the three, and an area of very active research. Simply put, the regulatory elements that influence splicing beyond the splice sites themselves, i.e. exonic splicing silencers and enhancers (ESSs, ESEs) and intronic splicing silencers and enhancers (ISSs, ISEs), are not well known. The recent advent of RNA-seq technology should allow us to start performing the experiments needed to characterize these sequences. To learn more about these elements and how they are defined I would recommend 'Mechanisms of alternative pre-messenger RNA splicing' by Douglas Black. Some labs with recent publications on the topic of discovering the splicing regulatory code are those of Christopher Burge, Robert Darnell, and Benjamin Blencowe.



              • #8
                A quick and dirty way, all from the UCSC Tables, group="Genes and gene prediction tracks", output format="BED".

                Promoter (kind of): select "Upstream by" = 500 bp, 1 kb, or whatever you want. Of course, consider the limitations described in the posts above.
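For anyone who would rather compute the same kind of window from a gene table they already have, here is a rough sketch. It assumes BED-style 0-based, half-open transcript coordinates and takes "upstream" relative to the strand; `promoter_window` is a made-up helper name, and the window is not clipped at the chromosome end.

```python
from typing import Tuple


def promoter_window(chrom: str, tx_start: int, tx_end: int, strand: str,
                    upstream: int = 500) -> Tuple[str, int, int]:
    """Return the interval `upstream` bp 5' of the TSS (0-based, half-open)."""
    if strand == "+":
        # TSS is at tx_start; upstream means smaller coordinates
        return chrom, max(0, tx_start - upstream), tx_start
    if strand == "-":
        # TSS is at tx_end; upstream means larger coordinates
        return chrom, tx_end, tx_end + upstream
    raise ValueError(f"unexpected strand: {strand!r}")


if __name__ == "__main__":
    # Hypothetical transcripts, one per strand
    print(promoter_window("chr1", 1000, 2000, "+"))
    print(promoter_window("chr1", 1000, 2000, "-"))
```

The strand check is the part that is easy to get wrong: for minus-strand genes the window sits after the transcript end, not before its start.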

                Splice sites: select "Introns" and extract their extremities once downloaded (or send the result to Galaxy from the previous screen to do it online). I find it easier to get splice sites from introns than from exons, since there is no need to filter out TSSs and polyA sites.
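Extracting the extremities could look roughly like this. The sketch assumes each downloaded intron is a BED interval (0-based, half-open) with a strand, and reports the first and last two bases of the intron as donor and acceptor; the name `splice_site_intervals` and the 2 bp width are illustrative choices, not something the Table Browser prescribes.

```python
from typing import Tuple

Site = Tuple[str, int, int, str]  # chrom, start, end, label


def splice_site_intervals(chrom: str, start: int, end: int, strand: str,
                          width: int = 2) -> Tuple[Site, Site]:
    """Return (donor, acceptor) intervals at the two ends of one intron."""
    if end - start < 2 * width:
        raise ValueError("intron is too short for two non-overlapping sites")
    left = (chrom, start, start + width)
    right = (chrom, end - width, end)
    if strand == "+":
        donor, acceptor = left, right   # 5' end of the intron is on the left
    elif strand == "-":
        donor, acceptor = right, left   # 5' end of the intron is on the right
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return donor + ("donor",), acceptor + ("acceptor",)


if __name__ == "__main__":
    # Hypothetical intron, shown on both strands
    print(splice_site_intervals("chr1", 150, 250, "+"))
    print(splice_site_intervals("chr1", 150, 250, "-"))
```

The resulting intervals are themselves valid BED coordinates, so they can be written out and intersected with a SNP file directly.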



                • #9
                  Hello All,

                  Wow, great last couple of posts. I was especially intrigued by the mention of the two databases (which I was not familiar with), very interesting. Sounds like there is a lot of interesting work being done by the people in here.

                  Somewhat of a side note, you could look at promoter proximal introns which can help regulate expression rates. I don't think many of these motifs are well characterized, although there is an open source algorithm (IMEter) to search for these motifs (somewhat well validated) if you are interested in a set of particular genes/transcripts. See:

                  Rose, A.B., Elfersi, T., Parra, G., and Korf, I. (2008). Promoter-proximal introns in Arabidopsis thaliana are enriched in dispersed signals that elevate gene expression. Plant Cell.

                  Farquharson, K.L. (2008). The IMEter Predicts an Intron's Ability to Boost Gene Expression. Plant Cell.

                  Cheers,
                  Johnathon



                  • #10
                    @malachig, thanks for the very useful post and resources!
                    --
                    bioinfosm



                    • #11
                      Wow... just noticed the latest responses. Thanks very much for your suggestions and comments, especially to svl, Johnathon, malachig, and steven!

                      You guys are awesome!



                      • #12
                        Originally posted by malachig
                        Download in BED format the gene table for UCSC genes, CCDS, Ensembl, Refseq, MGC, and Vega. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.
                        So, how to do that? :P



                        • #13
                          Originally posted by sindrle
                          So, how to do that? :P
                          Have a look at the Table Browser tutorial: http://genome.ucsc.edu/goldenPath/he...ablesHelp.html In the end you will want to select BED as the output format.

                          You can get the UCSC, CCDS, RefSeq, Ensembl, VEGA, MGC genes by choosing the right tables to query against.

                          That can be followed by BEDTools intersectBed (or another appropriate option): http://bedtools.readthedocs.org/en/l...intersect.html



                          • #14
                            Ok, I downloaded all you said and ran this:

                            bedtools intersect -wo -bed -a file1 -b file2 > out1

                            But at the end the output file is 20gb...

                            Tried this instead:

                            unionBedGraphs - file1 -file2 etc

                            But gave error:

                            Assertion failed: (!queue.empty()), function ConsumeNextCoordinate, file unionBedGraphs.cpp, line 99.
                            /usr/bin/unionBedGraphs: line 2: 21166 Abort trap: 6 ${0%/*}/bedtools unionbedg "$@"

