SEQanswers

cliff 10-02-2010 10:20 AM

where to find coordinates for promoter, splice site, splice regulatory site?
 
Hi, All

I am trying to find coordinates of promoters, splice sites, and splice regulatory sites for a whole genome sequencing project. I want to know the SNP distribution in those regions. Does anyone here have any experience with that?

Thanks

svl 10-03-2010 02:06 AM

One option (but perhaps not the most complete) is to use the Ensembl SnpEffectPredictor script:
ftp://ftp.ensembl.org/pub/misc-scrip...predictor_1.0/

There is also an online version (mirror site, because their main site is down):
http://useast.ensembl.org/tools.html

Both will annotate your variants (e.g. chr:1 123456 A/T) with their effect (e.g. NON_SYNONYMOUS, but also SPLICE_SITE and ESSENTIAL_SPLICE_SITE).

This is the list with types it will return:
http://useast.ensembl.org/info/docs/...ion/index.html

/svl

cliff 10-03-2010 09:49 AM

Thanks! Do you know of any script or online tool that can deal with data downloaded from UCSC? I am concerned about possible conflicts in format or coordinates between Ensembl and UCSC.
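
To make the coordinate concern concrete: the two usual mismatches are chromosome naming ("1" in Ensembl vs "chr1" at UCSC, and "MT" vs "chrM") and indexing (1-based inclusive positions vs the 0-based, half-open intervals used in UCSC BED files). A minimal Python sketch of the conversion, assuming those are the only differences that matter for the data at hand; the function names are made up for illustration:

```python
def ensembl_to_ucsc_bed(chrom, start, end):
    """Convert a 1-based, inclusive Ensembl-style interval
    to a 0-based, half-open UCSC BED interval."""
    ucsc_chrom = "chrM" if chrom == "MT" else "chr" + chrom
    # Only the start shifts: the 1-based inclusive end equals
    # the 0-based exclusive end.
    return ucsc_chrom, start - 1, end


def ucsc_bed_to_ensembl(chrom, start, end):
    """Convert a 0-based, half-open UCSC BED interval
    to a 1-based, inclusive Ensembl-style interval."""
    if chrom == "chrM":
        ens_chrom = "MT"
    elif chrom.startswith("chr"):
        ens_chrom = chrom[3:]
    else:
        ens_chrom = chrom
    return ens_chrom, start + 1, end


# A single-base variant at position 123456 on chromosome 1
bed_interval = ensembl_to_ucsc_bed("1", 123456, 123456)
round_trip = ucsc_bed_to_ensembl(*bed_interval)
```

Both files also need to come from the same genome assembly; this sketch does nothing about that.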

jdanderson 10-03-2010 05:24 PM

Hello Cliff,

I don't know if this will help, but you can go to the UCSC Genome Browser, open the Table Browser (under the Tables header), and play with the settings to get a semi-customizable list of coordinates, e.g. TSSs or exon coordinates, and either send them to Galaxy or download them in a few different formats.

Hope this helps.

Regards,
Johnathon

cliff 10-05-2010 10:39 AM

Hi, Johnathon

Thanks for your response! I know we can get exon coordinates from UCSC table browser. Do you know where we can get coordinates for promoter and splice sites?

-C

jdanderson 10-05-2010 11:31 AM

Hello Cliff,

Well, I guess I figured the exon boundaries would be the de facto splice sites (a script could parse the data for you). As for the promoter, that seems like a tough call. Even if by promoter you mean the core (~-35 bp) and/or the proximal promoter (~-250 to -300 bp), not all genes are well characterized in this fashion, to my knowledge (some have a TATA box, some CpG islands, depending on the type of gene). If by promoter you mean to include enhancer regions (as is sometimes the case in common usage), these are even less well characterized and can lie up to 100 kb upstream (and transcription factor prediction programs aren't much help, in my experience). If it's of any help, you can also find the TSS, which may give some indication of where Pol binds. Also, many genes have alternative promoters and TSSs that need to be taken into account.

Sorry if all of this is old news, just trying to throw some ideas out there. Wish I could be of more help. It sounds like you have an ambitious project in mind. I would be interested in hearing the results, especially on the regulatory side.

Regards,
Johnathon

malachig 10-06-2010 10:25 PM

Those are three pretty big questions. Promoters, splice sites, and splice regulatory elements.

Promoters. I agree with jdanderson that it depends what you mean by promoters. The 'regulation' tracks available in the UCSC genome browser contain many relevant data sets. As mentioned, one strategy is simply to use transcription start sites themselves as an indicator of where promoters likely reside. A second option is to use preexisting experimental data such as the results of RNA-PolII binding assays or epigenetic profiling by ChIP-Seq. For example, various histone modifications (methylation, acetylation) are associated with transcript initiation and these have been profiled for various tissues. Third, bioinformatic prediction of promoter elements is a huge field in itself. Have you considered cisRED: "databases of genome-wide regulatory module and element predictions"? Fourth, if you want to download a list of high quality annotated regulatory elements and their coordinates I would recommend ORegAnno.

Splice sites. Again, a huge area of research. There is a wide array of gene discovery and splice site prediction tools that will examine a sequence of genomic DNA and tell you the coordinates of possible splice sites. As others have mentioned, it is probably a lot easier to use the exon-exon connections currently present in known transcript models (which are largely based on full-length cDNA sequencing followed by gapped alignment to a reference genome). For example, to get a comprehensive list of splice sites you could use the UCSC Table Browser. Download the gene tables for UCSC Genes, CCDS, Ensembl, RefSeq, MGC, and Vega in BED format. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.
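
A minimal sketch of that scripting task in Python, assuming the tables were exported as 12-column BED (with the blockSizes and blockStarts columns). The function names and the toy records are made up for illustration; real use would pass open file handles for the downloaded files instead of the in-memory strings. A site is recorded as the 0-based coordinate of the intron boundary, so the same junction seen in several gene sets collapses to one entry.

```python
import io


def splice_sites_from_bed12(handle):
    """Yield (chrom, position, strand, kind) for each intron boundary."""
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.rstrip("\n").split("\t")
        chrom, tx_start, strand = f[0], int(f[1]), f[5]
        # blockSizes/blockStarts may carry a trailing comma
        sizes = [int(x) for x in f[10].split(",") if x]
        starts = [int(x) for x in f[11].split(",") if x]
        exons = [(tx_start + s, tx_start + s + n) for s, n in zip(starts, sizes)]
        # Each pair of consecutive exons flanks one intron:
        # [left_end, right_start) in 0-based, half-open coordinates.
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            if strand == "+":
                yield (chrom, left_end, strand, "donor")
                yield (chrom, right_start, strand, "acceptor")
            else:
                yield (chrom, left_end, strand, "acceptor")
                yield (chrom, right_start, strand, "donor")


def merge_splice_sites(handles):
    """Return the sorted, non-redundant set of sites across all inputs."""
    sites = set()
    for handle in handles:
        sites.update(splice_sites_from_bed12(handle))
    return sorted(sites)


# Toy stand-ins for two downloaded gene tables (hypothetical records)
table_1 = io.StringIO(
    "chr1\t100\t500\ttxA\t0\t+\t100\t500\t0\t3\t50,100,100,\t0,150,300,\n"
)
table_2 = io.StringIO(
    "chr1\t100\t350\ttxB\t0\t+\t100\t350\t0\t2\t50,100,\t0,150,\n"
    "chr2\t1000\t2000\ttxC\t0\t-\t1000\t2000\t0\t2\t200,300,\t0,700,\n"
)
sites = merge_splice_sites([table_1, table_2])
for site in sites:
    print(*site, sep="\t")
```

Because the sites go into a set, the first intron shared by txA and txB is reported once.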

Splice regulatory elements. This is arguably the most challenging of the three, and an area of very active research. Simply put, the regulatory elements that influence splicing beyond the splice sites themselves, i.e. exonic splicing silencers and enhancers (ESSs, ESEs) and intronic splicing silencers and enhancers (ISSs, ISEs), are not well known. The recent advent of RNA-seq technology should allow us to start performing the experiments needed to characterize these sequences. To learn more about these elements and how they are defined I would recommend 'Mechanisms of alternative pre-messenger RNA splicing' by Douglas Black. Some labs with recent publications on the topic of discovering the splicing regulatory code are those of Christopher Burge, Robert Darnell, and Benjamin Blencowe.

steven 10-07-2010 06:21 AM

A quick and dirty way, all from the UCSC Table Browser: group="Genes and gene prediction tracks", output format="BED".

Promoter (kind of): select "Upstream by" = 500 or 1 kb or whatever you want. Of course, consider the limitations described in the posts above.
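
If you would rather compute the window yourself from a gene BED file, here is a minimal Python sketch of the same idea. It is not the Table Browser's own code; the function name is made up, and it assumes 0-based, half-open BED coordinates with the TSS at chromStart for "+" genes and at chromEnd for "-" genes.

```python
def upstream_window(chrom, start, end, strand, size=500):
    """Return a BED-style interval covering `size` bases upstream of the TSS."""
    if strand == "+":
        # TSS is the left edge; upstream lies at lower coordinates.
        return chrom, max(0, start - size), start, strand
    # On the minus strand the TSS is the right edge; upstream lies at
    # higher coordinates. Not clamped to the chromosome length here,
    # since that needs a chromosome sizes table.
    return chrom, end, end + size, strand


plus_promoter = upstream_window("chr1", 1000, 2000, "+")
minus_promoter = upstream_window("chr1", 1000, 2000, "-", size=1000)
near_chrom_start = upstream_window("chr1", 100, 900, "+")
```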

Splice sites: select "Introns" and extract their extremities once downloaded (or send it to Galaxy from the previous screen to do it online). I find it easier to get splice sites from introns than from exons: no need to filter out TSSs and polyA sites.
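
A minimal Python sketch of the "extract their extremities" step, assuming the intron download is a 6-column BED (chrom, start, end, name, score, strand). The function name is made up, and the 2-base width is an assumption chosen to cover the canonical GT/AG dinucleotides; widen it if you want flanking positions as well.

```python
def intron_extremities(bed_lines, width=2):
    """Yield unique (chrom, start, end, kind, strand) intervals for the
    first and last `width` bases of each intron."""
    seen = set()
    for line in bed_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skips blank lines and track/browser headers
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
        if end - start < 2 * width:
            continue  # intron too short to hold two separate ends
        left = (chrom, start, start + width)
        right = (chrom, end - width, end)
        # The donor (5' splice site) is the left end on "+", the right end on "-".
        donor, acceptor = (left, right) if strand == "+" else (right, left)
        for kind, (c, s, e) in (("donor", donor), ("acceptor", acceptor)):
            record = (c, s, e, kind, strand)
            if record not in seen:  # the same intron appears once per transcript
                seen.add(record)
                yield record


# Toy intron records (hypothetical); the second duplicates the first
introns = [
    "chr1\t150\t250\tintron1\t0\t+\n",
    "chr1\t150\t250\tintron1_dup\t0\t+\n",
    "chr2\t1000\t1100\tintron2\t0\t-\n",
]
ends = list(intron_extremities(introns))
```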

jdanderson 10-07-2010 11:22 AM

Hello All,

Wow, great last couple of posts. I was especially intrigued by the mention of the two databases (which I was not familiar with), very interesting. Sounds like there is a lot of interesting work being done by the people in here.

Somewhat of a side note, you could look at promoter proximal introns which can help regulate expression rates. I don't think many of these motifs are well characterized, although there is an open source algorithm (IMEter) to search for these motifs (somewhat well validated) if you are interested in a set of particular genes/transcripts. See:

Rose, A.B., Elfersi, T., Parra, G., and Korf, I. (2008). Promoter-proximal introns in Arabidopsis thaliana are enriched in dispersed signals that elevate gene expression. Plant Cell.

Farquharson, K.L. (2008). The IMEter Predicts an Intron's Ability to Boost Gene Expression. Plant Cell.

Cheers,
Johnathon

bioinfosm 10-07-2010 01:37 PM

@malachig, thanks for the very useful post and resources!

cliff 10-07-2010 03:25 PM

wow...just noticed the latest responses.. Thanks very much for your suggestions and comments, especially to SVL, Johnathon , malachig, and steven!!!

You guys are awesome!

sindrle 11-18-2013 02:03 PM

Quote:

Originally Posted by malachig (Post 26636)
Download in BED format the gene table for UCSC genes, CCDS, Ensembl, Refseq, MGC, and Vega. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.

So, how to do that? :P

GenoMax 11-18-2013 02:27 PM

Have a look at the Table Browser tutorial: http://genome.ucsc.edu/goldenPath/he...ablesHelp.html In the end you will want to select BED as the output format.

You can get the UCSC, CCDS, RefSeq, Ensembl, VEGA, MGC genes by choosing the right tables to query against.

That can be followed by BEDTools intersectBed (or another appropriate option): http://bedtools.readthedocs.org/en/l...intersect.html

sindrle 11-18-2013 05:23 PM

Ok, I downloaded all you said and ran this:

bedtools intersect -wo -bed -a file1 -b file2 > out1

But in the end the output file is 20 GB...

Tried this instead:

unionBedGraphs - file1 -file2 etc

But it gave an error:

Assertion failed: (!queue.empty()), function ConsumeNextCoordinate, file unionBedGraphs.cpp, line 99.
/usr/bin/unionBedGraphs: line 2: 21166 Abort trap: 6 ${0%/*}/bedtools unionbedg "$@"
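
A likely reason for the size, as far as I understand the tool: `bedtools intersect -wo` writes one line for every overlapping pair of A and B records, so a gene with several transcripts in both files produces (records in A) x (records in B) lines. The unionBedGraphs tool, for its part, is meant for sorted BedGraph files rather than gene BED files. The Python sketch below only illustrates the pairwise blow-up on toy data; it is a naive stand-in, not bedtools itself, and the function name is made up.

```python
def pairwise_overlaps(a_records, b_records):
    """Report every overlapping (A, B) pair with its overlap length,
    the way a pairwise intersection report would."""
    pairs = []
    for a_chrom, a_start, a_end in a_records:
        for b_chrom, b_start, b_end in b_records:
            if a_chrom != b_chrom:
                continue
            overlap = min(a_end, b_end) - max(a_start, b_start)
            if overlap > 0:
                pairs.append(((a_chrom, a_start, a_end), (b_chrom, b_start, b_end), overlap))
    return pairs


# Five transcripts of one gene in file A, four in file B (toy coordinates)
file_a = [("chr1", 100, 500)] * 5
file_b = [("chr1", 120, 480)] * 4

pairs = pairwise_overlaps(file_a, file_b)
unique_intervals = set(file_a) | set(file_b)
print(len(pairs), len(unique_intervals))  # 20 pairs vs 2 distinct intervals
```

For a non-redundant list, concatenating the files and keeping the unique records avoids the pairwise report altogether.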

