SEQanswers

Old 10-02-2010, 10:20 AM   #1
cliff
Member
 
Location: USA

Join Date: Oct 2009
Posts: 41
Default where to find coordinates for promoter, splice site, splice regulatory site?

Hi, All

I am looking for the coordinates of promoters, splice sites, and splice regulatory sites for a whole-genome sequencing project. I want to know the SNP distribution in those regions. Does anyone here have experience with that?

Thanks
Old 10-03-2010, 02:06 AM   #2
svl
Member
 
Location: Netherlands

Join Date: Sep 2009
Posts: 43
Default

One option (but perhaps not the most complete) is to use the Ensembl SnpEffectPredictor script:
ftp://ftp.ensembl.org/pub/misc-scrip...predictor_1.0/

There is also an online version (mirror site, because their main site is down):
http://useast.ensembl.org/tools.html

Both will annotate your variants (e.g. chr:1 123456 A/T) with their effect (e.g. NON_SYNONYMOUS, but also SPLICE_SITE and ESSENTIAL_SPLICE_SITE).

This is the list with types it will return:
http://useast.ensembl.org/info/docs/...ion/index.html

/svl
Old 10-03-2010, 09:49 AM   #3
cliff
Member
 
Location: USA

Join Date: Oct 2009
Posts: 41
Default

Thanks! Do you know of any script or online tool that can handle data downloaded from UCSC? I am concerned about possible conflicts in format or coordinates between Ensembl and UCSC.
Old 10-03-2010, 05:24 PM   #4
jdanderson
Member
 
Location: Davis, CA

Join Date: Sep 2010
Posts: 45
Default

Hello Cliff,

I don't know if this will help, but you can go to the UCSC Genome Browser, open the Tables section, and play with the settings to get a semi-customizable list of coordinates (e.g. TSS or exon coordinates), then send them to Galaxy or download them in a few different formats.

Hope this helps.

Regards,
Johnathon
Old 10-05-2010, 10:39 AM   #5
cliff
Member
 
Location: USA

Join Date: Oct 2009
Posts: 41
Default

Hi, Johnathon

Thanks for your response! I know we can get exon coordinates from the UCSC Table Browser. Do you know where we can get coordinates for promoters and splice sites?

-C
Old 10-05-2010, 11:31 AM   #6
jdanderson
Member
 
Location: Davis, CA

Join Date: Sep 2010
Posts: 45
Default

Hello Cliff,

Well, I guess I figured the exon boundaries would be the de facto splice sites (a script could parse the data for you). As for the promoter, that seems like a tough call. Even if by promoter you mean the core (~-35 bp) and/or the proximal promoter (~-250 to -300 bp), not all genes are well characterized in this fashion, to my knowledge (some have a TATA box, some have CpG islands, depending on the type of gene). If by promoter you mean to include enhancer regions (as is sometimes the case in common usage), this is even less well characterized, and they can sit as far out as ~-100,000 bp (100 kb) (and transcription factor prediction programs aren't much help in my experience). If it's of any help, you can also find the TSS, which may give some indication of where the polymerase binds. Also, many genes have alternative promoters and TSSs that need to be taken into account.
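
To make the "a script could parse the data" idea concrete, here is a minimal sketch. It assumes the gene table was exported as BED12 (one transcript per line, with the blockSizes and blockStarts columns filled in); the function name and the sample line are made up for illustration. It reports only the internal exon boundaries, i.e. the gaps between consecutive exons, so the transcript start and end are not mistaken for splice sites.

```python
def splice_junctions_from_bed12(line):
    """Return (chrom, intron_start, intron_end, strand) for one BED12 line.

    Coordinates stay in BED convention: 0-based start, end exclusive.
    """
    f = line.rstrip("\n").split("\t")
    chrom, tx_start, strand = f[0], int(f[1]), f[5]
    # blockSizes / blockStarts are comma-separated and may end with a comma.
    sizes = [int(x) for x in f[10].split(",") if x]
    starts = [int(x) for x in f[11].split(",") if x]
    exons = [(tx_start + s, tx_start + s + z) for s, z in zip(starts, sizes)]
    junctions = []
    # Each gap between two consecutive exons is one intron.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        junctions.append((chrom, left_end, right_start, strand))
    return junctions


# Hypothetical transcript with three exons.
example = "chr1\t1000\t2000\ttx1\t0\t+\t1000\t2000\t0\t3\t100,200,300,\t0,400,700,"
for junction in splice_junctions_from_bed12(example):
    print(junction)
```

A single-exon transcript simply yields an empty list, since it has no internal boundaries.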

Sorry if all of this is old news, just trying to throw some ideas out there. Wish I could be of more help. It sounds like you have an ambitious project in mind. I would be interested in hearing the results, especially on the regulatory side.

Regards,
Johnathon
Old 10-06-2010, 10:25 PM   #7
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 116
Default

Those are three pretty big questions: promoters, splice sites, and splice regulatory elements.

Promoters. I agree with jdanderson that it depends on what you mean by promoters. The 'regulation' tracks available in the UCSC Genome Browser contain many relevant data sets. As mentioned, one strategy is simply to use the transcription start sites themselves as an indicator of where promoters likely reside. A second option is to use preexisting experimental data, such as the results of RNA Pol II binding assays or epigenetic profiling by ChIP-Seq. For example, various histone modifications (methylation, acetylation) are associated with transcription initiation, and these have been profiled for various tissues. Third, bioinformatic prediction of promoter elements is a huge field in itself. Have you considered cisRED: "databases of genome-wide regulatory module and element predictions"? Fourth, if you want to download a list of high-quality annotated regulatory elements and their coordinates, I would recommend ORegAnno.

Splice sites. Again, a huge area of research. There is a wide array of gene discovery and splice site prediction tools that will examine a sequence of genomic DNA and tell you the coordinates of possible splice sites. As others have mentioned, it is probably a lot easier to use the exon-exon connections currently present in known transcript models (which are largely based on full-length cDNA sequencing followed by gapped alignment to a reference genome). For example, to get a comprehensive list of splice sites you could use the UCSC Table Browser. Download in BED format the gene tables for UCSC genes, CCDS, Ensembl, RefSeq, MGC, and Vega. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.
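
The merging step could look roughly like the sketch below. It assumes the splice sites have already been pulled out of each downloaded gene table as (chrom, position, strand) tuples; the function name, the source labels, and the coordinates are invented for illustration. Identical sites from different tables collapse to one entry, and the tables that support each site are kept alongside it.

```python
from collections import defaultdict


def merge_splice_sites(sources):
    """Collapse splice sites from several gene tables into a non-redundant list.

    sources: dict mapping a table name to an iterable of
             (chrom, position, strand) tuples.
    Returns a sorted list of (site, sorted list of supporting table names).
    """
    support = defaultdict(set)
    for name, sites in sources.items():
        for site in sites:
            support[site].add(name)
    # Keys are unique, so sorting the items orders them by site only.
    return [(site, sorted(names)) for site, names in sorted(support.items())]


# Made-up sites from two hypothetical tables.
sources = {
    "refseq": [("chr1", 1100, "+"), ("chr1", 1400, "+")],
    "ensembl": [("chr1", 1100, "+"), ("chr1", 1650, "+")],
}
for site, names in merge_splice_sites(sources):
    print(site, names)
```

Keeping the supporting tables makes it easy to restrict the analysis later, for example to sites seen in at least two annotation sets.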

Splice regulatory elements. This is arguably the most challenging of the three, and an area of very active research. Simply put, the regulatory elements that influence splicing beyond the splice sites themselves, i.e. exonic splicing silencers and enhancers (ESSs, ESEs) and intronic splicing silencers and enhancers (ISSs, ISEs), are not well known. The recent advent of RNA-seq technology should let us start performing the experiments needed to characterize these sequences. To learn more about these elements and how they are defined, I would recommend 'Mechanisms of alternative pre-messenger RNA splicing' by Douglas Black. Some labs with recent publications on the topic of discovering the splicing regulatory code are those of Christopher Burge, Robert Darnell, and Benjamin Blencowe.
Old 10-07-2010, 06:21 AM   #8
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

A quick and dirty way, all from the UCSC Tables, group="Genes and gene prediction tracks", output format="BED".

Promoter (kind of): select "Upstream by" = 500 bp, 1 kb, or whatever you want. Of course, consider the limitations described in the posts above.
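
If you would rather compute the upstream window yourself from a gene table, a sketch under simple assumptions follows: BED-style coordinates (0-based start, end exclusive), with the TSS taken as the transcript start on the plus strand and the transcript end on the minus strand. The function name and the default of 500 bp are illustrative choices only.

```python
def upstream_window(chrom, tx_start, tx_end, strand, size=500):
    """Return a BED-style window of `size` bp upstream of the TSS."""
    if strand == "+":
        # TSS is at tx_start; upstream means smaller coordinates.
        start, end = max(0, tx_start - size), tx_start
    else:
        # TSS is at tx_end; upstream means larger coordinates.
        # Note: not clipped to the chromosome length, which is unknown here.
        start, end = tx_end, tx_end + size
    return chrom, start, end, strand


print(upstream_window("chr1", 1000, 2000, "+"))
print(upstream_window("chr1", 1000, 2000, "-"))
```

The strand check is the important part: taking "start minus 500" for every gene silently puts the window inside or downstream of minus-strand genes.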

Splice sites: select "Introns" and extract their extremities once downloaded (or send the result to Galaxy from the previous screen to do it online). I find it easier to get splice sites from introns than from exons, since there is no need to filter out TSSs and polyA sites.
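
Extracting the extremities of each intron is a few lines of script. The sketch below assumes one intron per record in BED coordinates (0-based start, end exclusive) and treats the first and last 2 bp of the intron as the splice sites; the function name and the width of 2 are assumptions for illustration, so widen the window if you also want flanking positions.

```python
def intron_to_splice_sites(chrom, start, end, strand, width=2):
    """Return the donor and acceptor intervals for one intron (BED coordinates)."""
    left = (chrom, start, start + width)
    right = (chrom, end - width, end)
    # On the minus strand the 5' (donor) end of the intron is its right edge.
    if strand == "+":
        donor, acceptor = left, right
    else:
        donor, acceptor = right, left
    return {"donor": donor, "acceptor": acceptor}


print(intron_to_splice_sites("chr1", 1100, 1400, "+"))
print(intron_to_splice_sites("chr1", 1100, 1400, "-"))
```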
Old 10-07-2010, 11:22 AM   #9
jdanderson
Member
 
Location: Davis, CA

Join Date: Sep 2010
Posts: 45
Default

Hello All,

Wow, great last couple of posts. I was especially intrigued by the mention of the two databases (which I was not familiar with), very interesting. Sounds like there is a lot of interesting work being done by the people in here.

Somewhat of a side note, you could look at promoter proximal introns which can help regulate expression rates. I don't think many of these motifs are well characterized, although there is an open source algorithm (IMEter) to search for these motifs (somewhat well validated) if you are interested in a set of particular genes/transcripts. See:

Rose, A.B., Elfersi, T., Parra, G., and Korf, I. (2008). Promoter-proximal introns in Arabidopsis thaliana are enriched in dispersed signals that elevate gene expression. Plant Cell.

Farquharson, K.L. (2008). The IMEter Predicts an Intron's Ability to Boost Gene Expression. Plant Cell.

Cheers,
Johnathon
Old 10-07-2010, 01:37 PM   #10
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

@malachig, thanks for the very useful post and resources!
Old 10-07-2010, 03:25 PM   #11
cliff
Member
 
Location: USA

Join Date: Oct 2009
Posts: 41
Default

Wow... just noticed the latest responses. Thanks very much for your suggestions and comments, especially to svl, Johnathon, malachig, and steven!!!

You guys are awesome!
Old 11-18-2013, 02:03 PM   #12
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Quote:
Originally Posted by malachig View Post
Download in BED format the gene table for UCSC genes, CCDS, Ensembl, Refseq, MGC, and Vega. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.
So, how to do that? :P
Old 11-18-2013, 02:27 PM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,404
Default

Quote:
Originally Posted by sindrle View Post
So, how to do that? :P
Have a look at the Table Browser tutorial: http://genome.ucsc.edu/goldenPath/he...ablesHelp.html You will finally want to select data in BED format for output.

You can get the UCSC, CCDS, RefSeq, Ensembl, VEGA, MGC genes by choosing the right tables to query against.

That can be followed by BEDTools intersectBed (or an appropriate other option): http://bedtools.readthedocs.org/en/l...intersect.html
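
For the SNP question that started the thread, the overlap step itself is simple enough to sketch by hand. The example below is only an illustration of the idea, not how BEDTools works internally; it assumes SNP positions are 0-based and regions are in BED coordinates (end exclusive), and the function name and data are made up.

```python
from bisect import bisect_left


def count_snps_in_regions(snp_positions, regions):
    """Count SNPs falling inside each region.

    snp_positions: dict mapping chrom to a list of 0-based SNP positions.
    regions: list of (chrom, start, end) in BED coordinates.
    Returns one count per region, in the same order.
    """
    sorted_pos = {c: sorted(p) for c, p in snp_positions.items()}
    counts = []
    for chrom, start, end in regions:
        pos = sorted_pos.get(chrom, [])
        # Positions p with start <= p < end.
        counts.append(bisect_left(pos, end) - bisect_left(pos, start))
    return counts


snps = {"chr1": [1050, 1100, 1101, 1399, 1400]}
regions = [("chr1", 1100, 1102), ("chr1", 1398, 1400), ("chr2", 0, 10)]
print(count_snps_in_regions(snps, regions))
```

Counting per region like this keeps the output at one line per region, rather than one line per overlapping pair of records.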
Old 11-18-2013, 05:23 PM   #14
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Ok, I downloaded all you said and ran this:

bedtools intersect -wo -bed -a file1 -b file2 > out1

But at the end the output file is 20gb...

Tried this instead:

unionBedGraphs - file1 -file2 etc

But gave error:

Assertion failed: (!queue.empty()), function ConsumeNextCoordinate, file unionBedGraphs.cpp, line 99.
/usr/bin/unionBedGraphs: line 2: 21166 Abort trap: 6 ${0%/*}/bedtools unionbedg "$@"