#1 |
Member
Location: USA Join Date: Oct 2009
Posts: 41
|
![]()
Hi, All
I am looking for the coordinates of promoters, splice sites, and splice regulatory sites for a whole-genome sequencing project, because I want to know the SNP distribution in those regions. Does anyone here have experience with that? Thanks |
#2 |
Member
Location: Netherlands Join Date: Sep 2009
Posts: 43
|
![]()
One option (though perhaps not the most complete) is the Ensembl SnpEffectPredictor script:
ftp://ftp.ensembl.org/pub/misc-scrip...predictor_1.0/
There is also an online version (a mirror site, because their main site is down): http://useast.ensembl.org/tools.html
Both will annotate your variants (e.g. chr:1 123456 A/T) with their effect (e.g. NON_SYNONYMOUS, but also SPLICE_SITE and ESSENTIAL_SPLICE_SITE).
This is the list of types it will return: http://useast.ensembl.org/info/docs/...ion/index.html
/svl |
#3 | |
Member
Location: USA Join Date: Oct 2009
Posts: 41
|
![]() Quote:
|
|
#4 |
Member
Location: Davis, CA Join Date: Sep 2010
Posts: 45
|
![]()
Hello Cliff,
I don't know if this will help, but you can go to the UCSC Genome Browser, open the Table Browser (under the TABLE header), and play with the settings to get a semi-customizable list of coordinates, e.g. TSSs or exon coordinates, and have them sent to Galaxy or downloaded in a few different formats. Hope this helps. Regards, Johnathon |
#5 | |
Member
Location: USA Join Date: Oct 2009
Posts: 41
|
![]() Quote:
Thanks for your response! I know we can get exon coordinates from the UCSC Table Browser. Do you know where we can get coordinates for promoters and splice sites? -C |
|
#6 |
Member
Location: Davis, CA Join Date: Sep 2010
Posts: 45
|
![]()
Hello Cliff,
Well, I guess I figured the exon boundaries would be the de facto splice sites (a script could parse the data for you).
As for the promoter, that seems like a tough call. Even if by promoter you mean the core promoter (~35 bp upstream of the TSS) and/or the proximal promoter (~250-300 bp upstream), not all genes are well characterized in this fashion, to my knowledge (some have a TATA box, some a CpG island, depending on the type of gene). If by promoter you mean to include enhancer regions (as is sometimes the case in common usage), this is even less well characterized, and enhancers can lie up to ~100,000 bp (100 kb) away (and transcription factor prediction programs aren't much help in my experience).
If it's of any help, you can also find the TSS, which may give some indication of where the polymerase binds. Also, many genes have alternative promoters and TSSs that need to be taken into account.
Sorry if all of this is old news, just trying to throw some ideas out there. Wish I could be of more help. It sounds like you have an ambitious project in mind. I would be interested in hearing the results, especially on the regulatory side. Regards, Johnathon |
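To make the "use the TSS as a promoter proxy" idea concrete, here is a minimal sketch of a strand-aware window around the TSS. The function name `promoter_window` and the default window sizes are assumptions for illustration only; coordinates are assumed to be 0-based, half-open, as in BED, and the window is not clipped to the chromosome length.

```python
def promoter_window(tx_start, tx_end, strand, upstream=300, downstream=0):
    """Return (start, end) of a window around the TSS in 0-based, half-open
    coordinates. tx_start/tx_end are the transcript's BED start and end.
    Hypothetical helper: the 300 bp default is only a placeholder."""
    if strand == "+":
        # On the plus strand the TSS is the first base of the transcript.
        start = max(0, tx_start - upstream)
        end = tx_start + downstream
    elif strand == "-":
        # On the minus strand the TSS is the last base (tx_end - 1),
        # so "upstream" means higher coordinates.
        start = max(0, tx_end - downstream)
        end = tx_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return start, end


if __name__ == "__main__":
    print(promoter_window(1000, 5000, "+"))
    print(promoter_window(1000, 5000, "-"))
```

Genes with alternative TSSs would simply contribute one window per annotated transcript start; whether overlapping windows should then be merged depends on the analysis.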
#7 |
Senior Member
Location: WashU Join Date: Aug 2010
Posts: 117
|
![]()
Those are three pretty big questions: promoters, splice sites, and splice regulatory elements.
Promoters. I agree with jdanderson that it depends on what you mean by promoters. The 'regulation' tracks available in the UCSC Genome Browser contain many relevant data sets. As mentioned, one strategy is simply to use the transcription start sites themselves as an indicator of where promoters likely reside. A second option is to use preexisting experimental data, such as the results of RNA Pol II binding assays or epigenetic profiling by ChIP-Seq. For example, various histone modifications (methylation, acetylation) are associated with transcription initiation, and these have been profiled for various tissues. Third, bioinformatic prediction of promoter elements is a huge field in itself. Have you considered cisRED ("databases of genome-wide regulatory module and element predictions")? Fourth, if you want to download a list of high-quality annotated regulatory elements and their coordinates, I would recommend ORegAnno.
Splice sites. Again, a huge area of research. There is a wide array of gene discovery and splice site prediction tools that will examine a sequence of genomic DNA and tell you the coordinates of possible splice sites. As others have mentioned, it is probably a lot easier to use the exon-exon connections currently present in known transcript models (which are largely based on full-length cDNA sequencing followed by gapped alignment to a reference genome). For example, to get a comprehensive list of splice sites you could use the UCSC Table Browser. Download in BED format the gene tables for UCSC genes, CCDS, Ensembl, RefSeq, MGC, and Vega. Merging these BED files and extracting the non-redundant set of splice sites for all exons is a relatively straightforward scripting task.
Splice regulatory elements. This is arguably the most challenging of the three, and an area of very active research. Simply put, the regulatory elements that influence splicing beyond the splice sites themselves, i.e. exonic splicing silencers and enhancers (ESSs, ESEs) and intronic splicing silencers and enhancers (ISSs, ISEs), are not well known. The recent advent of RNA-seq technology should allow us to really start performing the experiments needed to characterize these sequences. To learn more about these elements and how they are defined, I would recommend 'Mechanisms of alternative pre-messenger RNA splicing' by Douglas Black. Some labs with recent publications on discovering the splicing regulatory code are those of Christopher Burge, Robert Darnell, and Benjamin Blencowe. |
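The "merge the BED files and extract the non-redundant splice sites" step might look roughly like the sketch below. It assumes the gene tables were saved as 12-column BED (with blockSizes and blockStarts columns); the function name `splice_sites_from_bed12` is made up for illustration, and each site is reported as the 0-based exon/intron boundary coordinate.

```python
def splice_sites_from_bed12(lines):
    """Collect a non-redundant, sorted list of (chrom, boundary, strand, kind)
    from BED12 lines. Assumption: columns 11 and 12 hold comma-separated
    block sizes and block starts relative to chromStart."""
    sites = set()
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue  # skip blank lines and header lines
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a BED12 record
        chrom, tx_start, strand = f[0], int(f[1]), f[5]
        sizes = [int(x) for x in f[10].split(",") if x]
        starts = [int(x) for x in f[11].split(",") if x]
        exons = [(tx_start + s, tx_start + s + n) for s, n in zip(starts, sizes)]
        # Each pair of consecutive exons defines one intron
        # spanning [left_end, right_start).
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            if strand == "+":
                donor, acceptor = left_end, right_start
            else:
                donor, acceptor = right_start, left_end
            sites.add((chrom, donor, strand, "donor"))
            sites.add((chrom, acceptor, strand, "acceptor"))
    return sorted(sites)


if __name__ == "__main__":
    import sys
    # Usage sketch: python splice_sites.py ucsc.bed refseq.bed ensembl.bed
    merged = []
    for path in sys.argv[1:]:
        with open(path) as handle:
            merged.extend(handle)
    for chrom, pos, strand, kind in splice_sites_from_bed12(merged):
        print(chrom, pos, strand, kind, sep="\t")
```

Because the sites go into a set, a junction shared by several transcripts or several annotation sources is reported once. Only internal exon boundaries are used, so transcript starts and ends are never mistaken for splice sites.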
#8 |
Senior Member
Location: Southern France Join Date: Aug 2009
Posts: 269
|
![]()
A quick and dirty way, all from the UCSC Tables, group="Genes and gene prediction tracks", output format="BED".
Promoter (kind of): select "Upstream by" = 500 or 1 kb or whatever you want. Of course, consider the limitations described in the posts above.
Splice sites: select "Introns" and extract their extremities once downloaded (or send it to Galaxy from the previous screen to do it online). I find it easier to get splice sites from introns than from exons: no need to filter out TSSs and polyA sites. |
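Extracting the intron extremities after download could be sketched as follows. This assumes the "Introns" output is one 6-column BED line per intron; the function name `intron_extremities` and the 2-base `flank` default (the intronic dinucleotides at each end) are illustrative choices, not something prescribed in this thread.

```python
def intron_extremities(bed_lines, flank=2):
    """Return a set of (chrom, start, end, kind, strand) intervals covering
    the first and last `flank` bases of each intron (0-based, half-open).
    Duplicate introns from different transcripts collapse automatically."""
    sites = set()
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue  # skip blank lines and header lines
        f = line.rstrip("\n").split("\t")
        if len(f) < 6:
            continue  # strand column is required
        chrom, start, end, strand = f[0], int(f[1]), int(f[2]), f[5]
        if end - start < 2 * flank:
            continue  # intron too short to have two distinct ends
        left = (start, start + flank)
        right = (end - flank, end)
        # The donor (5' splice site) is at the left end on the plus strand
        # and at the right end on the minus strand.
        donor, acceptor = (left, right) if strand == "+" else (right, left)
        sites.add((chrom, donor[0], donor[1], "donor", strand))
        sites.add((chrom, acceptor[0], acceptor[1], "acceptor", strand))
    return sites


if __name__ == "__main__":
    example = ["chr1\t200\t300\tintronA\t0\t+", "chr2\t50\t90\tintronC\t0\t-"]
    for site in sorted(intron_extremities(example)):
        print(*site, sep="\t")
```

A larger `flank` would widen the window into the intron at each end if the broader splice region is of interest.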
#9 |
Member
Location: Davis, CA Join Date: Sep 2010
Posts: 45
|
![]()
Hello All,
Wow, great last couple of posts. I was especially intrigued by the mention of the two databases (which I was not familiar with), very interesting. Sounds like there is a lot of interesting work being done by the people in here.
Somewhat of a side note: you could look at promoter-proximal introns, which can help regulate expression rates. I don't think many of these motifs are well characterized, although there is an open-source algorithm (IMEter) to search for these motifs (somewhat well validated) if you are interested in a particular set of genes/transcripts. See:
Rose, A.B., Elfersi, T., Parra, G., and Korf, I. (2008) Promoter-proximal introns in Arabidopsis thaliana are enriched in dispersed signals that elevate gene expression. Plant Cell.
Farquharson, K.L. (2008) The IMEter Predicts an Intron's Ability to Boost Gene Expression. Plant Cell.
Cheers, Johnathon |
#11 |
Member
Location: USA Join Date: Oct 2009
Posts: 41
|
![]()
Wow... just noticed the latest responses. Thanks very much for your suggestions and comments, especially to SVL, Johnathon, malachig, and steven!!!
You guys are awesome! |
#12 |
Senior Member
Location: Norway Join Date: Aug 2013
Posts: 266
|
![]() |
#13 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,088
|
![]()
Have a look at the Table Browser tutorial: http://genome.ucsc.edu/goldenPath/he...ablesHelp.html
In the end you will want to select BED as the output format.
You can get the UCSC, CCDS, RefSeq, Ensembl, VEGA, and MGC genes by choosing the right tables to query against. That can be followed by BEDTools intersectBed (or another appropriate option): http://bedtools.readthedocs.org/en/l...intersect.html |
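If the end goal is the SNP distribution per region rather than a list of every overlap, a small pure-Python sketch like the one below yields one count per region. It is an alternative to the intersect step, not a reimplementation of BEDTools; the function name `count_snps_per_region` and the tuple layouts are assumptions for illustration, with 0-based, half-open regions and 0-based SNP positions.

```python
from bisect import bisect_left
from collections import defaultdict


def count_snps_per_region(regions, snps):
    """regions: iterable of (chrom, start, end, name), 0-based half-open.
    snps: iterable of (chrom, pos) with 0-based positions.
    Returns {name: number of SNPs with start <= pos < end}.
    Assumption: region names are unique; duplicates would overwrite."""
    by_chrom = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()  # sorted positions allow binary search per region

    counts = {}
    for chrom, start, end, name in regions:
        positions = by_chrom.get(chrom, [])
        # Number of positions below `end` minus number below `start`.
        counts[name] = bisect_left(positions, end) - bisect_left(positions, start)
    return counts


if __name__ == "__main__":
    regions = [("chr1", 700, 1000, "promoterA"), ("chr1", 200, 202, "donorA")]
    snps = [("chr1", 699), ("chr1", 700), ("chr1", 999), ("chr1", 201)]
    print(count_snps_per_region(regions, snps))
```

The output size is bounded by the number of regions, which keeps it small even when many SNPs or many overlapping annotations are involved.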
#14 |
Senior Member
Location: Norway Join Date: Aug 2013
Posts: 266
|
![]()
OK, I downloaded everything you mentioned and ran this:
bedtools intersect -wo -bed -a file1 -b file2 > out1
But the resulting output file is 20 GB...
I tried this instead:
unionBedGraphs - file1 -file2 etc
But it gave this error:
Assertion failed: (!queue.empty()), function ConsumeNextCoordinate, file unionBedGraphs.cpp, line 99.
/usr/bin/unionBedGraphs: line 2: 21166 Abort trap: 6 ${0%/*}/bedtools unionbedg "$@" |