
07-19-2016, 03:54 PM   #1
Location: Melbourne

Join Date: Jun 2014
Posts: 14
Retrieving promoter sequences using gene symbol

Hi all,

I've been working on this for a few days and don't seem to be getting anywhere.

I have a list of gene symbols (example list below) that I need to use to retrieve promoter sequences for a promoter analysis. Basically I want to use each gene symbol to identify its promoter (I expect some genes will have several) and then use that location to retrieve 1000 nucleotides upstream and maybe 200-500 nt downstream of the transcription start site (TSS).

The two main strategies I have tried were:

1. Download the pre-extracted promoter sequences from the UCSC download site, convert my gene symbols to RefSeq IDs, and match my gene list against the pre-compiled FASTA of promoter sequences.
PROBLEM: At least some of my RefSeq IDs don't seem to be found in this pre-compiled promoter sequence dataset.

2. Use my GTF annotation file to select promoter coordinates from my gene symbols.
PROBLEM: My UCSC GTF files don't appear to contain 5' UTR or whole-transcript intervals (only exon and intron intervals). My annotation file does have the RefSeq NM_00xxxxx IDs, so I could retrieve those, but where do I find transcript intervals from them? I also only want the primary promoter for each transcript.
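For strategy 2, the transcript interval does not have to be in the file: it can be rebuilt from the exon records, since the TSS is the lowest exon start on the + strand and the highest exon end on the - strand. A minimal Python sketch, assuming a standard 9-column GTF with a transcript_id attribute; the function name and the NM_EXAMPLE IDs are made up for illustration. One caveat: if the same transcript ID is annotated at more than one locus, this simple version merges them, so such IDs would need to be checked separately.

```python
import re

def tss_from_gtf_lines(lines):
    """Return {transcript_id: (chrom, strand, tss)} using 1-based GTF coordinates.

    Only 'exon' records are used. The TSS is taken as the lowest exon start
    for '+' transcripts and the highest exon end for '-' transcripts.
    """
    spans = {}  # transcript_id -> [chrom, strand, min_start, max_end]
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        tid = match.group(1)
        if tid not in spans:
            spans[tid] = [chrom, strand, start, end]
        else:
            spans[tid][2] = min(spans[tid][2], start)
            spans[tid][3] = max(spans[tid][3], end)
    return {
        tid: (chrom, strand, lo if strand == "+" else hi)
        for tid, (chrom, strand, lo, hi) in spans.items()
    }

# Hypothetical records, not taken from a real annotation file.
example_gtf = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "NM_EXAMPLE1"; transcript_id "NM_EXAMPLE1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "NM_EXAMPLE1"; transcript_id "NM_EXAMPLE1";',
    'chr2\tsrc\texon\t500\t600\t.\t-\t.\tgene_id "NM_EXAMPLE2"; transcript_id "NM_EXAMPLE2";',
    'chr2\tsrc\texon\t700\t800\t.\t-\t.\tgene_id "NM_EXAMPLE2"; transcript_id "NM_EXAMPLE2";',
]
example_tss = tss_from_gtf_lines(example_gtf)
```

With a real file, the same function can be fed an open file handle, e.g. `tss_from_gtf_lines(open("annotation.gtf"))`, where the file name is a placeholder.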

If it is helpful, I can program in python - I just need specific help with the direction.

Thanks for the help guys. I really appreciate it.


Apologies if this is a repost - I've seen MANY similar posts, but nothing that I've found particularly helpful (that didn't lead to a dead end).

Example list of gene symbols for which I need promoter sequences:

pkstarstorm05
07-20-2016, 04:18 AM   #2
Senior Member
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076

Paul: Have you tried BioMart from Ensembl? You can find some help and videos on this page.
GenoMax
07-21-2016, 06:49 AM   #3
Senior Member
Location: Geneva

Join Date: Feb 2012
Posts: 179

Using R...

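Ref_annotations is your GFF file, which you have to import using the rtracklayer function import.gff2 (with asRangedData=FALSE, so that you get a GRanges object)
Ref_genome is your genome, imported using the Biostrings function read.DNAStringSet (renamed readDNAStringSet in newer Biostrings releases)

The following code should give you the starting base of the first annotated exon of each gene, i.e. an approximate TSS: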

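library(rtracklayer)  # also attaches GenomicRanges

# Keep only annotations on sequences that are present in the genome
B <- Ref_annotations[which(seqnames(Ref_annotations) %in% names(Ref_genome))]

# Plus strand: the TSS is the smallest start coordinate of each gene
C <- B[which(strand(B) == "+")]
f <- as.factor(elementMetadata(C)$gene_name)
rg <- split(C,f)
rh <- unlist(range(rg))
end(rh) <- start(rh)    # collapse each gene range to its first base
names(rh) <- levels(f)
D <- rh

# Minus strand: the TSS is the largest end coordinate of each gene
C <- B[which(strand(B) == "-")]
f <- as.factor(elementMetadata(C)$gene_name)
rg <- split(C,f)
rh <- unlist(range(rg))
start(rh) <- end(rh)    # collapse each gene range to its last base
names(rh) <- levels(f)
E <- rh

# Combine both strands and sort by position
F <- sort(c(D, E))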
Then you can export F as a BED file (function export.bed from rtracklayer)
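The positions above are single bases, so the window asked for in the first post (1000 nt upstream, a few hundred downstream) still has to be built around them, and it must flip for minus-strand genes. Since Python was mentioned, here is a rough sketch of that step, assuming a 1-based TSS and a chromosome sequence already loaded as a plain string; the function names and default sizes are illustrative choices, not part of any library.

```python
# Translation table for complementing DNA (case preserved, N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def promoter_window(tss, strand, upstream=1000, downstream=200):
    """Return a 0-based, half-open (start, end) interval, as used in BED.

    `tss` is 1-based. The window holds `upstream` bases before the TSS and
    `downstream` bases starting at the TSS itself, in transcript orientation.
    The start is clamped at 0; the end is not clamped to the chromosome length.
    """
    if strand == "+":
        start = tss - 1 - upstream
        end = tss - 1 + downstream
    elif strand == "-":
        start = tss - downstream
        end = tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end

def promoter_sequence(chrom_seq, tss, strand, upstream=1000, downstream=200):
    """Slice the window out of `chrom_seq` and return it 5'->3' on the gene's strand."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    seq = chrom_seq[start:end]
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
    return seq

# Toy 16-base "chromosome" to show that both strands give the same orientation.
toy = "AAAACCCCGGGGTTTT"
plus_example = promoter_sequence(toy, tss=9, strand="+", upstream=4, downstream=2)
minus_example = promoter_sequence(toy, tss=8, strand="-", upstream=4, downstream=2)
```

The (start, end) pair from `promoter_window` can be written straight into columns 2 and 3 of a BED line if you would rather extract the sequences with another tool.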

Hope it helps...

Last edited by SylvainL; 07-21-2016 at 06:52 AM.
SylvainL
07-31-2016, 02:25 PM   #4
Location: Melbourne

Join Date: Jun 2014
Posts: 14

Hi GenoMax and SylvainL,

Thanks so much for your suggestions and time! They were both very helpful.

For anyone who comes across this post later: I strongly urge you to familiarize yourself with biomaRt. It's a powerful tool for extracting all kinds of useful information.
pkstarstorm05

gene symbol, promoter, transcript interval
