  • Retrieving promoter sequences using gene symbol

    Hi all,

    I've been working on this for a few days and don't seem to be getting anywhere.

    I have a list of gene symbols (example list below) that I need to use to retrieve promoter sequences for a promoter analysis. Basically, I want to use each gene symbol to locate its promoter (I expect some genes will have several) and then use that location to retrieve 1000 nucleotides upstream and maybe 200-500 nt downstream of the promoter.

    The two main strategies I have tried were:

    1. Download the extracted promoter sequences from the UCSC download site. Convert each gene symbol to a RefSeq ID. Match my gene list to the promoters in the pre-compiled FASTA of promoter sequences.
    PROBLEM: At least some of my RefSeq IDs don't seem to be found in this pre-compiled promoter sequence dataset.

    2. Use my GTF annotation file to select promoter coordinates from my gene symbols.
    PROBLEM: My UCSC GTF files don't appear to contain 5' UTR or whole-transcript intervals (only exon and intron intervals). My annotation file does have the RefSeq NM_00xxxxx ID though, so I could retrieve those, but where do I find transcript intervals from that? And I only want the primary promoter for each transcript.

    If it is helpful, I can program in Python - I just need specific help with the direction.

    Thanks for the help guys. I really appreciate it.

    Paul

    Apologies if this is a repost - I've seen MANY similar posts, but nothing that I've found particularly helpful (that didn't lead to a dead end).

    Example list of gene symbols for which I need promoter sequences:
    Snrpd2
    Snrpe
    Snrpg
    Snrpn
    Snx11
    Socs1
    Sod1
    Sox11
    Sox12
    Sox4
    Sphk1
    Spin2c
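
A minimal Python sketch for strategy 2, assuming a GTF whose exon lines carry a `transcript_id` attribute (the RefSeq NM_ accession in UCSC-style files) plus a `gene_name` or `gene_id` attribute. The transcript interval is taken as the span from the leftmost exon start to the rightmost exon end, and the transcription start site (TSS) is the start of that span on the + strand or its end on the - strand. The function names, the toy records, and the default window (1000 nt upstream and 200 nt downstream, counting the TSS as the first downstream base) are illustrative choices, not part of any library.

```python
import re
from collections import namedtuple

Transcript = namedtuple("Transcript", "transcript_id gene chrom start end strand")

_TX_RE = re.compile(r'transcript_id "([^"]+)"')
_GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')
_GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def transcript_intervals(gtf_lines):
    """Collapse exon records into one 1-based, inclusive interval per transcript."""
    spans = {}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "exon":
            continue
        tx = _TX_RE.search(fields[8])
        if tx is None:
            continue
        gene = _GENE_NAME_RE.search(fields[8]) or _GENE_ID_RE.search(fields[8])
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # Key on chromosome and strand too, so an accession that maps to
        # several loci is not merged into one oversized interval.
        key = (tx.group(1), chrom, strand)
        if key in spans:
            old = spans[key]
            spans[key] = old._replace(start=min(old.start, start),
                                      end=max(old.end, end))
        else:
            spans[key] = Transcript(tx.group(1), gene.group(1) if gene else None,
                                    chrom, start, end, strand)
    return list(spans.values())


def promoter_window(tx, upstream=1000, downstream=200):
    """Return (chrom, start, end, strand) around the TSS, 1-based inclusive."""
    if tx.strand == "+":
        tss = tx.start
        lo, hi = tss - upstream, tss + downstream - 1
    else:
        tss = tx.end
        lo, hi = tss - downstream + 1, tss + upstream
    return tx.chrom, max(1, lo), hi, tx.strand


# Toy records with made-up accessions and coordinates, for illustration only.
EXAMPLE_GTF = [
    'chr1\trefGene\texon\t1001\t1100\t.\t+\t.\tgene_id "Sod1"; transcript_id "NM_A";\n',
    'chr1\trefGene\texon\t1500\t1600\t.\t+\t.\tgene_id "Sod1"; transcript_id "NM_A";\n',
    'chr2\trefGene\texon\t4000\t4200\t.\t-\t.\tgene_id "Sox4"; transcript_id "NM_B";\n',
    'chr2\trefGene\texon\t4800\t5000\t.\t-\t.\tgene_id "Sox4"; transcript_id "NM_B";\n',
]

if __name__ == "__main__":
    for tx in transcript_intervals(EXAMPLE_GTF):
        print(tx.gene, tx.transcript_id, promoter_window(tx))
```

Filtering the returned transcripts by `gene` against the symbol list gives one window per transcript; choosing a single "primary" promoter per gene still needs a rule of your own, for example the most upstream TSS.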

  • #2
    Paul: Have you tried BioMart from Ensembl? You can find some help and videos on this page.



    • #3
      Using R...

      Ref_annotations is your GFF file, imported with the function import.gff2 (with asRangedData=FALSE).
      Ref_genome is your genome, imported with read.DNAStringSet (renamed readDNAStringSet in later Biostrings releases).

      The following code should give you the starting base of the first annotated exon of each gene

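      Code:
      # Keep only annotations on sequences that are present in the genome
      B <- Ref_annotations[which(seqnames(Ref_annotations) %in% names(Ref_genome))]
      # Plus strand: the gene start is the leftmost base of each gene's range
      C <- B[which(strand(B) == "+")]
      f <- as.factor(elementMetadata(C)$gene_name)
      rg <- split(C, f)
      rh <- unlist(range(rg))
      end(rh) <- start(rh)
      names(rh) <- levels(f)
      D <- rh
      # Minus strand: the gene start is the rightmost base of each gene's range
      C <- B[which(strand(B) == "-")]
      f <- as.factor(elementMetadata(C)$gene_name)
      rg <- split(C, f)
      rh <- unlist(range(rg))
      start(rh) <- end(rh)
      names(rh) <- levels(f)
      E <- rh
      # Combine both strands and sort by position
      # (named tss rather than F, because F is R's shorthand for FALSE)
      tss <- sort(c(D, E))

      Then you can export tss as a BED file (function export.bed).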

      Hope it helps...
      Last edited by SylvainL; 07-21-2016, 06:52 AM.
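
Once a start position per gene is known, the remaining step is cutting the sequence out of the genome. A rough Python sketch of that step follows, assuming a 1-based TSS, a genome small enough to hold in memory as a dict of strings, and a window of `upstream` bases before the TSS plus `downstream` bases starting at the TSS. All function names here are hypothetical. If the TSS comes from a BED file, keep in mind that BED starts are 0-based, so for a 1-bp interval the 1-based position is the value in the BED end column.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N, either case)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(handle):
    """Parse FASTA into {name: sequence}; name is the first word of the header."""
    genome, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


def promoter_sequence(genome, chrom, tss, strand, upstream=1000, downstream=200):
    """Return the promoter in transcript orientation (5' to 3')."""
    chrom_seq = genome[chrom]
    if strand == "+":
        lo, hi = tss - upstream, tss + downstream - 1
    else:
        lo, hi = tss - downstream + 1, tss + upstream
    # Clip the window to the chromosome ends.
    lo, hi = max(1, lo), min(len(chrom_seq), hi)
    seq = chrom_seq[lo - 1:hi]  # convert 1-based inclusive to a Python slice
    return seq if strand == "+" else reverse_complement(seq)
```

Minus-strand promoters are reverse-complemented so that every returned sequence reads upstream-to-downstream relative to its own gene, which is what most motif-scanning tools expect. For a full mammalian genome, an indexed FASTA reader would be a better fit than loading everything into a dict.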



      • #4
        Hi GenoMax and SylvainL,

        Thanks so much for your suggestions and time! They were both very helpful.

        For anyone who comes across this post later - I strongly urge you to familiarize yourself with biomaRt. It's a powerful tool for extracting all kinds of useful information.
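
For readers who would rather stay in Python, a similar BioMart lookup can be sketched as an XML query for the Ensembl BioMart web service. This is only a sketch under stated assumptions: the dataset name `mmusculus_gene_ensembl` (the symbols in this thread look like mouse genes), the filter and attribute names, and the `martservice` URL are taken from memory of the public Ensembl BioMart and should be checked against the BioMart web interface before use. The code only builds the query and the request URL; it does not contact the server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint of the public Ensembl BioMart service; verify before use.
MARTSERVICE_URL = "https://www.ensembl.org/biomart/martservice"

# Assumed attribute names; verify them in the BioMart web interface.
ATTRIBUTES = (
    "external_gene_name",
    "ensembl_transcript_id",
    "chromosome_name",
    "strand",
    "transcription_start_site",
)


def biomart_query_xml(symbols, dataset="mmusculus_gene_ensembl"):
    """Build a BioMart XML query asking for the TSS of each gene symbol."""
    query = ET.Element(
        "Query",
        virtualSchemaName="default",
        formatter="TSV",
        header="0",
        uniqueRows="1",
        datasetConfigVersion="0.6",
    )
    ds = ET.SubElement(query, "Dataset", name=dataset, interface="default")
    # Several filter values are passed as one comma-separated string.
    ET.SubElement(ds, "Filter", name="external_gene_name", value=",".join(symbols))
    for attr in ATTRIBUTES:
        ET.SubElement(ds, "Attribute", name=attr)
    return ET.tostring(query, encoding="unicode")


def biomart_request_url(symbols, dataset="mmusculus_gene_ensembl"):
    """Return a GET URL carrying the query; fetching it is left to the caller."""
    return MARTSERVICE_URL + "?" + urlencode({"query": biomart_query_xml(symbols, dataset)})


if __name__ == "__main__":
    print(biomart_query_xml(["Snrpd2", "Sod1", "Sox4"]))
```

Each row of the tab-separated result would then give a chromosome, strand, and TSS per transcript, which is enough to compute an upstream/downstream window for every promoter of a gene.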
