  • Annotating a BED file for UTRs -- which data to use?

    I've got a BED of intervals that I need to annotate for 3' or 5' UTRs. I've been trying to work with biomart and UCSC table browser to produce a relevant GFF/GTF/BED file containing the annotated UTR regions, but have not managed to get what I want yet.

    Does anyone know how to use either UCSC or BioMart to get this information, or perhaps another way? Any recommendations would be useful!

  • #2
    The BEDOPS suite contains the bedmap application, which is perfect for annotation tasks involving BED files.

    The bedmap application takes three inputs:
    1. Operators
    2. Reference BED file (e.g., your intervals)
    3. Map BED file (e.g., your regions denoting "UTR" elements)


    One or more operations are applied on elements of the Map file that overlap elements in the Reference file by one or more bases. These operations can calculate a numerical result, or summarize ID, range or other information about mapped elements.

    I'll show you below how you can use bedmap with the --echo-map-id operator to grab UTR IDs.

    As an example, let's say your intervals are the following regions, adjacent 2000 bp windows that are stored in the tab-delimited file Intervals.bed:

    $ more Intervals.bed
    chr1 10000 12000
    chr1 12000 14000
    chr1 14000 16000
    ...


    Let's say your UTRs are stored in a file called UTRs.bed.

    (If you don't have this yet, you can grab these for the entire genome from the UCSC Genome Browser's table browser, through the Genes and Gene Prediction Tracks group and the knownGene table. Change the output format to BED and click get output, selecting the desired UTR category. Save the resulting BED file to your file system.)

    For hg19, here are the first few 5' UTRs in the genome, taken from the UCSC Genome Browser:

    $ more UTRs.bed
    chr1 11873 12189 uc010nxq.1_utr5_0_0_chr1_11874_f 0 +
    chr1 14361 14829 uc009vis.3_utr5_0_0_chr1_14362_r 0 -
    chr1 14969 15038 uc009vis.3_utr5_1_0_chr1_14970_r 0 -
    chr1 15795 15942 uc009vis.3_utr5_2_0_chr1_15796_r 0 -
    chr1 16606 16765 uc009vis.3_utr5_3_0_chr1_16607_r 0 -
    ...


    We can find the IDs of the UTRs that overlap our Intervals.bed elements with the following command:

    $ bedmap --echo --echo-map-id --delim '\t' Intervals.bed UTRs.bed > UTR_IDs_overlapping_Intervals.bed

    Let's take a look at the results:

    $ more UTR_IDs_overlapping_Intervals.bed
    chr1 10000 12000 uc010nxq.1_utr5_0_0_chr1_11874_f
    chr1 12000 14000 uc010nxq.1_utr5_0_0_chr1_11874_f
    chr1 14000 16000 uc009vis.3_utr5_0_0_chr1_14362_r;uc009vis.3_utr5_1_0_chr1_14970_r;uc009vis.3_utr5_2_0_chr1_15796_r
    ...


    In other words, the UTR uc010nxq.1_utr5_0_0_chr1_11874_f (chr1:11873-12189) straddles position 12000, so it overlaps both chr1:10000-12000 and chr1:12000-14000; three UTRs overlap the interval chr1:14000-16000; and so on...
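To make the "overlaps by one or more bases" rule concrete, here is a rough awk sketch that applies the same test to the sample rows above. It is only an illustration of the overlap criterion for 0-based, half-open BED coordinates, not how bedmap is implemented: bedmap exploits sorted input, while this loop compares every interval against every UTR. The scratch directory and the `overlaps` variable are names made up for this example.

```shell
# Scratch copies of the sample rows shown above (temporary, hypothetical paths).
workdir=$(mktemp -d)
printf 'chr1\t10000\t12000\nchr1\t12000\t14000\nchr1\t14000\t16000\n' > "$workdir/Intervals.bed"
printf '%s\t%s\t%s\t%s\t0\t%s\n' \
  chr1 11873 12189 uc010nxq.1_utr5_0_0_chr1_11874_f + \
  chr1 14361 14829 uc009vis.3_utr5_0_0_chr1_14362_r - \
  chr1 14969 15038 uc009vis.3_utr5_1_0_chr1_14970_r - \
  chr1 15795 15942 uc009vis.3_utr5_2_0_chr1_15796_r - \
  chr1 16606 16765 uc009vis.3_utr5_3_0_chr1_16607_r - > "$workdir/UTRs.bed"

# Two half-open elements on the same chromosome share at least one base
# when each one starts before the other one ends.
overlaps=$(awk 'BEGIN { FS = OFS = "\t" }
  # First file (UTRs): remember every element.
  NR == FNR { chrom[NR] = $1; start[NR] = $2 + 0; stop[NR] = $3 + 0; id[NR] = $4; n = NR; next }
  # Second file (intervals): join the IDs of all overlapping UTRs with ";".
  {
    ids = ""
    for (i = 1; i <= n; i++)
      if (chrom[i] == $1 && start[i] < $3 + 0 && stop[i] > $2 + 0)
        ids = (ids == "" ? id[i] : ids ";" id[i])
    print $1, $2, $3, ids
  }' "$workdir/UTRs.bed" "$workdir/Intervals.bed")

printf '%s\n' "$overlaps"
```

On these rows, the first UTR is reported for both of the first two windows because it spans position 12000, and the third window collects three IDs. For real data, bedmap remains the right tool.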

    Let's explain what the command did. We used three operators with this application of bedmap:
    1. --echo
    2. --echo-map-id
    3. --delim


    The --echo operator prints each of the interval elements.

    The --echo-map-id operator prints each of the UTR IDs that overlap the specified interval element. If there are multiple IDs, they are delimited with a semi-colon.

    The --delim operator separates the --echo and --echo-map-id results with a tab character. This keeps the output a tab-delimited BED file (the three interval columns followed by an ID column), so you can process it with any downstream tools that take in BED data (like bedmap, bedops, or other BEDOPS utilities).
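If a downstream step prefers one ID per row rather than a semicolon-delimited list, the fourth column can be unpacked with standard tools. The following is a minimal sketch, assuming output shaped like the bedmap result above; the `chrX` row is a made-up interval with no overlapping UTR, and the scratch directory and `one_per_row` variable are names invented for this example.

```shell
# Scratch copy of bedmap-style output: interval columns, then a ";"-joined ID column.
workdir=$(mktemp -d)
{
  printf 'chr1\t10000\t12000\tuc010nxq.1_utr5_0_0_chr1_11874_f\n'
  printf 'chrX\t100\t200\t\n'   # hypothetical interval with no overlapping UTR
  printf 'chr1\t14000\t16000\tuc009vis.3_utr5_0_0_chr1_14362_r;uc009vis.3_utr5_1_0_chr1_14970_r;uc009vis.3_utr5_2_0_chr1_15796_r\n'
} > "$workdir/UTR_IDs_overlapping_Intervals.bed"

# Emit one row per (interval, ID) pair and skip intervals without any ID.
one_per_row=$(awk 'BEGIN { FS = OFS = "\t" }
  $4 != "" {
    n = split($4, ids, ";")
    for (i = 1; i <= n; i++) print $1, $2, $3, ids[i]
  }' "$workdir/UTR_IDs_overlapping_Intervals.bed")

printf '%s\n' "$one_per_row"
```

Here the three input rows become four output rows: one for the first interval, none for the empty one, and three for the last. Each row stays a four-column BED line.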

    You don't have to just grab IDs from the UTR file. If you want the entire UTR element, use --echo-map. If you want just the UTR regions, use --echo-map-range.

    There are several other --echo-map-* operators which summarize different information from mapped elements. Check out the Echo section of the bedmap documentation for more detail.

    Note: BEDOPS utilities run very fast and use very little memory, compared with alternative toolkits which do not yet use our design optimizations. This is because we require sorted BED inputs (a file needs sorting at most once), which adds structure that we can take advantage of. Data you get through the UCSC Genome Browser will be sorted. However, if you do not know the sort status of your intervals file, you may need to sort it. This is easy and relatively quick with the BEDOPS sort-bed tool:

    $ sort-bed Intervals.bed > Sorted_Intervals.bed

    Again, this only needs to be done once. BEDOPS tools read and write sorted data, so any results from, say, bedmap do not need any further sorting before using results with downstream tools.
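If you are unsure whether a file is already sorted, a rough check is possible with POSIX `sort -c` before deciding to re-sort. This sketch assumes that sort-bed's ordering corresponds to a byte-wise sort on chromosome followed by numeric sorts on start and end (`LC_ALL=C sort -k1,1 -k2,2n -k3,3n`); treat that equivalence as an assumption and prefer sort-bed itself for real data. The `is_sorted` helper and the scratch paths are hypothetical names for this example.

```shell
# Scratch copy of the sample intervals, deliberately out of order.
workdir=$(mktemp -d)
printf 'chr1\t14000\t16000\nchr1\t10000\t12000\nchr1\t12000\t14000\n' > "$workdir/Intervals.bed"

# Succeeds (exit status 0) only if the file is already in the assumed order.
is_sorted() {
  LC_ALL=C sort -c -k1,1 -k2,2n -k3,3n "$1" 2>/dev/null
}

if is_sorted "$workdir/Intervals.bed"; then status_before=sorted; else status_before=unsorted; fi

# Stand-in for: sort-bed Intervals.bed > Sorted_Intervals.bed
LC_ALL=C sort -k1,1 -k2,2n -k3,3n "$workdir/Intervals.bed" > "$workdir/Sorted_Intervals.bed"

if is_sorted "$workdir/Sorted_Intervals.bed"; then status_after=sorted; else status_after=unsorted; fi

echo "before: $status_before, after: $status_after"
```

The check only reads the file, so it is cheap to run on inputs of unknown origin before handing them to bedmap.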
    Last edited by AlexReynolds; 02-21-2013, 01:47 PM.
