Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Annotating a BED file for UTRs -- which data to use?

    I've got a BED of intervals that I need to annotate for 3' or 5' UTRs. I've been trying to work with biomart and UCSC table browser to produce a relevant GFF/GTF/BED file containing the annotated UTR regions, but have not managed to get what I want yet.

    Does anyone know how to use either UCSC/biomart to get this information, or perhaps another way? Any recommendations useful!

  • #2
    The BEDOPS suite contains the bedmap application, which is perfect for annotation tasks involving BED files.

    The bedmap application takes three inputs:
    1. Operators
    2. Reference BED file (e.g., your intervals)
    3. Map BED file (e.g., your regions denoting "UTR" elements)


    One or more operations are applied on elements of the Map file that overlap elements in the Reference file by one or more bases. These operations can calculate a numerical result, or summarize ID, range or other information about mapped elements.

    I'll show you below how you can use bedmap with the --echo-map-id operator to grab UTR IDs.

    As an example, let's say your intervals are the following regions, adjacent 2000 bp windows that are stored in the tab-delimited file Intervals.bed:

    $ more Intervals.bed
    chr1 10000 12000
    chr1 12000 14000
    chr1 14000 16000
    ...


    Let's say your UTRs are stored in a file called UTRs.bed.

    (If you don't have this yet, you can grab these for the entire genome from the UCSC Genome Browser's table browser, through the Genes and Gene Prediction Tracks group and the knownGene table. Change the output format to BED and click get output, selecting the desired UTR category. Save the resulting BED file to your file system.)

    For hg19, here are a few of the first 5' UTRs over the genome, taken from the UCSC Genome Browser:

    $ more UTRs.bed
    chr1 11873 12189 uc010nxq.1_utr5_0_0_chr1_11874_f 0 +
    chr1 14361 14829 uc009vis.3_utr5_0_0_chr1_14362_r 0 -
    chr1 14969 15038 uc009vis.3_utr5_1_0_chr1_14970_r 0 -
    chr1 15795 15942 uc009vis.3_utr5_2_0_chr1_15796_r 0 -
    chr1 16606 16765 uc009vis.3_utr5_3_0_chr1_16607_r 0 -
    ...


    We can find the IDs of the UTRs that overlap our Intervals.bed elements with the following command:

    $ bedmap --echo --echo-map-id --delim '\t' Intervals.bed UTRs.bed > UTR_IDs_overlapping_Intervals.bed

    Let's take a look at the results:

    $ more UTR_IDs_overlapping_Intervals.bed
    chr1 10000 12000 uc010nxq.1_utr5_0_0_chr1_11874_f
    chr1 12000 14000
    chr1 14000 16000 uc009vis.3_utr5_0_0_chr1_14362_r;uc009vis.3_utr5_1_0_chr1_14970_r;uc009vis.3_utr5_2_0_chr1_15796_r
    ...


    In other words, the UTR uc010nxq.1_utr5_0_0_chr1_11874_f overlaps interval chr1:10000-12000, no UTRs overlap the interval chr1:12000-14000, three UTRs overlap the interval chr1:14000-16000 and so on...

    Let's explain what the command did. We used three operators with this application of bedmap:
    1. --echo
    2. --echo-map-id
    3. --delim


    The --echo operator prints each of the interval elements.

    The --echo-map-id operator prints each of the UTR IDs that overlap the specified interval element. If there are multiple IDs, they are delimited with a semi-colon.

    The --delim operator separates the --echo and --echo-map-id results with a tab character. This allows the output to remain a relaxed, three-column UCSC BED file, so you can process this with any downstream tools that take in BED data (like bedmap, bedops, or other BEDOPS utilities).

    You don't have to just grab IDs from the UTR file. If you want the entire UTR element, use --echo-map. If you want just the UTR regions, use --echo-map-range.

    There are several other --echo-map-* operators which summarize different information from mapped elements. Check out the Echo section of the bedmap documentation for more detail.

    Note: BEDOPS utilities run very fast and use very little memory, compared with alternative toolkits which do not yet use our design optimizations. This is because we require that BED inputs are sorted (at most, only once), which adds structure that we can take advantage of. Data you get through the UCSC Genome Browser will be sorted. However, you might need to sort data in your intervals file, if you do not know its sort status. This is easy and relatively quick, making use of the BEDOPS sort-bed tool:

    $ sort-bed Intervals.bed > Sorted_Intervals.bed

    Again, this only needs to be done once. BEDOPS tools read and write sorted data, so any results from, say, bedmap do not need any further sorting before using results with downstream tools.
    Last edited by AlexReynolds; 02-21-2013, 01:47 PM.

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Advancing Precision Medicine for Rare Diseases in Children
      by seqadmin




      Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
      12-16-2024, 07:57 AM
    • seqadmin
      Recent Advances in Sequencing Technologies
      by seqadmin



      Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

      Long-Read Sequencing
      Long-read sequencing has seen remarkable advancements,...
      12-02-2024, 01:49 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 12-17-2024, 10:28 AM
    0 responses
    22 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 12-13-2024, 08:24 AM
    0 responses
    42 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 12-12-2024, 07:41 AM
    0 responses
    28 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 12-11-2024, 07:45 AM
    0 responses
    42 views
    0 likes
    Last Post seqadmin  
    Working...
    X