
  • question about UCSC bed file description

    Hi all,

    I want to extract, for each gene in hg19, the coordinates from 2000 bp upstream of the TSS to the end of the first exon. So I downloaded the BED file from UCSC. The file looks like this:
    Code:
    track name="tb_refGene" description="table browser query on refGene" visibility=3 url= 
    chr1	66999824	67210768	NM_032291	0	+	67000041	67208778	0	25	227,64,25,72,57,55,176,12,12,25,52,86,93,75,501,128,127,60,112,156,133,203,65,165,2013,	0,91705,98928,101802,105635,108668,109402,126371,133388,136853,137802,139139,142862,145536,147727,155006,156048,161292,185152,195122,199606,205193,206516,207130,208931,
    chr1	33546713	33586132	NM_052998	0	+	33547850	33585783	0	12	182,121,212,177,174,173,135,166,163,113,215,488,	0,275,488,1065,2841,10937,12169,13435,15594,16954,36789,38931,
    chr1	33546713	33586132	NM_001293562	0	+	33547850	33585783	0	11	182,118,177,174,173,135,166,163,113,215,488,	0,278,1065,2841,10937,12169,13435,15594,16954,36789,38931,
    chr1	25071759	25170815	NM_013943	0	+	25072044	25167428	0	6	357,110,126,107,182,3552,	0,52473,68825,81741,94591,95504,
    chr1	48998526	50489626	NM_032785	0	-	48999844	50489468	0	14	1439,27,97,163,153,112,115,90,40,217,95,125,123,192,	0,2035,6787,54149,57978,101638,120482,130297,334336,512729,712915,1164458,1318541,1490908,
    chr1	16767166	16786584	NM_001145277	0	+	16767256	16785491	0	7	182,101,105,82,109,178,1248,	0,2960,7198,7388,8421,11166,18170,
    chr1	16767166	16786584	NM_001145278	0	+	16767256	16785385	0	8	104,101,105,82,109,178,76,1248,	0,2960,7198,7388,8421,11166,15146,18170,
    chr1	16767166	16786584	NM_018090	0	+	16767256	16785385	0	8	182,101,105,82,109,178,76,1248,	0,2960,7198,7388,8421,11166,15146,18170,
    chr1	8378144	8404227	NM_001080397	0	+	8378168	8404073	0	9	102,421,93,225,728,154,177,206,421,	0,6221,7213,7733,12124,17352,19731,21408,25662,
    chr1	92145899	92351836	NM_001195683	0	-	92149295	92327088	0	17	3515,108,42,121,300,159,141,153,335,190,148,169,184,138,185,174,402,	0,15329,17746,28320,31900,35893,36225,38969,39550,41612,47316,49462,54433,78270,116944,181128,205535,
    chr1	92145899	92351836	NR_036634	0	-	92351836	92351836	0	18	3515,108,42,121,300,159,141,153,338,190,148,169,184,138,185,97,174,402,	0,15329,17746,28320,31900,35893,36225,38969,39550,41612,47316,49462,54433,78270,116944,120616,181128,205535,
    chr1	92145899	92351836	NM_003243	0	-	92149295	92327088	0	17	3515,108,42,121,300,159,141,153,338,190,148,169,184,138,185,174,402,	0,15329,17746,28320,31900,35893,36225,38969,39550,41612,47316,49462,54433,78270,116944,181128,205535,
    chr1	92145899	92371559	NM_001195684	0	-	92149295	92327088	0	18	3515,108,42,121,300,159,141,153,335,190,148,169,184,138,185,174,61,177,	0,15329,17746,28320,31900,35893,36225,38969,39550,41612,47316,49462,54433,78270,116944,181128,219294,225483,
    chr1	100652477	100715409	NM_001918	0	-	100661810	100715376	0	11	9501,72,192,78,167,217,122,182,76,124,84,	0,19308,19523,23772,27895,29061,31704,43811,48514,53839,62848,
    chr1	175913961	176176380	NM_022457	0	-	175914288	176176114	0	20	345,45,161,125,118,117,82,109,144,136,115,58,77,60,69,120,77,98,60,673,	0,2369,42117,43462,44536,82746,98360,98884,101355,136326,140950,171798,190184,191662,204180,218043,218989,231084,239807,261746,
    chr1	184356149	184598155	NM_030806	0	+	184446643	184588690	0	6	353,218,95,77,61,9504,	0,90370,120572,203723,211385,232502,
    chr1	150980972	151008189	NM_021222	0	+	150981108	151006710	0	8	175,93,203,185,159,95,159,1908,	0,9315,9970,16114,17018,18736,20289,25309,
    Can someone tell me the meaning of each column? I found a description of BED files in the UCSC Table Browser, but it doesn't seem to match the file I downloaded.

    Code:
    bin	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	score	name2	cdsStartStat	cdsEndStat	exonFrames
    138	NM_016166	chr15	+	68346571	68480404	68346664	68480173	14	68346571,68378643,68434283,68434627,68438153,68438903,68445927,68457068,68466069,68467974,68468811,68473549,68475967,68479879,	68346688,68379088,68434368,68434675,68438244,68439038,68446033,68457142,68466230,68468105,68468992,68473692,68476005,68480404,	0	PIAS1	cmpl	cmpl	0,0,1,2,2,0,0,1,0,2,1,2,1,0,
    636	NM_016162	chr12	-	6759703	6772308	6760360	6772267	8	6759703,6760489,6761436,6761827,6762101,6762395,6765892,6772230,	6760400,6760551,6761584,6761933,6762216,6762562,6765964,6772308,	0	ING4	cmpl	cmpl	2,0,2,1,0,1,1,0,
    1314	NM_016156	chr11	-	95566043	95657371	95568453	95657118	15	95566043,95569311,95571257,95574780,95578116,95580877,95582837,95583763,95590715,95591694,95595155,95595435,95598764,95621319,95 ...	95568615,95569488,95571371,95574873,95578323,95581063,95583026,95583913,95590799,95591796,95595266,95595530,95598840,95621425,95 ...	0	MTMR2	cmpl	cmpl	0,0,0,0,0,0,0,0,0,0,0,1,0,2,0,
    858	NM_006739	chr22	+	35796115	35820495	35796431	35820248	17	35796115,35796423,35799199,35799406,35802545,35804400,35806736,35808502,35809867,35811821,35812331,35812630,35813736,35815876,35 ...	35796201,35796598,35799326,35799535,35802718,35804556,35806903,35808674,35809979,35811965,35812397,35812807,35813849,35816005,35 ...	0	MCM5	cmpl	cmpl	-1,0,2,0,0,2,2,1,2,0,0,0,0,2,2,1,0,
    1761	NM_032385	chr5	-	154198051	154230213	154199875	154217738	9	154198051,154200819,154202031,154202946,154210360,154214184,154214402,154217690,154230042,	154200032,154200986,154202137,154203152,154210482,154214288,154214494,154217738,154230213,	0	FAXDC2	cmpl	cmpl	2,0,2,0,1,2,0,0,-1,
    587	NM_021034	chr11	-	319672	320914	319837	320813	2	319672,320564,	319990,320914,	0	IFITM3	cmpl	cmpl	0,0,
    1162	NM_020932	chrX	+	75648045	75651746	75648323	75651197	1	75648045,	75651746,	0	MAGEE1	cmpl	cmpl	0,
    13	NM_020929	chr11	-	40135750	41481186	40135919	40137842	5	40135750,40162350,40341177,40669691,41480980,	40137884,40162403,40341271,40669828,41481186,	0	LRRC4C	cmpl	cmpl	0,-1,-1,-1,-1,
    1352	NM_021029	chrX	+	100645877	100651142	100645923	100650736	5	100645877,100646446,100646742,100650322,100650715,	100646034,100646552,100646810,100650445,100651142,	0	RPL36A	cmpl	cmpl	0,0,1,0,0,
    1688	NM_032378	chr8	-	144661866	144679845	144661961	144672251	10	144661866,144662181,144662675,144663223,144663398,144668388,144668898,144671160,144672777,144679517,	144662000,144662376,144662897,144663324,144663498,144668460,144669022,144672251,144672908,144679845,	0	EEF1D	cmpl	cmpl	0,0,0,1,0,0,2,0,-1,-1,
    thanks
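For reference, the downloaded rows appear to follow the standard 12-column BED layout (chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts), while the second listing is the column schema of the refGene table itself, which would explain the mismatch. The following is a minimal Python sketch of how those 12 columns could be read, using the NM_052998 row from the first listing. The names `Bed12`, `parse_bed12`, `exons` and `first_exon` are made up for illustration and are not part of any UCSC tool.

```python
from typing import NamedTuple


class Bed12(NamedTuple):
    chrom: str          # chromosome name
    chrom_start: int    # transcript start, 0-based
    chrom_end: int      # transcript end, exclusive
    name: str           # accession, e.g. NM_052998
    score: str
    strand: str         # "+" or "-"
    thick_start: int    # coding region start
    thick_end: int      # coding region end
    item_rgb: str
    block_count: int    # number of exons
    block_sizes: list   # exon lengths
    block_starts: list  # exon starts, relative to chrom_start


def parse_bed12(line):
    """Split one tab-separated BED12 line into typed fields."""
    f = line.rstrip("\n").split("\t")
    # The comma-separated lists end with a trailing comma, so drop empty items.
    sizes = [int(x) for x in f[10].split(",") if x]
    starts = [int(x) for x in f[11].split(",") if x]
    return Bed12(f[0], int(f[1]), int(f[2]), f[3], f[4], f[5],
                 int(f[6]), int(f[7]), f[8], int(f[9]), sizes, starts)


def exons(rec):
    """Absolute (start, end) of every exon, in genomic order."""
    return [(rec.chrom_start + s, rec.chrom_start + s + n)
            for s, n in zip(rec.block_starts, rec.block_sizes)]


def first_exon(rec):
    """First exon in transcription order: the last block on the minus strand."""
    ex = exons(rec)
    return ex[-1] if rec.strand == "-" else ex[0]


# Row copied from the first listing in the question.
EXAMPLE = "\t".join([
    "chr1", "33546713", "33586132", "NM_052998", "0", "+",
    "33547850", "33585783", "0", "12",
    "182,121,212,177,174,173,135,166,163,113,215,488,",
    "0,275,488,1065,2841,10937,12169,13435,15594,16954,36789,38931,",
])
record = parse_bed12(EXAMPLE)
```

Because blockStarts are offsets from chromStart, the first exon of a plus-strand transcript starts at chromStart, and the last exon of any transcript should end at chromEnd.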

  • #2
    This looks like the "whole gene" description format, which is not what you want here.
    - Go back to the UCSC Table Browser.
    - Select Mammal / Human / hg19 / Genes / UCSC genes (for example) / known genes / output=BED.
    - Then click on "get output".
    - On the next page ("Output known genes as BED") select "5'UTR exons" and click "get BED".

    Now you have a BED file with one entry per initial exon. Just use bedtools to extend each interval 2 kb upstream (using "bedtools slop" with "-l 2000", "-r 0" and "-s", plus a chromosome sizes file given with "-g") and you're done: each interval then runs from 2 kb before the TSS to the first donor site. Keep in mind that initial exons from different transcripts, or even from different genes, might overlap with each other, so you might want to check for redundancy and/or overlap before processing any further.
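To make the strand handling concrete, here is a rough Python sketch of the kind of strand-aware extension that "bedtools slop" with "-s" is meant to perform, followed by a simple pass that drops exact duplicate regions. It assumes 6-column BED input (strand in column 6) with 0-based, half-open coordinates. The function names and the optional `chrom_sizes` dictionary are illustrative assumptions, not bedtools code, and overlapping but non-identical intervals are not merged.

```python
def extend_upstream(chrom, start, end, strand, upstream=2000, chrom_sizes=None):
    """Extend an interval `upstream` bp 5' of its start, honoring strand."""
    if strand == "-":
        # Minus strand: the TSS sits at the end coordinate, so upstream
        # means larger coordinates.
        new_start, new_end = start, end + upstream
    else:
        # Plus strand: upstream means smaller coordinates.
        new_start, new_end = start - upstream, end
    new_start = max(0, new_start)  # never run off the start of the chromosome
    if chrom_sizes is not None and chrom in chrom_sizes:
        # Clamp to the chromosome length when it is known (assumed dict input).
        new_end = min(new_end, chrom_sizes[chrom])
    return chrom, new_start, new_end, strand


def unique_regions(bed_lines, upstream=2000, chrom_sizes=None):
    """Extend every BED6 line and keep each resulting region only once."""
    seen = set()
    regions = []
    for line in bed_lines:
        # Skip blank lines and UCSC header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        f = line.rstrip("\n").split("\t")
        region = extend_upstream(f[0], int(f[1]), int(f[2]), f[5],
                                 upstream, chrom_sizes)
        if region not in seen:
            seen.add(region)
            regions.append(region)
    return regions
```

Transcripts that share the same initial exon collapse to a single region here. Partially overlapping regions are kept separately, so a merge step may still be needed depending on the downstream analysis.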



    • #3
      thanks! it works like a charm.

      Originally posted by syfo (post #2)
