SEQanswers

Old 09-20-2010, 01:07 PM   #1
rkusko
Junior Member
 
Location: USA

Join Date: Jul 2010
Posts: 4
Gene level ensembl bed file?

Hey all,
I'm trying to run Scripture's score task to look at human gene-level expression. Does anyone know where I can find a BED6 file with Ensembl gene annotation (ENSG) instead of Ensembl transcript annotation (ENST)?
I realize I could use the counts from the transcript-level output to get gene-level output, but then I lose the FWER p-values that Scripture outputs.
Old 09-21-2010, 03:20 PM   #2
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 117

Do you mean that you wish to merge overlapping exons for Ensembl genes that have multiple isoforms? So that you have a single BED line for each gene?
Old 09-21-2010, 07:13 PM   #3
rkusko
Junior Member
 
Location: USA

Join Date: Jul 2010
Posts: 4

Yes, I am looking for a file (or working on creating one) in which each Ensembl gene is one line of a BED6 file.

I'm not so concerned with isoforms. Most Ensembl genes have multiple Ensembl transcripts that fall within the location of the gene. The only Ensembl human genome BED6 files I have found contain Ensembl transcript (ENST) IDs rather than Ensembl gene (ENSG) IDs. I do have a BED12 file at the gene level instead of the transcript level, but it is taking me some time to write a script to convert it. That is why I am looking for a human ENSG BED6 file.

Does that answer your question?
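
For what it's worth, a BED12-to-BED6 conversion can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the input is standard tab-separated BED12, the function names `bed12_to_bed6` and `bed12_blocks_to_bed6` are made up for illustration, and the example record is invented rather than real annotation.

```python
def bed12_to_bed6(line):
    """Keep the first six BED columns: chrom, start, end, name, score, strand."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-separated columns")
    return "\t".join(fields[:6])


def bed12_blocks_to_bed6(line):
    """Expand the BED12 blocks (exons) into one BED6 line per block."""
    f = line.rstrip("\n").split("\t")
    chrom, start, name, score, strand = f[0], int(f[1]), f[3], f[4], f[5]
    # blockSizes and blockStarts are comma-separated and may end with a comma
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]
    return [
        "\t".join([chrom, str(start + off), str(start + off + size),
                   name, score, strand])
        for size, off in zip(sizes, offsets)
    ]


# Invented example record (not real annotation): two blocks of 50 and 100 bp
example = "chr1\t100\t500\tGENE_A\t0\t+\t100\t500\t0\t2\t50,100,\t0,300,"
print(bed12_to_bed6(example))
for exon_line in bed12_blocks_to_bed6(example):
    print(exon_line)
```

The first function gives one line per BED12 record covering its whole span; the second gives one line per block, which may or may not be what a downstream tool expects.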
Old 09-22-2010, 02:46 AM   #4
BetterPrimate
Member
 
Location: NSW

Join Date: May 2010
Posts: 15

Use the UCSC Table Browser. Instead of selecting output to BED, choose "selected fields from table" and pick the 6 fields you need. I'm not sure whether the fields will come out in the order you need. If not, that'd be easy to fix if you do it through Galaxy.
Old 09-22-2010, 09:53 AM   #5
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 117

This was my initial thought as well. However, when I tried it in Galaxy, the import just sat there forever with the message 'waiting to run'. Perhaps this was just a temporary problem and Galaxy will do the trick...

When I tried it directly in the UCSC table browser it gave me a BED file with one line per transcript, even though I selected the Gene table and specified that the BED be created with 'one line per whole gene'.

Of course, the info you would need to create a gene-level BED file is in this file. I was also able to get the necessary info from Ensembl BioMart, but again I didn't see an obvious way to output directly to BED6 format.
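
If the gene-level info comes out of BioMart as a plain table, turning a row into BED6 is mostly a coordinate shift. The following is a rough Python sketch, assuming an export with gene ID, chromosome name, 1-based start, end, and strand as 1 or -1; the function name `biomart_row_to_bed6` and the example values are hypothetical. Adding a "chr" prefix is optional and only a partial mapping to UCSC names (it does not handle MT versus chrM, for example).

```python
def biomart_row_to_bed6(gene_id, chrom, start, end, strand, add_chr_prefix=True):
    """Convert one gene row (1-based, inclusive start) to a BED6 line.

    BED uses a 0-based start and an exclusive end, so only the start shifts.
    """
    chrom = str(chrom)
    if add_chr_prefix and not chrom.startswith("chr"):
        chrom = "chr" + chrom
    bed_strand = "+" if int(strand) > 0 else "-"
    # Score column set to 0 as a placeholder
    return "\t".join(
        [chrom, str(int(start) - 1), str(int(end)), gene_id, "0", bed_strand]
    )


# Made-up row for illustration only
print(biomart_row_to_bed6("ENSG_EXAMPLE", "21", 1001, 2000, -1))
```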

Since rkusko already has the required info, but in BED12 format, neither the UCSC nor the Ensembl option seems more convenient.

Where did the BED12 version come from? Can you post it, or a sample of it in case someone has a ready-made converter to try...
Old 09-22-2010, 10:07 AM   #6
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 117

Update:

My Galaxy task did finally complete. I used Galaxy to import the Ensembl gene annotations from UCSC and output them as BED (which can be done entirely at UCSC just as easily). Unfortunately, this produces one line per transcript not one line per gene. In retrospect, this is not surprising given that UCSC's concept of a 'gene' is basically a transcript.

Anyway, why not start with the transcript-level file and use BEDTools to merge overlapping features on the same strand? You should be able to do this using the 'mergeBed' function with the '-s' option to force strandedness.

One potential problem I see with this approach is that in rare cases there may be multiple genes on the same strand with some overlap, so you might accidentally merge these into a single gene. mergeBed can report the names of the features that were merged, so you could use this option and then explicitly look for cases where transcripts from different genes were merged.
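
The merge-then-check idea can be sketched in plain Python without calling mergeBed at all. This is only an illustration of the logic, not BEDTools itself: the function names, the transcript-to-gene dictionary `gene_of`, and all IDs and coordinates are invented, and treating touching intervals as mergeable is a choice made in this sketch.

```python
from collections import defaultdict


def merge_same_strand(records):
    """Merge overlapping or touching intervals per (chrom, strand).

    records: iterable of (chrom, start, end, name, strand).
    Returns a list of (chrom, start, end, names, strand).
    """
    groups = defaultdict(list)
    for chrom, start, end, name, strand in records:
        groups[(chrom, strand)].append((start, end, name))

    merged = []
    for (chrom, strand), intervals in sorted(groups.items()):
        intervals.sort()
        cur_start, cur_end, names = intervals[0][0], intervals[0][1], [intervals[0][2]]
        for start, end, name in intervals[1:]:
            if start <= cur_end:  # overlaps or touches the current interval
                cur_end = max(cur_end, end)
                names.append(name)
            else:
                merged.append((chrom, cur_start, cur_end, names, strand))
                cur_start, cur_end, names = start, end, [name]
        merged.append((chrom, cur_start, cur_end, names, strand))
    return merged


def mixed_gene_merges(merged, gene_of):
    """Return merged intervals whose transcripts map to more than one gene."""
    return [m for m in merged if len({gene_of[n] for n in m[3]}) > 1]


# Invented transcripts: TX_B1 belongs to a different gene but overlaps TX_A2
records = [
    ("chr1", 100, 200, "TX_A1", "+"),
    ("chr1", 150, 300, "TX_A2", "+"),
    ("chr1", 250, 400, "TX_B1", "+"),
    ("chr1", 120, 180, "TX_C1", "-"),
]
gene_of = {"TX_A1": "GENE_A", "TX_A2": "GENE_A", "TX_B1": "GENE_B", "TX_C1": "GENE_C"}

merged = merge_same_strand(records)
for hit in mixed_gene_merges(merged, gene_of):
    print("merged across genes:", hit)
```

Any interval reported by the second function is one where a per-strand merge would have fused two genes into a single line.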
Old 09-22-2010, 07:18 PM   #7
adamdeluca
Member
 
Location: Iowa City, IA

Join Date: Jul 2010
Posts: 95

Quote:
Originally Posted by malachig
One potential problem I see with this approach is that in rare cases there may be multiple genes, on the same strand with some overlap... So you might accidentally merge these into a single gene
A hack, but... BEDTools' mergeBed just treats the chromosome as a string. Concatenate the ENSG ID and the chromosome name in the first column and it will merge the way you want.

Code:
ENSG00000166157_chr21	9928080	10012775
ENSG00000166157_chr21	9928080	10012775
ENSG00000166157_chr21	9928080	9993593
ENSG00000166157_chr21	9928080	10012775
ENSG00000166157_chr21	9928611	10012753
becomes
Code:
ENSG00000166157_chr21	9928080	10012775
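
The same keying trick can be sketched in Python, including the step of splitting the key back into a proper BED6 line afterwards. The helper names are made up, and the behaviour differs slightly from mergeBed: this sketch takes the overall span per key, so non-overlapping transcripts of one gene also end up on a single line. The strand is written as "." because the example above does not show one.

```python
def make_key(gene_id, chrom):
    """Build the combined 'ENSG..._chrN' key used in the first column."""
    return f"{gene_id}_{chrom}"


def split_key(key):
    """Split on the first underscore only, since chromosome names may contain '_'."""
    gene_id, chrom = key.split("_", 1)
    return gene_id, chrom


def collapse_to_bed6(rows, strand="."):
    """rows: iterable of (key, start, end). Returns one BED6 tuple per key."""
    spans = {}
    for key, start, end in rows:
        lo, hi = spans.get(key, (start, end))
        spans[key] = (min(lo, start), max(hi, end))
    out = []
    for key, (lo, hi) in spans.items():
        gene_id, chrom = split_key(key)
        out.append((chrom, lo, hi, gene_id, 0, strand))
    return out


# The five transcript rows from the example above
key = make_key("ENSG00000166157", "chr21")
rows = [
    (key, 9928080, 10012775),
    (key, 9928080, 10012775),
    (key, 9928080, 9993593),
    (key, 9928080, 10012775),
    (key, 9928611, 10012753),
]
for bed6 in collapse_to_bed6(rows):
    print("\t".join(str(x) for x in bed6))
```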
Old 10-24-2012, 02:30 AM   #8
biocyberman
Junior Member
 
Location: Denmark

Join Date: Nov 2010
Posts: 2

Quote:
Originally Posted by malachig
Update:

My Galaxy task did finally complete. I used Galaxy to import the Ensembl gene annotations from UCSC and output them as BED (which can be done entirely at UCSC just as easily). Unfortunately, this produces one line per transcript not one line per gene. In retrospect, this is not surprising given that UCSC's concept of a 'gene' is basically a transcript.

Anyway, why not start with the transcript level file and use BEDTOOLS to merge overlapping features on the same strand. You should be able to do this using the 'mergeBed' function with the '-s' option to force strandedness.

One potential problem I see with this approach is that in rare cases there may be multiple genes, on the same strand with some overlap... So you might accidentally merge these into a single gene... mergeBed allows you to report the names of the things that were merged so you could use this option and then explicitly look for cases where transcripts from different genes were merged.
BEDTools looks interesting. Thanks :-)