**laura** · 11-17-2010, 01:53 PM

BioMart can return FASTA files of transcript sequences, either for a part of the genome or for a list of gene IDs

http://www.ensembl.org/biomart/

http://www.ensembl.org/Help/Movie?id=189

Look at the results tab sequence option for more details

Alternatively you can do it with the api

The api tutorial has lots of details

Core API Tutorial

http://www.ensembl.org/info/docs/api/core/core_tutorial.html

but here is a quick bit of Perl which will work once you have installed the APIs. It won't be the fastest way to do it; that would be to get the FASTA dump from the FTP site (http://www.ensembl.org/info/data/ftp/index.html).

Code:

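use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# $species must be set before use; human is assumed here
my $species = 'Homo sapiens';

my $registry = 'Bio::EnsEMBL::Registry';

# Connect to the public Ensembl database server
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $sa = $registry->get_adaptor($species, 'core', 'Slice');
my $slices = $sa->fetch_all('toplevel');

# Print the ID and spliced sequence of every transcript of every protein coding gene
foreach my $slice (@$slices) {
  foreach my $gene (@{$slice->get_all_Genes_by_type('protein_coding')}) {
    foreach my $transcript (@{$gene->get_all_Transcripts}) {
      print $transcript->display_id . "\n";
      print $transcript->seq->seq . "\n";
    }
  }
}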

This particular piece of code will return all the sequences of all the protein coding genes in the ensembl human database
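The script above prints each ID and sequence on plain lines rather than as FASTA records. A minimal sketch of a helper that formats one record is shown here; `fasta_record` is a hypothetical name, not part of the Ensembl API, and the 60-column default is only a common convention.

```perl
use strict;
use warnings;

# Format one sequence as a FASTA record, wrapping the sequence at $width columns.
# The 60-column default is a common convention, not an Ensembl requirement.
sub fasta_record {
    my ($id, $seq, $width) = @_;
    $width ||= 60;
    my $record = ">$id\n";
    # Consume the sequence in chunks of at most $width characters.
    while (length $seq) {
        $record .= substr($seq, 0, $width, '') . "\n";
    }
    return $record;
}

# Hypothetical ID and sequence, for illustration only.
print fasta_record('TRANSCRIPT_1', 'ACGTACGTAC', 4);
```

In the loop above, the two print statements could then be replaced by a single `print fasta_record($transcript->display_id, $transcript->seq->seq);`.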

As far as conventions go, ensembl returns the sequence from the strand the transcript is on so a forward strand transcript would give you spliced forward strand sequence but a reverse strand transcript would give you spliced reverse strand sequence.

Something to remember is that, while the API can give you the appropriate cDNA sequence for a transcript, Ensembl always reports coordinates on the forward strand, 5' to 3', so the start is always smaller than the end. This means that for reverse strand transcripts the last exon in terms of genomic coordinates is where transcription starts
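To make the strand convention concrete, here is a small sketch in plain Perl that does not use the Ensembl API. It assumes a toy forward-strand sequence and made-up exon coordinates (1-based, inclusive, start smaller than end), and the function names `revcomp` and `spliced_sequence` are hypothetical. It joins the exons in forward-strand order and reverse-complements the result for a reverse strand transcript.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Build the spliced sequence of a transcript from forward-strand exon coordinates.
# $genome : forward-strand sequence of the region (toy data here)
# $exons  : array ref of [start, end] pairs, 1-based inclusive, start <= end
# $strand : 1 for forward, -1 for reverse
sub spliced_sequence {
    my ($genome, $exons, $strand) = @_;
    # Join the exons in ascending genomic order on the forward strand first.
    my @sorted  = sort { $a->[0] <=> $b->[0] } @$exons;
    my $spliced = join '', map { substr($genome, $_->[0] - 1, $_->[1] - $_->[0] + 1) } @sorted;
    # A reverse strand transcript reads from the highest coordinate downwards.
    return $strand == -1 ? revcomp($spliced) : $spliced;
}

# Hypothetical 20 bp region with two exons at 3-6 and 11-14.
my $genome = 'AACCGATTAAACTTACGGTT';
my @exons  = ([3, 6], [11, 14]);

print spliced_sequence($genome, \@exons, 1),  "\n";
print spliced_sequence($genome, \@exons, -1), "\n";
```

For the reverse strand case, the exon at 11-14 contributes the first bases of the output even though it has the larger genomic coordinates.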

**bioinfosm** · 11-17-2010, 03:00 PM

laura, how is the ambiguity of overlapping transcripts resolved in such cases?

**laura** · 11-17-2010, 03:33 PM

Not sure what you mean by overlapping transcripts

Each transcript within a gene is an independent entity and as such has an independent cDNA sequence

You can get all the exons which belong to a gene and print the sequence of each exon and in this case you would end up printing out potentially overlapping sequence. The position of the transcript or exon (start and end methods) would provide you with this information
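As a rough illustration of using start and end positions to spot overlapping exons, the sketch below merges overlapping intervals into non-redundant blocks. It is plain Perl on made-up coordinates; `merge_intervals` is a hypothetical helper, and with the API the pairs would presumably come from each exon's start and end methods.

```perl
use strict;
use warnings;

# Merge overlapping [start, end] intervals (1-based, inclusive) into
# non-redundant blocks.
sub merge_intervals {
    my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @_;
    my @merged;
    for my $iv (@sorted) {
        if (@merged && $iv->[0] <= $merged[-1][1]) {
            # Overlaps the previous block: extend its end if needed.
            $merged[-1][1] = $iv->[1] if $iv->[1] > $merged[-1][1];
        }
        else {
            push @merged, [ @$iv ];   # copy so the input is not modified
        }
    }
    return @merged;
}

# Hypothetical exons from two transcripts of one gene; 100-200 and 150-250 overlap.
my @exons = ([100, 200], [150, 250], [400, 500]);
print join(',', map { join '-', @$_ } merge_intervals(@exons)), "\n";
```

Printing exon sequences from the merged blocks instead of the raw exons would avoid emitting the same genomic bases twice.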

**kouroshz** · 11-17-2010, 04:19 PM

Thank you for the reply. When you say "ensembl returns the sequence from the strand the transcript is on", does that mean the sequence is identical to the mRNA transcript sequence? What if there are overlapping genes?

**laura** · 11-17-2010, 11:23 PM

In an Ensembl annotation set, all overlapping transcripts on the same strand are merged into a single gene locus

By mRNA sequence I am referring to the spliced genomic sequence defined by the exon coordinates

**bioinfosm** · 11-18-2010, 02:55 PM

I meant on different strands. There are cases when there is transcription in both directions for a region (overlapping genes or transcripts)

**laura** · 11-18-2010, 03:14 PM

Genes which overlap but on opposite strands are considered independent entities and neither biomart nor ensembl will give you any object which contains them both
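Since no single object holds both genes, such pairs have to be found by comparing coordinates yourself. The following is a minimal sketch in plain Perl, assuming gene records have already been fetched into hashes; the field names, IDs and coordinates are made up, and `overlaps` and `antisense_pairs` are hypothetical helpers.

```perl
use strict;
use warnings;

# Two features overlap if each starts before the other ends
# (coordinates 1-based, inclusive, start <= end).
sub overlaps {
    my ($x, $y) = @_;
    return $x->{start} <= $y->{end} && $y->{start} <= $x->{end};
}

# Return pairs of genes on the same chromosome that overlap on opposite strands.
sub antisense_pairs {
    my @genes = @_;
    my @pairs;
    for my $i (0 .. $#genes - 1) {
        for my $j ($i + 1 .. $#genes) {
            my ($g, $h) = @genes[$i, $j];
            next unless $g->{chr} eq $h->{chr};
            next unless $g->{strand} != $h->{strand};
            push @pairs, [ $g->{id}, $h->{id} ] if overlaps($g, $h);
        }
    }
    return @pairs;
}

# Hypothetical gene records; IDs and coordinates are made up.
my @genes = (
    { id => 'geneA', chr => '1', start => 1000, end => 5000, strand =>  1 },
    { id => 'geneB', chr => '1', start => 4000, end => 8000, strand => -1 },
    { id => 'geneC', chr => '1', start => 9000, end => 9500, strand => -1 },
);
print join(' / ', @$_), "\n" for antisense_pairs(@genes);
```

The all-against-all comparison is quadratic, which is fine for a sketch but would want sorting by start position for a whole genome.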

Is there something particular you are looking for here?

**kouroshz** · 11-19-2010, 05:17 PM

Hi,

I am trying to download the annotation for coding exons (and 3' UTRs) in bed format from Ensembl. I have a few questions, which I would highly appreciate your help with.

1) Is there a way to get the chromosome name, genomic coordinates, and strand information directly?

2) If I do the following in BioMart
Data set:
Homo sapiens (GRCh37)
Filter:
Protein_coding
Chromosome:1
Attributes:
Chromosome Name
Exon Chr Start (bp)
Exon Chr End (bp)
CDS Start
CDS End
cDNA coding start
cDNA coding end
Exon Rank in Transcript
5' UTR Start
5' UTR End
3' UTR Start
3' UTR End
Transcript Start (bp)
Transcript End (bp)
Strand

I get a table, however I cannot really explain the result.

a) are cDNA coding start and stop relative to the transcript?

b) are CDS start and stop relative to cDNA coding sequence?

c) if the above two are correct, then the output of the above filters and attributes will indicate that the cDNA start is between Exon 1 and Exon 2.

I would very much appreciate your help in obtaining the annotations for coding exons and UTRs in separate bed files. Also I would prefer to download the files and do my own text parsing.

Thank you very much.

Kourosh.

**bioinfosm** · 11-19-2010, 09:33 PM

Originally posted by laura:

> Genes which overlap but on opposite strands are considered independent entities and neither biomart nor ensembl will give you any object which contains them both
>
> Is there something particular you are looking for here?

I was looking for discussion on how to evaluate expression of overlapping genes on opposite strands. I would guess RNA-Seq cannot really do that, unless it's a strand-specific library preparation protocol.

thanks

**lmf_bill** · 11-20-2010, 05:25 AM

Correct, it is hard to estimate the expression of overlapping genes by RNA-Seq with the popular protocols. It is still possible for genes that overlap over only a short part.

**laura** · 11-20-2010, 08:04 AM

Kourosh, I would suggest you might be better getting the details from the ensembl core database rather than using biomart as you can more easily tailor the results you need

For ideas on how to use the ensembl api and the public instance beyond the example perl script I gave above have a look here

Core API Tutorial

http://www.ensembl.org/info/docs/api/core/core_tutorial.html
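For the BED output itself, here is a hedged sketch in plain Perl of the text-processing step, independent of how the coordinates were obtained. It assumes each exon is given with chromosome name, forward-strand start and end (1-based, inclusive) and strand, and that the genomic start and end of the coding region are already known. BED uses a 0-based start with an unchanged end, so only the start is shifted. The helper names and the `5UTR`/`CDS`/`3UTR` labels are hypothetical.

```perl
use strict;
use warnings;

# Convert a 1-based inclusive interval to a BED line (0-based start, end unchanged).
sub bed_line {
    my ($chr, $start, $end, $name, $strand) = @_;
    return join("\t", $chr, $start - 1, $end, $name, 0, $strand == 1 ? '+' : '-') . "\n";
}

# Split one exon into coding and UTR parts using the genomic coding start/end.
# All coordinates are forward-strand, 1-based, inclusive, start <= end.
sub split_exon {
    my ($exon, $cds_start, $cds_end) = @_;
    my ($chr, $s, $e, $strand) = @{$exon}{qw(chr start end strand)};
    # The UTR to the left of the CDS is 5' on the forward strand, 3' on the reverse.
    my ($left, $right) = $strand == 1 ? ('5UTR', '3UTR') : ('3UTR', '5UTR');
    my @lines;
    if ($s < $cds_start) {
        my $utr_end = $e < $cds_start ? $e : $cds_start - 1;
        push @lines, bed_line($chr, $s, $utr_end, $left, $strand);
    }
    my $c_start = $s > $cds_start ? $s : $cds_start;
    my $c_end   = $e < $cds_end   ? $e : $cds_end;
    push @lines, bed_line($chr, $c_start, $c_end, 'CDS', $strand) if $c_start <= $c_end;
    if ($e > $cds_end) {
        my $utr_start = $s > $cds_end ? $s : $cds_end + 1;
        push @lines, bed_line($chr, $utr_start, $e, $right, $strand);
    }
    return @lines;
}

# Hypothetical forward strand transcript with coding region 150-1000.
my @exons = (
    { chr => '1', start => 100, end => 200,  strand => 1 },
    { chr => '1', start => 900, end => 1100, strand => 1 },
);
print split_exon($_, 150, 1000) for @exons;
```

Filtering the output on the name column would give separate files for coding exons and UTRs. Note that Ensembl chromosome names have no `chr` prefix, so they may need adjusting for tools that expect UCSC-style names.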

Annotation Convention
