  • Annotation Convention

    Hi All,

    I am a newbie and I have a basic question. In exon annotations from UCSC or BioMart, what convention is used for reporting the strand?

    What I need to know is how to get the actual RNA transcript sequence from the exon annotations. Is the RNA sequence the same as the exon sequence, or do I have to reverse complement the exon sequence?

    More precisely, how do I obtain the RNA sequence in each of the following examples:

    1)

    Exon Start:1000 Stop: 1004 Strand:+ Seq: ACGT

    2)

    Exon Start:1000 Stop: 1004 Strand:- Seq: ACGT


    Thanks

    k

  • #2
    BioMart can return FASTA files of transcript sequences, either for a region of the genome or for a list of gene IDs.

    Look at the sequence option on the results tab for more details.

    Alternatively, you can do it with the API.

    The API tutorial has lots of details,

    but here is a quick bit of Perl that will work once you have installed the APIs. It won't be the fastest way to do it (that would be to get the FASTA dump from the FTP site: http://www.ensembl.org/info/data/ftp/index.html).

    Code:
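    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    # Connect to the public Ensembl database server
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    # Species to query; the original snippet left $species undefined
    my $species = 'Homo sapiens';

    my $sa = $registry->get_adaptor($species, 'core', 'Slice');
    # Fetch every top-level sequence region
    my $slices = $sa->fetch_all('toplevel');

    foreach my $slice (@$slices) {
      foreach my $gene (@{ $slice->get_all_Genes_by_type('protein_coding') }) {
        foreach my $transcript (@{ $gene->get_all_Transcripts }) {
          # Print the transcript identifier, then its spliced cDNA sequence
          print $transcript->display_id . "\n";
          print $transcript->seq->seq . "\n";
        }
      }
    }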
    This particular piece of code prints the identifier and spliced cDNA sequence of every transcript of every protein-coding gene in the Ensembl human database.

    As far as conventions go, Ensembl returns the sequence from the strand the transcript is on: a forward-strand transcript gives you the spliced forward-strand sequence, while a reverse-strand transcript gives you the spliced reverse-strand sequence, i.e. the reverse complement of the forward-strand genomic sequence.
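Applied to the two examples in the original question, the strand convention can be sketched in standalone Python (not the Ensembl API). This is a minimal sketch that assumes the exon sequence in the annotation was read from the forward (reference) strand; if your source already reports the sequence on the transcript strand, no flipping is needed. The function names are made up for illustration.

```python
# Minimal sketch: orient a forward-strand exon sequence to the transcript strand.
# Assumption: `seq` was extracted from the forward (reference) strand.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_sense(seq: str, strand: str) -> str:
    """Return the sequence as it reads 5' to 3' on the transcript."""
    if strand == "+":
        return seq
    if strand == "-":
        return reverse_complement(seq)
    raise ValueError(f"unknown strand: {strand!r}")


def to_rna(dna: str) -> str:
    """Write the transcript-sense DNA sequence as RNA (T becomes U)."""
    return dna.replace("T", "U").replace("t", "u")


if __name__ == "__main__":
    for strand in ("+", "-"):
        print(strand, to_rna(transcript_sense("ACGT", strand)))
```

Note that ACGT happens to be its own reverse complement, so both examples from the question give the same result; a sequence such as AACG makes the difference visible (CGTT on the minus strand).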

    Something to remember: while the API can give you the appropriate cDNA sequence for a transcript, Ensembl always reports coordinates on the forward strand, 5' to 3', so the start is always smaller than the end. For reverse-strand transcripts this means that the last exon in terms of genomic coordinates (the one with the highest start and end) is where transcription starts.
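If you parse the coordinates yourself instead of using the API, the point about exon order can be sketched as follows. This is a hedged illustration in standalone Python: it assumes 1-based inclusive forward-strand coordinates with start <= end, strand coded as 1 or -1, and a chromosome sequence already loaded as a string. All names are hypothetical.

```python
# Minimal sketch: build a spliced transcript from forward-strand exon coordinates.
# Assumptions: coordinates are 1-based and inclusive, start <= end,
# strand is 1 (forward) or -1 (reverse).

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def exons_in_transcription_order(exons, strand):
    """On the reverse strand the exon with the highest coordinates comes first."""
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    return sorted(exons, reverse=(strand == -1))


def spliced_transcript(chrom_seq, exons, strand):
    """Join exons in genomic order, then flip the result for reverse-strand transcripts."""
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    forward = "".join(chrom_seq[start - 1:end] for start, end in sorted(exons))
    return forward if strand == 1 else reverse_complement(forward)


if __name__ == "__main__":
    toy_chromosome = "AAACCCGGGTTT"
    toy_exons = [(1, 3), (7, 9)]
    print(spliced_transcript(toy_chromosome, toy_exons, 1))
    print(spliced_transcript(toy_chromosome, toy_exons, -1))
```

Joining in genomic order and reverse complementing once at the end gives the same result as reverse complementing each exon and joining them in transcription order.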



    • #3
      laura, how is the ambiguity of overlapping transcripts resolved in such cases?
      --
      bioinfosm



      • #4
        Not sure what you mean by overlapping transcripts.

        Each transcript within a gene is an independent entity and as such has an independent cDNA sequence.

        You can get all the exons that belong to a gene and print the sequence of each exon, in which case you could end up printing overlapping sequence. The position of each transcript or exon (the start and end methods) tells you where such overlaps occur.
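To make the point about positions concrete, here is a small standalone Python sketch that uses exon start and end values to find exons from different transcripts that cover the same genomic bases. The tuple layout and identifiers are invented for illustration; coordinates are assumed to be 1-based, inclusive, and on the same chromosome and strand.

```python
# Minimal sketch: find exons of different transcripts that overlap in the genome.
# Assumption: each exon is (transcript_id, start, end), 1-based inclusive.
from itertools import combinations


def intervals_overlap(start_a, end_a, start_b, end_b):
    """True if the two closed intervals share at least one base."""
    return start_a <= end_b and start_b <= end_a


def overlapping_exon_pairs(exons):
    """Return every pair of exons from different transcripts that overlap."""
    pairs = []
    for first, second in combinations(exons, 2):
        different_transcript = first[0] != second[0]
        if different_transcript and intervals_overlap(first[1], first[2],
                                                      second[1], second[2]):
            pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    gene_exons = [("T1", 100, 200), ("T2", 150, 250), ("T2", 300, 400)]
    for pair in overlapping_exon_pairs(gene_exons):
        print(pair)
```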
        Last edited by laura; 11-17-2010, 11:23 PM.



        • #5
          Thank you for the reply. When you say "ensembl returns the sequence from the strand the transcript is on", does that mean the sequence is identical to the mRNA transcript sequence? What if there are overlapping genes?



          • #6
            In an Ensembl annotation set, all overlapping transcripts on the same strand are merged into a single gene locus.

            By mRNA sequence I am referring to the spliced genomic sequence defined by the exon coordinates.



            • #7
              I meant on different strands. There are cases when there is transcription in both directions for a region (overlapping genes or transcripts)
              --
              bioinfosm



              • #8
                Genes which overlap but on opposite strands are considered independent entities and neither biomart nor ensembl will give you any object which contains them both

                Is there something particular you are looking for here?



                • #9
                  Hi,

                  I am trying to download the annotation for coding exons (and 3' UTRs) in BED format from Ensembl. I have a few questions, which I would highly appreciate your help with.

                  1) Is there a way to get the chromosome name, genomic coordinates, and strand information directly?

                  2) If I do the following in BioMart
                  Data set:
                  Homo sapiens (GRCh37)
                  Filter:
                  Protein_coding
                  Chromosome:1
                  Attributes:
                  Chromosome Name
                  Exon Chr Start (bp)
                  Exon Chr End (bp)
                  CDS Start
                  CDS End
                  cDNA coding start
                  cDNA coding end
                  Exon Rank in Transcript
                  5' UTR Start
                  5' UTR End
                  3' UTR Start
                  3' UTR End
                  Transcript Start (bp)
                  Transcript End (bp)
                  Strand

                  I get a table, but I cannot really explain the result.

                  a) Are cDNA coding start and stop relative to the transcript?

                  b) Are CDS start and stop relative to the cDNA coding sequence?

                  c) If the above two are correct, then the output of the above filters and attributes indicates that the cDNA start is between exon 1 and exon 2.

                  I would very much appreciate your help in obtaining the annotations for coding exons and UTRs in separate BED files. Also, I would prefer to download the files and do my own text parsing.

                  Thank you very much.

                  Kourosh.



                  • #10
                    Originally posted by laura
                    Genes which overlap but on opposite strands are considered independent entities and neither biomart nor ensembl will give you any object which contains them both

                    Is there something particular you are looking for here?

                    I was looking for a discussion on how to evaluate the expression of overlapping genes on opposite strands. I would guess RNA-Seq cannot really do that unless a strand-specific library preparation protocol is used.

                    thanks
                    --
                    bioinfosm



                    • #11
                      Correct, it is hard to estimate the expression of overlapping genes by RNA-Seq with the popular protocols. It is possible when the genes overlap over only a short part of their length.
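The difference between stranded and unstranded data for overlapping genes on opposite strands can be illustrated with a toy read-assignment sketch in standalone Python. It assumes each read's strand has already been converted to the strand of the transcript it came from (some strand-specific protocols yield antisense reads that must be flipped first), and the gene names and coordinates are invented.

```python
# Minimal sketch: assign a read to a gene, with and without strand information.
# Assumptions: coordinates are 1-based inclusive on one chromosome, and the
# read strand equals the strand of the originating transcript.


def assign_read(read, genes, stranded):
    """Return the single gene the read overlaps, or None if none or ambiguous.

    read  -- (start, end, strand)
    genes -- dict mapping gene name to (start, end, strand)
    """
    read_start, read_end, read_strand = read
    hits = []
    for name, (gene_start, gene_end, gene_strand) in genes.items():
        overlaps = read_start <= gene_end and gene_start <= read_end
        strand_ok = (not stranded) or read_strand == gene_strand
        if overlaps and strand_ok:
            hits.append(name)
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    toy_genes = {"geneA": (100, 500, "+"), "geneB": (400, 800, "-")}
    read_in_overlap = (420, 470, "-")
    print(assign_read(read_in_overlap, toy_genes, stranded=True))
    print(assign_read(read_in_overlap, toy_genes, stranded=False))
```

With an unstranded library the read in the shared region cannot be assigned, so only reads outside the overlap inform each gene's expression; this is why a short overlap is less of a problem than a long one.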



                      • #12
                        Kourosh, I would suggest you might be better off getting the details from the Ensembl core database rather than using BioMart, as you can more easily tailor the results to what you need.

                        For ideas on how to use the Ensembl API and the public instance beyond the example Perl script I gave above, have a look here
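For the text-parsing route, whichever source the coordinates come from, two steps are usually needed: splitting each exon into coding and non-coding parts, and converting to BED coordinates. The standalone Python sketch below assumes you have, for each transcript, the genomic start and end of its coding region (BioMart attributes along the lines of "Genomic coding start/end" are an assumption here, so check what your export provides). It also assumes Ensembl-style 1-based inclusive coordinates and strand coded as 1 or -1. BED itself uses 0-based, half-open intervals.

```python
# Minimal sketch: split an exon into coding and UTR parts and write BED lines.
# Assumptions: all inputs are 1-based inclusive forward-strand coordinates;
# cds_start/cds_end give the genomic span of the transcript's coding region.


def split_exon(exon_start, exon_end, cds_start, cds_end):
    """Return (coding_interval_or_None, list_of_noncoding_intervals)."""
    low, high = max(exon_start, cds_start), min(exon_end, cds_end)
    coding = (low, high) if low <= high else None
    noncoding = []
    if exon_start < cds_start:  # part lying before the coding region
        noncoding.append((exon_start, min(exon_end, cds_start - 1)))
    if exon_end > cds_end:  # part lying after the coding region
        noncoding.append((max(exon_start, cds_end + 1), exon_end))
    return coding, noncoding


def bed_line(chrom, start, end, name, strand):
    """Format one BED6 line: 0-based start, end unchanged, strand as +/-."""
    sign = "+" if strand == 1 else "-"
    return f"{chrom}\t{start - 1}\t{end}\t{name}\t0\t{sign}"


if __name__ == "__main__":
    coding, utr = split_exon(100, 200, 150, 1000)
    print(bed_line("1", coding[0], coding[1], "exon1_coding", 1))
    for utr_start, utr_end in utr:
        print(bed_line("1", utr_start, utr_end, "exon1_utr", 1))
```

Which non-coding piece is the 5' UTR and which is the 3' UTR depends on the strand: for a reverse-strand transcript, the piece with the lower genomic coordinates belongs to the 3' UTR. Also note that Ensembl names chromosomes "1", "2", and so on, while UCSC-style files use "chr1", "chr2".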
