Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • can myrna build ref jar files from metazoa.ensembl.org databases?

    I have tried to alter the Ensembl.R script to use the correct "mart" by altering the the respective line to:

    ensembl = useMart("metazoa_mart_7", dataset=paste(organism, "_eg_gene", sep=""))

    which I gleaned from playing with useMart() and listDatasets() and the like. The script seems to connect to the right mart but chokes at the second stage with the following error:


    Code:
    augustine@jupiter 02:31:18 ~/work/data/myrna/refs/aedes_metazoaEnsmbl_r7: 
    Rscript --vanilla --default-packages=base,methods,utils,stats,IRanges,biomaRt,Biostrings /home/augustine/src/myrna-1.0.9/reftools/Ensembl_metazoa.R --args aaegypti ftp://ftp.ensemblgenomes.org/pub/metazoa/release-7/fasta/aedes_aegypti 0 0
    Ensembl.R [02h:32m:20s]: organism: aaegypti
    Ensembl.R [02h:32m:20s]: 1/11. Connecting to Ensembl via biomaRt
    Ensembl.R [02h:32m:36s]: 2/11. Getting table of exons (somewhat slow)
    Error in getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id",  : 
      Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple attribute pages are not allowed
    Execution halted
    It REALLY pissed me off when ensembl moved all the vector insects to a separate site but its pretty infuriating if they have also drifted from the standard DB schema which seems to make all tools built to access ensembl data useless on the metazoa.ensembl.org site! Can anyone help me here?

    Gus
    In science, "fact" can only mean "confirmed to such a degree that it would be perverse to withhold provisional assent." I suppose that apples might start to rise tomorrow, but the possibility does not merit equal time in physics classrooms.
    --Stephen Jay Gould

  • #2
    R code to connect to Metazoa Mart

    Hi Gus,

    This is an example of code to connect to the Metazoa Mart that works for me.

    Code:
    species="aaegypti"
    dataset=sprintf("%s_%s",species,"eg_gene") 
    
    mart = useMart("metazoa_mart_7", dataset=dataset)
    
    gene.attributes=c("ensembl_gene_id","chromosome_name", "start_position","end_position","strand","gene_biotype")
    
    gene.coordinates=getBM( attributes = gene.attributes, mart=mart )
    gene.coordinates$strand=ifelse(gene.coordinates$strand>0,"+","-")
    
    head(gene.coordinates)
    The result is:
    Code:
      ensembl_gene_id chromosome_name start_position end_position strand   gene_biotype
    1      AAEL003237   supercont1.82        1756998      1757815      + protein_coding
    2      AAEL014602 supercont1.1187          47363        55875      + protein_coding
    3      AAEL010223  supercont1.463         378111       391870      + protein_coding
    4      AAEL013583  supercont1.875           6001         7010      + protein_coding
    5      AAEL011704  supercont1.604          22562        25978      + protein_coding
    6      AAEL009206  supercont1.378         908281       910123      + protein_coding
    Hope that helps,

    Gautier (on behalf of Ensembl Metazoa)

    Comment


    • #3
      Thanks, I just gave this a quick try in the R terminal and it seems to work for me too. I am going to try to integrate it into the Ensembl.R script in myrna. If it works would you mind if I publish the solution on my blog for others in my field to benefit from?

      Gus
      In science, "fact" can only mean "confirmed to such a degree that it would be perverse to withhold provisional assent." I suppose that apples might start to rise tomorrow, but the possibility does not merit equal time in physics classrooms.
      --Stephen Jay Gould

      Comment


      • #4
        I spoke too soon. It seems that the problem is not the R code per se it is in fact the metazoa.ensembl schema or something similar. It seems that in the "normal" ensembl DBs all of the following attributes are in the same table or something?

        Code:
        exons <- getBM(attributes=c(
        	"ensembl_gene_id",
        	"ensembl_transcript_id",
        	"ensembl_exon_id",
        	"chromosome_name",
        	"exon_chrom_start",
        	"exon_chrom_end",
        	"is_constitutive",
        	"gene_biotype"), mart=ensembl)
        The problem seems to be that the metazoa.ensembl DBs dont use the same organization scheme, at least when it comes to R-access. Does this sound like a reasonable assumption Gautier? If so is there even a solution to this? Secondly, is the different scheme due to Vectorbase's involvement? It REALLY slows down important research.

        Thanks for your help so far and tolerating my frustrations! I really DO appreciate all that you guys do.

        Gus
        In science, "fact" can only mean "confirmed to such a degree that it would be perverse to withhold provisional assent." I suppose that apples might start to rise tomorrow, but the possibility does not merit equal time in physics classrooms.
        --Stephen Jay Gould

        Comment


        • #5
          It's related to the Ensembl Genomes BioMart not VectorBase.

          The gene biotype attribute is not available when you query the gene structure (contrary to Ensembl)

          A solution is to query the BioMart twice and merge the 2 data frames:

          Code:
          library(biomaRt)
          
          species="aaegypti"
          dataset=sprintf("%s_%s",species,"eg_gene") 
          
          # use Ensembl Metazoa Mart
          
          mart = useMart("metazoa_mart_7", dataset=dataset)
          
          # exon attributes (structure)
          exon.attributes=c(
                  "ensembl_gene_id",
          	"ensembl_transcript_id",
          	"ensembl_exon_id",
          	"chromosome_name",
          	"exon_chrom_start",
          	"exon_chrom_end",
          	"is_constitutive")
          
          # gene attributes (feature)
          gene.attributes=c("ensembl_gene_id","gene_biotype")
          
          # get the gene biotypes
          gene_biotype=getBM( attributes = gene.attributes, mart=mart )
          
          # fetch the exon information
          exon.coordinates=getBM( attributes = exon.attributes, mart=mart )
          head(exons.coordinates)
          
          # merge the 2 datasets
          exons <- merge(exon.coordinates, gene_biotype, by.x = "ensembl_gene_id", by.y = "ensembl_gene_id", all = TRUE)
          
          head(exons)
          The exon.coordinates data.frame contains:

          Code:
          > head(exon.coordinates)
            ensembl_gene_id ensembl_transcript_id ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end is_constitutive
          1      AAEL003237         AAEL003237-RA         E000001   supercont1.82          1756998        1757071               0
          2      AAEL003237         AAEL003237-RA         E000002   supercont1.82          1757140        1757486               0
          3      AAEL003237         AAEL003237-RA         E000003   supercont1.82          1757547        1757815               0
          4      AAEL014602         AAEL014602-RA         E000004 supercont1.1187            47363          47375               0
          5      AAEL014602         AAEL014602-RA         E000005 supercont1.1187            48615          48826               0
          6      AAEL014602         AAEL014602-RA         E000006 supercont1.1187            55774          55875               0
          The gene biotype data.frame contains:

          Code:
          > head(gene_biotype)
            ensembl_gene_id   gene_biotype
          1      AAEL003237 protein_coding
          2      AAEL014602 protein_coding
          3      AAEL010223 protein_coding
          4      AAEL013583 protein_coding
          5      AAEL011704 protein_coding
          6      AAEL009206 protein_coding
          The resulting merge contains:

          Code:
          > head(exons)
            ensembl_gene_id ensembl_transcript_id ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end is_constitutive   gene_biotype
          1      AAEL000001         AAEL000001-RA         E051547    supercont1.1          4298016        4298425               0 protein_coding
          2      AAEL000001         AAEL000001-RA         E051548    supercont1.1          4296651        4297451               0 protein_coding
          3      AAEL000001         AAEL000001-RA         E051549    supercont1.1          4290724        4291102               0 protein_coding
          4      AAEL000002         AAEL000002-RA         E055552    supercont1.1          1151653        1151850               0 protein_coding
          5      AAEL000003         AAEL000003-RA         E019859    supercont1.1          3644561        3644835               0 protein_coding
          6      AAEL000003         AAEL000003-RA         E019860    supercont1.1          3655641        3655938               0 protein_coding
          This should solve your problem.

          Gautier

          Comment


          • #6
            Thanks again in advance.

            I will work that up and see if i can fuse it to the rest of the ref.jar building process of myrna. Once again, would you mind if I post the final product to my blog (with credit to you and the myrna folks of course) so as to help the rest of the poor folks in my position?

            Gus
            In science, "fact" can only mean "confirmed to such a degree that it would be perverse to withhold provisional assent." I suppose that apples might start to rise tomorrow, but the possibility does not merit equal time in physics classrooms.
            --Stephen Jay Gould

            Comment


            • #7
              You can of course publish this script on your blog.

              Best,

              Gautier

              Comment


              • #8
                myrna ref jar files made from plant.ensembl.org databases?

                Hi, I have a really basic question to add to this thread (forgive my ignorance, I'm extremely new to ngs-based analysis) as I would like to build a myrna ref jar for maize. I'll admit I haven't tried yet, but before I do I just wanted to see if the maize genome as it presently exists has "the right stuff" for myrna.

                Are plant.ensembl.org databases structured/formatted the same as metazoa.ensembl.org databases? That is, should the solutions published in this thread be relevant for making a myrna ref jar for maize as well?

                From what I can understand of the Ensembl.R script, it looks like the databases must have:

                $ensembl_gene_id, $ensembl_transcript_id, $exon_chrom_start, $exon_chrom_end

                but I'm not sure if ALL ensembl genomes have these or not.

                Also, just wondering if the "final product" Gus was referring to has been finished and/or published somewhere? I wasn't able to find it on the scipher blog. I am truly one of the "poor folks" in your position, maybe more so as I have only a novice background in programming...

                Thanks for any answers!

                karl

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                12 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                10 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                51 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                68 views
                0 likes
                Last Post seqadmin  
                Working...
                X