  • Extracting information from EMBL flat file

    Hey guys,
    I have the proteome set of C. elegans, which I retrieved from UniProt as an EMBL flat file like this:

    EMBL_FLAT_FILE_CELEGANS
    Code:
    ID   14331_CAEEL             Reviewed;         248 AA.
    AC   P41932; Q21537;
    DT   01-NOV-1995, integrated into UniProtKB/Swiss-Prot.
    DT   22-JUL-2008, sequence version 2.
    DT   28-NOV-2012, entry version 95.
    DE   RecName: Full=14-3-3-like protein 1;
    DE   AltName: Full=Partitioning defective protein 5;
    GN   Name=par-5; Synonyms=ftt-1; ORFNames=M117.2;
    OS   Caenorhabditis elegans.
    OC   Eukaryota; Metazoa; Ecdysozoa; Nematoda; Chromadorea; Rhabditida;
    OC   Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis.
    DR   GO; GO:0005938; C:cell cortex; IDA:WormBase.
    DR   GO; GO:0005634; C:nucleus; IDA:WormBase.
    DR   GO; GO:0045167; P:asymmetric protein localization involved in cell fate determination; IMP:WormBase.
    DR   GO; GO:0001708; P:cell fate specification; IMP:WormBase.
    DR   GO; GO:0043053; P:dauer entry; IMP:WormBase.
    DR   GO; GO:0008340; P:determination of adult lifespan; IMP:WormBase.
    DR   GO; GO:0009792; P:embryo development ending in birth or egg hatching; IMP:WormBase.
    DR   GO; GO:0000132; P:establishment of mitotic spindle orientation; IMP:WormBase.
    DR   GO; GO:0030590; P:first cell cycle pseudocleavage; IMP:WormBase.
    DR   GO; GO:0035188; P:hatching; IMP:WormBase.
    DR   GO; GO:0007126; P:meiosis; IMP:WormBase.
    DR   GO; GO:0002009; P:morphogenesis of an epithelium; IMP:WormBase.
    DR   GO; GO:0009949; P:polarity specification of anterior/posterior axis; IMP:WormBase.
    DR   GO; GO:0035046; P:pronuclear migration; IMP:WormBase.
    DR   GO; GO:0006898; P:receptor-mediated endocytosis; IMP:WormBase.
    DR   GO; GO:0007346; P:regulation of mitotic cell cycle; IMP:WormBase.
    DR   GO; GO:0010070; P:zygote asymmetric cell division; IMP:WormBase.
    SQ   SEQUENCE   248 AA;  28191 MW;  ABBE0DA27D9341AF CRC64;
         MSDTVEELVQ RAKLAEQAER YDDMAAAMKK VTEQGQELSN EERNLLSVAY KNVVGARRSS
         WRVISSIEQK TEGSEKKQQL AKEYRVKVEQ ELNDICQDVL KLLDEFLIVK AGAAESKVFY
         LKMKGDYYRY LAEVASEDRA AVVEKSQKAY QEALDIAKDK MQPTHPIRLG LALNFSVFYY
         EILNTPEHAC QLAKQAFDDA IAELDTLNED SYKDSTLIMQ LLRDNLTLWT SDVGAEDQEQ
         EGNQEAGN
    //
    NOTE: the file shown here is shortened.


    I also have another file with a lot of gene full names, and I would like to extract the GO information for these genes from the EMBL flat file. In other words, does anyone here have a script that reads my file of gene full names (one per line), finds each name in the EMBL flat file and extracts its GO entries? The desired output is the gene full name followed by its gene ontology entries, separated by commas (including every ontology).

    OUTPUT
    Code:
    GENE_A,GO; GO:0001708; P:cell fate specification; IMP:WormBase,GO; GO:0043053; P:dauer entry; IMP:WormBase,GO; GO:0008340; P:determination of adult lifespan; IMP:WormBase,GO; GO:0009792; P:embryo development ending in birth or egg hatching; IMP:WormBase,GO; GO:0000132; P:establishment of mitotic spindle orientation; IMP:WormBase
    If you guys have other ideas it would be nice!

    Cheers.

  • #2
    Do you know any scripting/programming language? Both BioPerl and Biopython (and likely other libraries too) could assist you with their EMBL parsers - although in this case you could do this without a full parser.
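
    A minimal sketch of the no-parser route in Python. It assumes that "gene full name" means the value after "RecName: Full=" on a DE line, and that each name fits on one line with no trailing evidence tags, as in the sample above. The function names and the inline sample are made up for illustration. GO entries are read from DR lines that start with "GO;", and "//" closes an entry.

```python
from collections import defaultdict

PREFIX = "RecName: Full="


def parse_go_by_full_name(lines):
    """Map each 'RecName: Full=' value to the GO cross-references of its entry."""
    go_by_name = defaultdict(list)
    names, go_terms = [], []
    for line in lines:
        # Flat-file layout: two-letter line code, three spaces, then the value.
        code, value = line[:2], line[5:].strip()
        if code == "DE" and value.startswith(PREFIX):
            names.append(value[len(PREFIX):].rstrip(";"))
        elif code == "DR" and value.startswith("GO;"):
            go_terms.append(value.rstrip("."))
        elif code == "//":
            # End of entry: attach the collected GO lines to every name seen.
            for name in names:
                go_by_name[name].extend(go_terms)
            names, go_terms = [], []
    return go_by_name


def format_rows(wanted_names, go_by_name):
    """Yield 'name,GO; ...,GO; ...' for each requested full name."""
    for name in wanted_names:
        yield ",".join([name] + go_by_name.get(name, []))


# Shortened copy of the entry posted above, used here instead of a real file.
SAMPLE = """\
ID   14331_CAEEL             Reviewed;         248 AA.
DE   RecName: Full=14-3-3-like protein 1;
DE   AltName: Full=Partitioning defective protein 5;
GN   Name=par-5; Synonyms=ftt-1; ORFNames=M117.2;
DR   GO; GO:0005938; C:cell cortex; IDA:WormBase.
DR   GO; GO:0005634; C:nucleus; IDA:WormBase.
//
"""

go_by_name = parse_go_by_full_name(SAMPLE.splitlines())
rows = list(format_rows(["14-3-3-like protein 1"], go_by_name))
print(rows[0])
```

    For a real run, pass an open file handle for the flat file instead of the inline sample, and a list of names read from your names file. A name with no match comes out as the bare name with no GO fields, which makes misses easy to spot.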


    • #3
      biomaRt (R/bioconductor): http://www.bioconductor.org/packages...l/biomaRt.html

      Code:
      library( biomaRt )
      
      uniprot = useMart( "unimart" );
      uniprot = useDataset( "uniprot", uniprot );
      
      # inspect these two tables to see which filters (search fields) and attributes (returned columns) are available
      
      filters = listFilters( uniprot );
      attributes = listAttributes( uniprot )
      
      useFilter = c( "accession" );
      useAttributes = c( "accession", "gene_name", "go_id", "go_name" );
      
      query = "P41932";
      df = getBM( mart=uniprot, values=c(query), filters=useFilter, attributes=useAttributes )
      
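      nrow = dim( df )[ 1 ];
      s=sprintf( "%s", df[1,2] );
      # seq_len() skips the loop when the query returns no rows (1:0 would not)
      for( i in seq_len( nrow ) ) {
              s = sprintf( "%s,GO; %s; %s;", s, df[i,3], df[i,4] );
      }
      cat( s, "\n" );   # print the assembled line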
      If you have a text file full of accessions and want output with 1 gene per line:

      Code:
      query = read.table( "queryfile.txt" );
      # assume 1st column is accession
      
      query = as.character( query[,1] );
      
      mdf = getBM( mart=uniprot, values=query, filters=useFilter, attributes=useAttributes )
      
      uniqueAccs = unique( sort( as.character( mdf[,1] ) ) );
      outvec = vector( mode="character", length=0 );
      for( acc in uniqueAccs ) {
              df = mdf[ mdf[,1] == acc, ];
              nrow = dim( df )[ 1 ];
              s=sprintf( "%s", df[1,2] );
              for( i in 1:nrow ) {
                      s = sprintf( "%s,GO; %s; %s;", s, df[i,3], df[i,4] );
              }
              outvec = c( outvec, s );
      }
      write.table( outvec, "myoutfile.txt", quote=F, row.names=F, col.names=F );
      (the second code snippet depends on the preamble from the first)
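
      For anyone not using R, the grouping step in the second snippet (one output line per accession, in sorted accession order) can be sketched in Python. This is only an illustration: it assumes the rows have already been fetched somewhere as (accession, gene_name, go_id, go_name) tuples, the function name is made up, and the two rows below are hand-written stand-ins built from the GO lines in the first post, not real query output.

```python
def one_line_per_accession(rows):
    """Collapse (accession, gene_name, go_id, go_name) rows to one line each."""
    grouped = {}
    for accession, gene_name, go_id, go_name in rows:
        # The first row seen for an accession supplies the leading gene name.
        fields = grouped.setdefault(accession, [gene_name])
        fields.append("GO; {}; {};".format(go_id, go_name))
    # Sorted accessions, mirroring unique( sort( ... ) ) in the R snippet.
    return [",".join(grouped[acc]) for acc in sorted(grouped)]


# Hand-written stand-in rows (assumption: not actual biomaRt output).
rows = [
    ("P41932", "par-5", "GO:0005938", "cell cortex"),
    ("P41932", "par-5", "GO:0005634", "nucleus"),
]
lines = one_line_per_accession(rows)
print("\n".join(lines))
```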

      EDIT: I realize I did not answer your question, but this will get the job done without any need for downloading embl files.
      Last edited by jiaco; 11-30-2012, 06:05 AM.
