Hello all, I want to convert an EMBL-like file to FASTA. I cannot use BioPerl because the file is not in complete EMBL format, so please suggest how I can get this done.
The output should be in FASTA format, built from the lines starting with ID, PT and PA plus the sequence. The two slashes ("//") are the dividing line between two EMBL entries.
I hope I am making sense.
Code:
ID 013789-0068 PS TBD OO huringiensis OS ringiensis OX SI 68 RA RL 2010. OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD FT source 1..1176 MT AC 67106 SV CT PN 013789 PT PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM PA AMA UNIVERSITY,JAPAN LAMB CO LTD. PI HAYAKAWA TORU (JP) SAKAI, HIROSHI, HAYAKAWA, TORU P8 P4 10013789 P5 0 PC International Classification: \nUS Classification: \nEuropean Classification: C12N15/62; C07K14/47A25 PR 80199166; PE 199166 AN 09JP63603 KC 1 P1 ng the DNA into a host bacterium to transform the host bacterium; and (c) causing the expression of the fusion protein in the transformed host bacterium.; The method may further comprise a step of removing the peptide chain (B) from the fusion protein. \n \n P7 P9 112 PO PM 10013789; PB 10013789 PQ 10013789; EM esentative W1 PRT D1 0204 D2 0217 D3 0730 D4 0801 D5 0204 HL [L[P9_GQ;0;3,WO2010013789,45,67]] [L[PM_PN_GQNUC;0;12,WO2010013789]] [L[PQ_PN_GQNUC;0;12,WO2010013789]] CC mer C1-1-f FH Key Location/Qualifiers Copyright (c)Inc. 2011 LS Application L2 Publ. Of int. appl. 
w4 MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT DYHIDQVSNLVECLSDEFCLDEKQELSEKVKHAKRLSDERNLLQDPNFRGINRQLDRGWRGSTDITIQGGDDVFKENYVT LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR // ID 0223489-0068 PS TBD OO huringiensis OS ringiensis OX SI 68 RA RL 2010. OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD FT source 1..1176 MT AC 67106 SV CT PN 013789 PT PRN METHOD, FUSION PROTEIN, AND ANTISERUM PA AMERSITY,JAMB CO LTD. PI HAYAKAWA TORU (JP) SAKAI, HIROSHI, HAYAKAWA, TORU P8 P4 10013789 P5 0 PC International Classification: \nUS Classification: \nEuropean Classification: C12N15/62; C07K14/47A25 PR 80199166; PE 199166 AN 09JP63603 KC 1 P1 ng the DNA into a host bacterium to transform the host bacterium; and (c) causing the expression of the fusion protein in the transformed host bacterium.; The method may further comprise a step of removing the peptide chain (B) from the fusion protein. \n \n P7 P9 112 PO PM 10013789; PB 10013789 PQ 10013789; EM esentative W1 PRT D1 0204 D2 0217 D3 0730 D4 0801 D5 0204 HL [L[P9_GQ;0;3,WO2010013789,45,67]] [L[PM_PN_GQNUC;0;12,WO2010013789]] [L[PQ_PN_GQNUC;0;12,WO2010013789]] CC mer C1-1-f FH Key Location/Qualifiers Copyright (c)Inc. 2011 LS Application L2 Publ. Of int. appl. 
w4 VLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR
The expected output for the sample above, with ID, PT and PA in the header followed by the sequence:
Code:
>013789-0068 ; PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM PA ; AMA UNIVERSITY,JAPAN LAMB CO LTD. MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT DYHIDQVSNLVECLSDEFCLDEKQELSEKVKHAKRLSDERNLLQDPNFRGINRQLDRGWRGSTDITIQGGDDVFKENYVT LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR >0223489-0068 ; PRN METHOD, FUSION PROTEIN, AND ANTISERUM PA ; AMERSITY,JAMB CO LTD. VLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIMNSALTTAIPLLAVQREEMRIQLE EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR
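One possible approach, as a minimal sketch rather than a definitive solution: read the file line by line, keep only the ID, PT and PA fields, and collect everything after the sequence tag until the "//" separator. This assumes that in the real file every tag starts its own line (the sample above appears flattened onto one line), that the sequence begins on the line tagged `w4` and continues on the following lines, and that the header fields are joined with " ; " as in the expected output. The names `parse_records`, `to_fasta` and the `seq_tag` parameter are made up for illustration.

```python
import io


def parse_records(handle, seq_tag="w4"):
    """Yield (fields, sequence) for each record of an EMBL-like stream.

    Assumes one tag per line, the sequence starting on the `seq_tag`
    line, and "//" on its own line between records.
    """
    fields, seq, in_seq = {}, [], False
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # End of record: emit what was collected and reset.
            if fields or seq:
                yield fields, "".join(seq)
            fields, seq, in_seq = {}, [], False
        elif in_seq:
            # Every line after the sequence tag is sequence data.
            seq.append(line.replace(" ", ""))
        else:
            tag, _, value = line.partition(" ")
            if tag == seq_tag:
                in_seq = True
                seq.append(value.replace(" ", ""))
            elif tag in ("ID", "PT", "PA"):
                # Keep the first occurrence of each wanted field.
                fields.setdefault(tag, value.strip())
    # The last record may not be followed by "//".
    if fields or seq:
        yield fields, "".join(seq)


def to_fasta(fields, seq, width=80):
    """Format one record as a FASTA entry with wrapped sequence lines."""
    header = ">" + " ; ".join(fields.get(t, "") for t in ("ID", "PT", "PA"))
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + chunks)


# Shortened, made-up sample in the assumed one-tag-per-line layout.
SAMPLE = """ID 013789-0068
PS TBD
PT PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM
PA AMA UNIVERSITY,JAPAN LAMB CO LTD.
w4 MDNNPNINEC
IPYNCLSNPE
//
ID 0223489-0068
PT PRN METHOD, FUSION PROTEIN, AND ANTISERUM
PA AMERSITY,JAMB CO LTD.
w4 VLGGERIETG
"""

for record_fields, record_seq in parse_records(io.StringIO(SAMPLE)):
    print(to_fasta(record_fields, record_seq))
```

For a real file, the `io.StringIO(SAMPLE)` part would be replaced by an opened file handle. If the file really is flattened onto single lines as pasted above, the line-based logic will not work as is and the text would first have to be split on the tags.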