SEQanswers

03-19-2012, 11:53 AM   #1
joscarhuguet
Member
 
Location: USA

Join Date: Feb 2010
Posts: 18
splitting big genbank file

I have a big GenBank (.gbk) file containing multiple records. Is there a simple way to split it into smaller .gbk files? Thanks.
03-19-2012, 12:24 PM   #2
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356

I haven't tried it out, but 'seqretsplit' from the EMBOSS package might do what you want. Otherwise it's a quick script in BioPerl or Biopython, e.g. in Biopython (untested):

Run it like this: python splitgbk.py < input.gbk

It will create one file per entry in the current directory, named after the record ID.

-- splitgbk.py

Code:
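from Bio import SeqIO
import sys

# Read GenBank records from standard input, one at a time
for rec in SeqIO.parse(sys.stdin, "genbank"):
    # Passing a filename lets SeqIO.write open and close each output file
    SeqIO.write(rec, rec.id + ".gbk", "genbank")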
03-19-2012, 12:27 PM   #3
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

If you want one file per record, try EMBOSS seqret and the -ossingle_outseq option.
http://emboss.open-bio.org/wiki/Appdoc:Seqret
EDIT: That probably does the same as EMBOSS seqretsplit suggested by Nick while I was writing this.
http://emboss.open-bio.org/wiki/Appdoc:Seqretsplit

Do you just want to break it up into batches, say 10 records in each file? Or do you have a particular order in mind (which could involve either sorting or random access)?

Last edited by maubp; 03-19-2012 at 12:28 PM. Reason: Nick posted at same time
03-19-2012, 12:31 PM   #4
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356

Come on Peter, I caught you napping again
03-19-2012, 01:02 PM   #5
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
split genbank files using awk

awk -v n=1 '{print > ("out" n)} /^\/\//{close("out" n); n++}' yourfilename.gbk

Splits yourfilename.gbk into multiple files (out1, out2, ...) at each "//" (end of record) line, keeping that line as the last line of its record.

Last edited by Richard Finney; 03-20-2012 at 10:51 AM.
05-17-2013, 07:04 AM   #6
thmourikis
Junior Member
 
Location: Norwich, UK

Join Date: May 2013
Posts: 8

Hi all,

I have the same problem but I want to split the file every 1000 entries. My file has 500,000 records and I want 500 files of 1000 records each. Any suggestions?

Thanks in advance.
Thanos
05-17-2013, 07:19 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

Thanos - which scripting languages do you know? GenBank records end with a // line (which is what Richard's awk command exploits) so it is very simple to split up a file into sub-files named however you like using Perl, Python or Ruby.
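
For anyone who prefers Python, the "//" trick can be sketched without Biopython by treating the file as plain text. This is only a minimal sketch: the names batch_records and split_file and the batch_0001.gbk style of output naming are made up for illustration, and it assumes every record ends with a line starting with "//".

```python
from typing import Iterable, Iterator, List


def batch_records(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of lines, each holding up to batch_size complete records."""
    batch: List[str] = []
    count = 0
    for line in lines:
        batch.append(line)
        if line.startswith("//"):  # GenBank end-of-record marker
            count += 1
            if count == batch_size:
                yield batch
                batch, count = [], 0
    # Emit any leftover records, but skip a tail that is only blank lines.
    if any(text.strip() for text in batch):
        yield batch


def split_file(path: str, batch_size: int = 1000, prefix: str = "batch_") -> int:
    """Write batch_0001.gbk, batch_0002.gbk, ... and return the file count."""
    n = 0
    with open(path) as handle:
        for n, batch in enumerate(batch_records(handle, batch_size), start=1):
            with open(f"{prefix}{n:04d}.gbk", "w") as out:
                out.writelines(batch)
    return n
```

Calling split_file("yourfilename.gbk", batch_size=1000) should give 500 files for a 500,000 record input, and batch_size=1 gives one file per record. The file is read line by line, so it should not need to fit in memory.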
05-17-2013, 07:29 AM   #8
thmourikis
Junior Member
 
Location: Norwich, UK

Join Date: May 2013
Posts: 8

Hi Peter and thank you for your immediate reply.

I currently use Perl (not very experienced though). I guess I can try to alter Richard's awk command and implement it in a Perl script for renaming etc.

Thank you once again.
05-20-2013, 07:36 AM   #9
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

awk -v n=1 -v p=0 '{print > ("out" n)} /^\/\//{p++; if (p % 1000 == 0) {close("out" n); n++}}' yourfilename.gbk

Starts a new output file after every 1000 records, keeping the "//" line at the end of each record.
05-20-2013, 07:39 AM   #10
thmourikis
Junior Member
 
Location: Norwich, UK

Join Date: May 2013
Posts: 8

Thanks a lot Richard! I really appreciate that!

Best,
Thanos