02-21-2014, 12:33 PM   #1
Location: DE

Join Date: Dec 2012
Posts: 65
Script for breaking large .fa files into smaller files of [N] sequences

I'm working on my first WGS assembly submission to NCBI. It's proving to be harder than the assembly itself. NCBI requires assemblies containing more than 20k contigs to be split into chunks of 10k. One of the assemblies I'm submitting has over 800k contigs.

Anyone out there have a script that could handle this? Thanks in advance.
lac302
02-21-2014, 01:32 PM   #2
Senior Member
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

So you essentially have a fasta file with >800k unique sequences?

You can just use the cat or split commands in Linux.
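
One caveat with split: by default it cuts by line count (1000 lines per piece, changeable with -l), so on a FASTA file whose sequences wrap over several lines it can cut a record in half. A minimal record-aware sketch with awk, assuming every record starts with a ">" header line. The function name split_fasta and the PREFIX_0001.fa naming are illustrative choices, not something from this thread:

```shell
# Split a FASTA file into chunks of at most N records each.
# Usage: split_fasta INPUT.fa N PREFIX
# Writes PREFIX_0001.fa, PREFIX_0002.fa, ... (hypothetical naming scheme).
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Every n-th header starts a new output file.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%04d.fa", prefix, ++part)
            }
            count++
        }
        # Ignore anything before the first header; otherwise copy the line.
        out != "" { print > out }
    ' "$1"
}
```

For the 10k chunks asked about above, a call like split_fasta contigs.fa 10000 contigs would be the idea, with contigs.fa standing in for the real assembly file.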
JackieBadger
02-21-2014, 02:13 PM   #3
Senior Member
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

There are multiple solutions in the links. You may need to root around to find one that does what you want.

Last edited by GenoMax; 02-21-2014 at 02:16 PM.
GenoMax
02-21-2014, 04:49 PM   #4
Location: East Cost

Join Date: May 2011
Posts: 79

You can use script at

I modified the script a little bit and pasted it below. If you want bigger chunks, just change the "my $record_per_file = 4" value to any chunk size you like.

use strict;
use warnings;
use 5.12.4;
use File::Basename;
# Split a FASTA file into chunks of a user-defined number of records.
# by RNAeye

my $file = "input.fa"; 		# enter the name of your FASTA file here
my $record_per_file = 4;	# how many records you want per file (chunk size), e.g. 10000
my $file_number = 1;		# this becomes part of each new file name
my $counter = 0;			# counts the records seen so far
my $out;					# handle of the chunk currently being written

open (my $fasta, "<", $file) or die "Cannot open file $file: $!";

while (<$fasta>) {
	if (/^>/) {
		# Every $record_per_file-th header starts a new chunk.
		if ($counter++ % $record_per_file == 0) {
			close($out) if $out;
			my $basename = basename($file);
			# Output names look like input.fa1.fa, input.fa2.fa, ...
			my $new_file_name = $basename . $file_number++ . ".fa";
			open($out, ">", $new_file_name) or die "Cannot open file $new_file_name: $!";
		}
	}
	next unless $out;	# skip any lines before the first header
	print $out $_;
}
close($out) if $out;
close($fasta);
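
A quick sanity check once the chunks are written, sketched under the assumption that each record has exactly one ">" header line: count the headers in every chunk and flag any file that is over the limit. The function name check_chunks and its arguments are hypothetical and not part of the script above.

```shell
# Report any FASTA file that holds more than LIMIT records.
# Usage: check_chunks LIMIT FILE...
# Returns 0 if every file is within the limit, 1 otherwise.
check_chunks() {
    limit=$1
    shift
    rc=0
    for f in "$@"; do
        # grep -c exits non-zero when it counts 0 lines, so tolerate that.
        n=$(grep -c '^>' "$f" || true)
        if [ "$n" -gt "$limit" ]; then
            echo "$f: $n records (limit $limit)"
            rc=1
        fi
    done
    return "$rc"
}
```

With the default file name in the script, something like check_chunks 10000 input.fa*.fa should print nothing if every chunk is within the limit.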
rnaeye