SEQanswers > Bioinformatics
Old 02-21-2014, 12:33 PM   #1
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Script for breaking large .fa files into smaller files of [N] sequences

I'm working on my first WGS assembly submission to NCBI. It's proving to be harder than the assembly itself. NCBI requires assemblies containing more than 20k contigs to be split into chunks of 10k sequences. One of the assemblies I'm submitting has over 800k contigs.

Anyone out there have a script that could handle this? Thanks in advance.
Old 02-21-2014, 01:32 PM   #2
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

So you essentially have a FASTA file with >800k unique sequences?

You can just use the cat or split commands in Linux.
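One caveat with plain `split`: it divides by line count, so it can cut a multi-line FASTA record in half. A record-aware alternative is an awk one-liner that starts a new file at every Nth header line. This is a sketch, not from the thread: the input name `input.fa`, the output pattern `chunk%03d.fa`, and the small chunk size are illustrative (for the NCBI case you would set `n=10000`).

```shell
# Hypothetical example: make a tiny 3-record FASTA, then split it
# into chunks of n records without breaking records apart.
printf '>s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n' > input.fa

# Open a new output file every n header lines (lines starting with ">").
awk -v n=2 '
    /^>/ { if (count++ % n == 0) { if (out) close(out); out = sprintf("chunk%03d.fa", ++file) } }
    { print > out }
' input.fa
```

With `n=2` and three records this produces `chunk001.fa` (2 records) and `chunk002.fa` (1 record); closing the previous file before opening the next keeps awk from running out of file descriptors on very large inputs.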
Old 02-21-2014, 02:13 PM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

http://stackoverflow.com/questions/1...-of-first-line

http://www.biostars.org/p/13270/

There are multiple solutions in the links. You may need to root around to find one that does what you want.

Last edited by GenoMax; 02-21-2014 at 02:16 PM.
Old 02-21-2014, 04:49 PM   #4
rnaeye
Member
 
Location: East Cost

Join Date: May 2011
Posts: 79

Hi!
You can use the script at http://stackoverflow.com/questions/1...nto-many-files

I modified the script a little and pasted it below. If you want bigger chunks, just change the "my $record_per_file = 4" value to whatever chunk size you like.

Code:
#!/usr/bin/perl
use warnings;
use 5.12.4;
use File::Basename;
# Split a FASTA file into chunks of a user-defined number of records.
# by RNAeye

my $file            = "input.fa";   # name of your FASTA file
my $record_per_file = 4;            # how many records you want per file (chunk size)
my $file_number     = 1;            # becomes part of each new file name
my $counter         = 0;            # counts the records seen so far
my $out;                            # current output filehandle

open( my $fasta, "<", $file ) or die "Cannot open file $file: $!";

while (<$fasta>) {
    if (/^>/) {
        # Start a new output file every $record_per_file records.
        if ( $counter++ % $record_per_file == 0 ) {
            close($out) if defined $out;
            my $new_file_name = basename($file) . $file_number++ . ".fa";
            open( $out, ">", $new_file_name )
                or die "Cannot open file $new_file_name: $!";
        }
    }
    print $out $_;
}
close($out) if defined $out;    # close the last chunk
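Before uploading, it's worth confirming that no chunk exceeds the 10k limit. A hedged one-liner sketch: count headers per chunk and flag any file over the threshold. The sample file names here (`input.fa1.fa`, `input.fa2.fa`) only mirror what the script above would emit; substitute your real chunk glob.

```shell
# Create two tiny sample chunks so the check below is runnable;
# in practice you would skip this and point the glob at real output.
printf '>a\nAC\n>b\nGT\n' > input.fa1.fa
printf '>c\nTT\n' > input.fa2.fa

# Count FASTA headers per file; exit non-zero if any exceeds 10000.
grep -c '^>' input.fa*.fa | awk -F: '$2 > 10000 { print $1, "exceeds 10000 records"; bad = 1 } END { exit bad }'
```

`grep -c` prints `filename:count` when given multiple files, which awk then splits on the colon.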