View Single Post
Old 07-27-2011, 11:46 PM   #9
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Here's the script I have used for breaking up 454 assemblies for de-novo sequencing (I didn't like the 'replace N with A' thing that was being done). It splits by sequences of N and appends /<num> to the end of split contigs. You can give it an optional -min parameter to make sure the resulting sequences have at least that length:

Code:
#!/usr/bin/perl

# contiguous_fasta.pl -- splits fasta-formatted files into contiguous
# sequences of non-ambiguous bases

# Author: David Eccles (gringer) 2011 <david.eccles@mpi-muenster.mpg.de>

use warnings;
use strict;

my $id = "";

my $lineCount = 0;
my $goodCount = 0;

my $allSequence = "";
my @sequences = ();

my $minLength = 0;
if(@ARGV && ($ARGV[0] eq '-min')){
    shift(@ARGV);
    $minLength = shift(@ARGV);
}

print(STDERR "Splitting sequences into contiguous non-ambiguous sequences...");

while(<>){
    if(++$lineCount % 10 ** 6 == 0){
        print(STDERR " ");
    }
    chomp;
    if(substr($_,0,1) eq ">"){
        my $newID = $_;
        if($allSequence){
            @sequences = split(/N+/, $allSequence);
            $allSequence = "";
            my $seqNum = 0;
            foreach my $sequence (@sequences){
                if(++$goodCount % 10 ** 6 == 0){
                    print(STDERR ".");
                }
                if(!$minLength || (length($sequence) > $minLength)){
                    print("$id/".$seqNum++."\n$sequence\n");
                }
            }
        }
        $id = $newID;
    } else {
        $allSequence .= $_;
    }
}

if($allSequence){
    @sequences = split(/N+/, $allSequence);
    my $seqNum = 0;
    foreach my $sequence (@sequences){
        if($minLength && (length($sequence) > $minLength)){
            print("$id/".$seqNum++."\n$sequence\n");
        }
    }
}

print(STDERR " done\n");

Last edited by gringer; 07-28-2011 at 12:12 AM. Reason: correcting a logic bug
gringer is offline   Reply With Quote