SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Help:scaffold 2 contigs! (http://seqanswers.com/forums/showthread.php?t=12993)

rogerholmes.novogene 07-27-2011 02:16 AM

Help:scaffold 2 contigs!
 
Hi:)
I got a scaffold sequence file from SOAPdenovo like:
>scaffold1 13.6
AATCCGCCGAGATGGGCAGATCCCTTGAGCTCANNNNNNNAATCCGCCGAGATGGGCA
.
.


I need a perl script to remove "N"s and split the scaffold into contigs like:

>scaffold1 13.6 contig1
AATCCGCCGAGATGGGCAGATCCCTTGAGCTCA
>scaffold1 13.6 contig2
AATCCGCCGAGATGGGCA
.
.


I'm new in perl programming. Could anyone help me with that?

Thanks!

BENM 07-27-2011 06:32 AM

Try this run in your terminal:

perl -e 'my ($name,$seq)=("","");while(<>){chomp;if ((/\>/)||(eof)){if (($seq ne "")||(eof)){$seq.=$_ if (eof);my @Seq=split /N+/,$seq;if (@Seq>1){for my $i(1..@Seq){print "$name contig$i\n";print $Seq[$i-1],"\n";}}else{print "$name\n$seq\n";}}$name=$_;$seq="";}else{chomp;$seq.=$_;}}' your_seq.fasta

boetsie 07-27-2011 06:33 AM

I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

Code:

#!/usr/bin/perl
use strict;

my $file = $ARGV[0];
open(IN,$file) || die "Incorrect file $file. Exiting...\n";

my ($seq, $name)=('','');
while(<IN>){
  chomp;
  my $line = $_;
  $seq.= uc($line) if(eof(IN));
  if (/\>(\S+)/ || eof(IN)){
    if($seq ne ''){
      my @seqgaps = split(/[N]{1,}/, $seq);
      if($#seqgaps > 0){
        my $ctgcount=0;
        foreach my $ctgseq (@seqgaps){
          $ctgcount++;
          print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
        }
      }else{
        print ">$name\n$seq\n";
      }
    }
    $seq='';
    $name = $_;
  }else{
    $seq.=uc($line);
  }
}

Regards,
Boetsie

kmcarr 07-27-2011 07:10 AM

There is a potential problem with the output you requested and produced by the two examples given above. The sequence identifiers are not unique; every sub-contig within scaffold1 will be named "scaffold1" in the resulting FASTA file. Remember that only the first "word" of the definition line is considered the sequence ID and this should be unique for every entry in your file. Adjust the example scripts so that each contig gets a unique name like:

Code:

>scaffold1.1
.......
>scaffold1.2
.......
>scaffold2.1
.......

etc.


BENM 07-27-2011 09:14 AM

Quote:

Originally Posted by boetsie (Post 47452)
I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

Code:

#!/usr/bin/perl
use strict;

my $file = $ARGV[0];
open(IN,$file) || die "Incorrect file $file. Exiting...\n";

my ($seq, $name)=('','');
while(<IN>){
  chomp;
  my $line = $_;
  $seq.= uc($line) if(eof(IN));
  if (/\>(\S+)/ || eof(IN)){
    if($seq ne ''){
      my @seqgaps = split(/[N]{1,}/, $seq);
      if($#seqgaps > 0){
        my $ctgcount=0;
        foreach my $ctgseq (@seqgaps){
          $ctgcount++;
          print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
        }
      }else{
        print ">$name\n$seq\n";
      }
    }
    $seq='';
    $name = $_;
  }else{
    $seq.=uc($line);
  }
}

Regards,
Boetsie

Hi boetsie, you would find out you missed the last sequence.

boetsie 07-27-2011 10:29 AM

Quote:

Originally Posted by BENM (Post 47478)
Hi boetsie, you would find out you missed the last sequence.

And why is that? i don't think so...

BENM 07-27-2011 11:04 AM

Quote:

Originally Posted by boetsie (Post 47482)
And why is that? i don't think so...

Yes, you're right. Sorry, I didn't pay attention to your code.

rogerholmes.novogene 07-27-2011 05:06 PM

You guys are great!
Thanks!

gringer 07-27-2011 11:46 PM

Here's the script I have used for breaking up 454 assemblies for de-novo sequencing (I didn't like the 'replace N with A' thing that was being done). It splits by sequences of N and appends /<num> to the end of split contigs. You can give it an optional -min parameter to make sure the resulting sequences have at least that length:

Code:

#!/usr/bin/perl

# contiguous_fasta.pl -- splits fasta-formatted files into contiguous
# sequences of non-ambiguous bases

# Author: David Eccles (gringer) 2011 <david.eccles@mpi-muenster.mpg.de>

use warnings;
use strict;

my $id = "";

my $lineCount = 0;
my $goodCount = 0;

my $allSequence = "";
my @sequences = ();

my $minLength = 0;
if(@ARGV && ($ARGV[0] eq '-min')){
    shift(@ARGV);
    $minLength = shift(@ARGV);
}

print(STDERR "Splitting sequences into contiguous non-ambiguous sequences...");

while(<>){
    if(++$lineCount % 10 ** 6 == 0){
        print(STDERR " ");
    }
    chomp;
    if(substr($_,0,1) eq ">"){
        my $newID = $_;
        if($allSequence){
            @sequences = split(/N+/, $allSequence);
            $allSequence = "";
            my $seqNum = 0;
            foreach my $sequence (@sequences){
                if(++$goodCount % 10 ** 6 == 0){
                    print(STDERR ".");
                }
                if(!$minLength || (length($sequence) > $minLength)){
                    print("$id/".$seqNum++."\n$sequence\n");
                }
            }
        }
        $id = $newID;
    } else {
        $allSequence .= $_;
    }
}

if($allSequence){
    @sequences = split(/N+/, $allSequence);
    my $seqNum = 0;
    foreach my $sequence (@sequences){
        if($minLength && (length($sequence) > $minLength)){
            print("$id/".$seqNum++."\n$sequence\n");
        }
    }
}

print(STDERR " done\n");


gringer 07-28-2011 12:20 AM

Here's an alternate version that does away with the duplicated code and progress output (but also doesn't handle a minimum length -- you could pipe through fastx_clipper for that if necessary):

Code:

#!/usr/bin/perl
# contiguous_fasta.pl -- splits fasta-formatted files into contiguous
# sequences of non-ambiguous bases
# Author: David Eccles (gringer) 2011 <david.eccles@mpi-muenster.mpg.de>

use warnings;
use strict;

my $id = "";
my $first = 1; # true
my @sequences = ();
my $seqNum = 0;

while(<>){
    chomp;
    if(substr($_,0,1) eq ">"){
        $id = $_;
        $seqNum = 0;
        print((($first)?"":"\n").$_."/".$seqNum++."\n");
        $first = 0; # false
    } else {
        @sequences = split(/N+/, $_);
        print(shift(@sequences)) unless !@sequences; # whole line could be N
        foreach my $sequence (@sequences){
            print("\n$id/".$seqNum++."\n$sequence");
        }
    }
}
print("\n");


boetsie 07-28-2011 12:26 AM

If you want to split only on gaps of say larger than 10 N's, you can change the line;

my @seqgaps = split(/[N]{1,}/, $seq);
to
my @seqgaps = split(/[N]{10,}/, $seq);

rogerholmes.novogene 07-28-2011 10:05 PM

Thanks !
They really helped!


All times are GMT -8. The time now is 02:18 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.