SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
reads mapping across scaffold boundaries lukas1848 Bioinformatics 6 05-30-2016 11:13 PM
How To: Contig to Scaffold? fahmida Bioinformatics 3 08-22-2013 04:41 AM
scaffold and contig chengeng Bioinformatics 5 07-04-2013 06:15 AM
Calculating the number of contigs in a scaffold file avtsanger Bioinformatics 5 01-06-2011 12:43 AM
Scaffold using Phusion assembler e-summer-3 Bioinformatics 0 12-05-2010 08:59 PM

Reply
 
Thread Tools
Old 07-27-2011, 02:16 AM   #1
rogerholmes.novogene
Junior Member
 
Location: Beijing, China

Join Date: Jul 2011
Posts: 3
Default Help:scaffold 2 contigs!

Hi
I got a scaffold sequence file from SOAPdenovo like:
>scaffold1 13.6
AATCCGCCGAGATGGGCAGATCCCTTGAGCTCANNNNNNNAATCCGCCGAGATGGGCA
.
.


I need a perl script to remove "N"s and split the scaffold into contigs like:

>scaffold1 13.6 contig1
AATCCGCCGAGATGGGCAGATCCCTTGAGCTCA
>scaffold1 13.6 contig2
AATCCGCCGAGATGGGCA
.
.


I'm new in perl programming. Could anyone help me with that?

Thanks!
rogerholmes.novogene is offline   Reply With Quote
Old 07-27-2011, 06:32 AM   #2
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

Try this run in your terminal:

perl -e 'my ($name,$seq)=("","");while(<>){chomp;if ((/\>/)||(eof)){if (($seq ne "")||(eof)){$seq.=$_ if (eof);my @Seq=split /N+/,$seq;if (@Seq>1){for my $i(1..@Seq){print "$name contig$i\n";print $Seq[$i-1],"\n";}}else{print "$name\n$seq\n";}}$name=$_;$seq="";}else{chomp;$seq.=$_;}}' your_seq.fasta
BENM is offline   Reply With Quote
Old 07-27-2011, 06:33 AM   #3
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

Code:
#!/usr/bin/perl
use strict;

my $file = $ARGV[0];
open(IN,$file) || die "Incorrect file $file. Exiting...\n";

my ($seq, $name)=('','');
while(<IN>){
  chomp;
  my $line = $_;
  $seq.= uc($line) if(eof(IN));
  if (/\>(\S+)/ || eof(IN)){
    if($seq ne ''){
      my @seqgaps = split(/[N]{1,}/, $seq);
      if($#seqgaps > 0){
        my $ctgcount=0;
        foreach my $ctgseq (@seqgaps){
          $ctgcount++;
          print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
        }
      }else{
        print ">$name\n$seq\n";
      }
    }
    $seq='';
    $name = $_;
  }else{
    $seq.=uc($line);
  }
}
Regards,
Boetsie

Last edited by boetsie; 07-27-2011 at 06:59 AM.
boetsie is offline   Reply With Quote
Old 07-27-2011, 07:10 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

There is a potential problem with the output you requested and produced by the two examples given above. The sequence identifiers are not unique; every sub-contig within scaffold1 will be named "scaffold1" in the resulting FASTA file. Remember that only the first "word" of the definition line is considered the sequence ID and this should be unique for every entry in your file. Adjust the example scripts so that each contig gets a unique name like:

Code:
>scaffold1.1
.......
>scaffold1.2
.......
>scaffold2.1
.......

etc.
kmcarr is offline   Reply With Quote
Old 07-27-2011, 09:14 AM   #5
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

Quote:
Originally Posted by boetsie View Post
I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

Code:
#!/usr/bin/perl
use strict;

my $file = $ARGV[0];
open(IN,$file) || die "Incorrect file $file. Exiting...\n";

my ($seq, $name)=('','');
while(<IN>){
  chomp;
  my $line = $_;
  $seq.= uc($line) if(eof(IN));
  if (/\>(\S+)/ || eof(IN)){
    if($seq ne ''){
      my @seqgaps = split(/[N]{1,}/, $seq);
      if($#seqgaps > 0){
        my $ctgcount=0;
        foreach my $ctgseq (@seqgaps){
          $ctgcount++;
          print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
        }
      }else{
        print ">$name\n$seq\n";
      }
    }
    $seq='';
    $name = $_;
  }else{
    $seq.=uc($line);
  }
}
Regards,
Boetsie
Hi boetsie, you would find out you missed the last sequence.
BENM is offline   Reply With Quote
Old 07-27-2011, 10:29 AM   #6
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by BENM View Post
Hi boetsie, you would find out you missed the last sequence.
And why is that? i don't think so...
boetsie is offline   Reply With Quote
Old 07-27-2011, 11:04 AM   #7
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

Quote:
Originally Posted by boetsie View Post
And why is that? i don't think so...
Yes, you're right. Sorry, I didn't pay attention to your code.
BENM is offline   Reply With Quote
Old 07-27-2011, 05:06 PM   #8
rogerholmes.novogene
Junior Member
 
Location: Beijing, China

Join Date: Jul 2011
Posts: 3
Default

You guys are great!
Thanks!
rogerholmes.novogene is offline   Reply With Quote
Old 07-27-2011, 11:46 PM   #9
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Here's the script I have used for breaking up 454 assemblies for de-novo sequencing (I didn't like the 'replace N with A' thing that was being done). It splits by sequences of N and appends /<num> to the end of split contigs. You can give it an optional -min parameter to make sure the resulting sequences have at least that length:

Code:
#!/usr/bin/perl

# contiguous_fasta.pl -- splits fasta-formatted files into contiguous
# sequences of non-ambiguous bases

# Author: David Eccles (gringer) 2011 <david.eccles@mpi-muenster.mpg.de>

use warnings;
use strict;

my $id = "";

my $lineCount = 0;
my $goodCount = 0;

my $allSequence = "";
my @sequences = ();

my $minLength = 0;
if(@ARGV && ($ARGV[0] eq '-min')){
    shift(@ARGV);
    $minLength = shift(@ARGV);
}

print(STDERR "Splitting sequences into contiguous non-ambiguous sequences...");

while(<>){
    if(++$lineCount % 10 ** 6 == 0){
        print(STDERR " ");
    }
    chomp;
    if(substr($_,0,1) eq ">"){
        my $newID = $_;
        if($allSequence){
            @sequences = split(/N+/, $allSequence);
            $allSequence = "";
            my $seqNum = 0;
            foreach my $sequence (@sequences){
                if(++$goodCount % 10 ** 6 == 0){
                    print(STDERR ".");
                }
                if(!$minLength || (length($sequence) > $minLength)){
                    print("$id/".$seqNum++."\n$sequence\n");
                }
            }
        }
        $id = $newID;
    } else {
        $allSequence .= $_;
    }
}

if($allSequence){
    @sequences = split(/N+/, $allSequence);
    my $seqNum = 0;
    foreach my $sequence (@sequences){
        if($minLength && (length($sequence) > $minLength)){
            print("$id/".$seqNum++."\n$sequence\n");
        }
    }
}

print(STDERR " done\n");

Last edited by gringer; 07-28-2011 at 12:12 AM. Reason: correcting a logic bug
gringer is offline   Reply With Quote
Old 07-28-2011, 12:20 AM   #10
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Here's an alternate version that does away with the duplicated code and progress output (but also doesn't handle a minimum length -- you could pipe through fastx_clipper for that if necessary):

Code:
#!/usr/bin/perl
# contiguous_fasta.pl -- splits fasta-formatted files into contiguous
# sequences of non-ambiguous bases
# Author: David Eccles (gringer) 2011 <david.eccles@mpi-muenster.mpg.de>

use warnings;
use strict;

my $id = "";
my $first = 1; # true
my @sequences = ();
my $seqNum = 0;

while(<>){
    chomp;
    if(substr($_,0,1) eq ">"){
        $id = $_;
        $seqNum = 0;
        print((($first)?"":"\n").$_."/".$seqNum++."\n");
        $first = 0; # false
    } else {
        @sequences = split(/N+/, $_);
        print(shift(@sequences)) unless !@sequences; # whole line could be N
        foreach my $sequence (@sequences){
            print("\n$id/".$seqNum++."\n$sequence");
        }
    }
}
print("\n");
gringer is offline   Reply With Quote
Old 07-28-2011, 12:26 AM   #11
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

If you want to split only on gaps of say larger than 10 N's, you can change the line;

my @seqgaps = split(/[N]{1,}/, $seq);
to
my @seqgaps = split(/[N]{10,}/, $seq);
boetsie is offline   Reply With Quote
Old 07-28-2011, 10:05 PM   #12
rogerholmes.novogene
Junior Member
 
Location: Beijing, China

Join Date: Jul 2011
Posts: 3
Default

Thanks !
They really helped!
rogerholmes.novogene is offline   Reply With Quote
Reply

Tags
perl

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:12 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO