Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help:scaffold 2 contigs!

    Hi
    I got a scaffold sequence file from SOAPdenovo like:
    >scaffold1 13.6
    AATCCGCCGAGATGGGCAGATCCCTTGAGCTCANNNNNNNAATCCGCCGAGATGGGCA
    .
    .


    I need a perl script to remove "N"s and split the scaffold into contigs like:

    >scaffold1 13.6 contig1
    AATCCGCCGAGATGGGCAGATCCCTTGAGCTCA
    >scaffold1 13.6 contig2
    AATCCGCCGAGATGGGCA
    .
    .


    I'm new in perl programming. Could anyone help me with that?

    Thanks!

  • #2
    Try this run in your terminal:

    perl -e 'my ($name,$seq)=("","");while(<>){chomp;if ((/\>/)||(eof)){if (($seq ne "")||(eof)){$seq.=$_ if (eof);my @Seq=split /N+/,$seq;if (@Seq>1){for my $i(1..@Seq){print "$name contig$i\n";print $Seq[$i-1],"\n";}}else{print "$name\n$seq\n";}}$name=$_;$seq="";}else{chomp;$seq.=$_;}}' your_seq.fasta

    Comment


    • #3
      I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

      Code:
      #!/usr/bin/perl
      use strict;
      
      my $file = $ARGV[0];
      open(IN,$file) || die "Incorrect file $file. Exiting...\n";
      
      my ($seq, $name)=('','');
      while(<IN>){
        chomp;
        my $line = $_;
        $seq.= uc($line) if(eof(IN));
        if (/\>(\S+)/ || eof(IN)){
          if($seq ne ''){
            my @seqgaps = split(/[N]{1,}/, $seq);
            if($#seqgaps > 0){
              my $ctgcount=0;
              foreach my $ctgseq (@seqgaps){
                $ctgcount++;
                print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
              }
            }else{
              print ">$name\n$seq\n";
            }
          }
          $seq='';
          $name = $_;
        }else{
          $seq.=uc($line);
        }
      }
      Regards,
      Boetsie
      Last edited by boetsie; 07-27-2011, 06:59 AM.

      Comment


      • #4
        There is a potential problem with the output you requested and produced by the two examples given above. The sequence identifiers are not unique; every sub-contig within scaffold1 will be named "scaffold1" in the resulting FASTA file. Remember that only the first "word" of the definition line is considered the sequence ID and this should be unique for every entry in your file. Adjust the example scripts so that each contig gets a unique name like:

        Code:
        >scaffold1.1
        .......
        >scaffold1.2
        .......
        >scaffold2.1
        .......
        
        etc.

        Comment


        • #5
          Originally posted by boetsie View Post
          I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

          Code:
          #!/usr/bin/perl
          use strict;
          
          my $file = $ARGV[0];
          open(IN,$file) || die "Incorrect file $file. Exiting...\n";
          
          my ($seq, $name)=('','');
          while(<IN>){
            chomp;
            my $line = $_;
            $seq.= uc($line) if(eof(IN));
            if (/\>(\S+)/ || eof(IN)){
              if($seq ne ''){
                my @seqgaps = split(/[N]{1,}/, $seq);
                if($#seqgaps > 0){
                  my $ctgcount=0;
                  foreach my $ctgseq (@seqgaps){
                    $ctgcount++;
                    print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
                  }
                }else{
                  print ">$name\n$seq\n";
                }
              }
              $seq='';
              $name = $_;
            }else{
              $seq.=uc($line);
            }
          }
          Regards,
          Boetsie
          Hi boetsie, you would find out you missed the last sequence.

          Comment


          • #6
            Originally posted by BENM View Post
            Hi boetsie, you would find out you missed the last sequence.
            And why is that? i don't think so...

            Comment


            • #7
              Originally posted by boetsie View Post
              And why is that? i don't think so...
              Yes, you're right. Sorry, I didn't pay attention to your code.

              Comment


              • #8
                You guys are great!
                Thanks!

                Comment


                • #9
                  Here's the script I have used for breaking up 454 assemblies for de-novo sequencing (I didn't like the 'replace N with A' thing that was being done). It splits by sequences of N and appends /<num> to the end of split contigs. You can give it an optional -min parameter to make sure the resulting sequences have at least that length:

                  Code:
                  #!/usr/bin/perl
                  
                  # contiguous_fasta.pl -- splits fasta-formatted files into contiguous
                  # sequences of non-ambiguous bases
                  
                  # Author: David Eccles (gringer) 2011 <[email protected]>
                  
                  use warnings;
                  use strict;
                  
                  my $id = "";
                  
                  my $lineCount = 0;
                  my $goodCount = 0;
                  
                  my $allSequence = "";
                  my @sequences = ();
                  
                  my $minLength = 0;
                  if(@ARGV && ($ARGV[0] eq '-min')){
                      shift(@ARGV);
                      $minLength = shift(@ARGV);
                  }
                  
                  print(STDERR "Splitting sequences into contiguous non-ambiguous sequences...");
                  
                  while(<>){
                      if(++$lineCount % 10 ** 6 == 0){
                          print(STDERR " ");
                      }
                      chomp;
                      if(substr($_,0,1) eq ">"){
                          my $newID = $_;
                          if($allSequence){
                              @sequences = split(/N+/, $allSequence);
                              $allSequence = "";
                              my $seqNum = 0;
                              foreach my $sequence (@sequences){
                                  if(++$goodCount % 10 ** 6 == 0){
                                      print(STDERR ".");
                                  }
                                  if(!$minLength || (length($sequence) > $minLength)){
                                      print("$id/".$seqNum++."\n$sequence\n");
                                  }
                              }
                          }
                          $id = $newID;
                      } else {
                          $allSequence .= $_;
                      }
                  }
                  
                  if($allSequence){
                      @sequences = split(/N+/, $allSequence);
                      my $seqNum = 0;
                      foreach my $sequence (@sequences){
                          if($minLength && (length($sequence) > $minLength)){
                              print("$id/".$seqNum++."\n$sequence\n");
                          }
                      }
                  }
                  
                  print(STDERR " done\n");
                  Last edited by gringer; 07-28-2011, 12:12 AM. Reason: correcting a logic bug

                  Comment


                  • #10
                    Here's an alternate version that does away with the duplicated code and progress output (but also doesn't handle a minimum length -- you could pipe through fastx_clipper for that if necessary):

                    Code:
                    #!/usr/bin/perl
                    # contiguous_fasta.pl -- splits fasta-formatted files into contiguous
                    # sequences of non-ambiguous bases
                    # Author: David Eccles (gringer) 2011 <[email protected]>
                    
                    use warnings;
                    use strict;
                    
                    my $id = "";
                    my $first = 1; # true
                    my @sequences = ();
                    my $seqNum = 0;
                    
                    while(<>){
                        chomp;
                        if(substr($_,0,1) eq ">"){
                            $id = $_;
                            $seqNum = 0;
                            print((($first)?"":"\n").$_."/".$seqNum++."\n");
                            $first = 0; # false
                        } else {
                            @sequences = split(/N+/, $_);
                            print(shift(@sequences)) unless !@sequences; # whole line could be N
                            foreach my $sequence (@sequences){
                                print("\n$id/".$seqNum++."\n$sequence");
                            }
                        }
                    }
                    print("\n");

                    Comment


                    • #11
                      If you want to split only on gaps of say larger than 10 N's, you can change the line;

                      my @seqgaps = split(/[N]{1,}/, $seq);
                      to
                      my @seqgaps = split(/[N]{10,}/, $seq);

                      Comment


                      • #12
                        Thanks !
                        They really helped!

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        51 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        67 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X