  • trimming FASTA file

    Hi All
    I have a multi-FASTA file with approximately 2000 sequences of varying length that have different start and end regions. I need to trim my sequences so that they all start with “ATAGCCGGCACCCTGGT” and end with “GGCCATATGAGTGGGCC”. Any script that removes the bases upstream of ATAGCCGGCACCCTGGT and downstream of GGCCATATGAGTGGGCC would be really helpful, so that all the sequences become equal in length and have the same start and end.

    Thanks for your help

    Baika

  • #2
    Code:
    #!/usr/bin/perl
    use strict;
    use Bio::SeqIO;
    
    ## debugging/tuning left as exercise for student :-)
    
    my $reader=new Bio::SeqIO(-format=>'fasta',-file=>$ARGV[0]);
    my $writer=new Bio::SeqIO(-format=>'fasta',-file=>$ARGV[0]."trimmed");
    while (my $rec=$reader->next_seq)
    {
      ## works only on forward strand & 0 mismatches!!!!
       if ($seq->seq=~/(ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC)/i)
       {
       $rec->seq($1);  
       $writer->write_seq($rec);
      } 
      else
      {
         print STDERR "Could not find head and/or tail sequences for ",$seq->id,"\n";
       }
    }

    • #3
      Thanks Krobison for the Perl script. It gives an error:
      Global symbol "$seq" requires explicit package name at ../../Scripts/trim_fastaseq.pl line 12.
      Global symbol "$seq" requires explicit package name at ../../Scripts/trim_fastaseq.pl line 19.

      Baika
      Last edited by baika; 03-04-2013, 02:35 PM. Reason: wrong ID

      • #4
        Sorry baika; I was going to answer until I saw this line in Keith's code:

        ## debugging/tuning left as exercise for student :-)
        Keith, did you stick the bug in there on purpose?

        • #5
          No fun, guys. Baika might be (and claims to be, in the introduction section) a non-bioinformatician stuck with something outside his/her area of expertise. If you work in a genomics/bioinformatics lab, you should at least learn some Perl/Python, but for now: it seems it should be $rec, not $seq, in the line with the regexp (the scary thingy with // and lots of uppercase).

          • #6
            Originally posted by A_Morozov
            No fun, guys. Baika might be (and claims to be, in the introduction section) a non-bioinformatician stuck with something outside his/her area of expertise. If you work in a genomics/bioinformatics lab, you should at least learn some Perl/Python, but for now: it seems it should be $rec, not $seq, in the line with the regexp (the scary thingy with // and lots of uppercase).
            Oh, it was late and I was feeling a tad impish; I wasn't going to leave baika hanging long. Anyway the better solution is to change line #9, naming the first object $seq.

            Code:
            Change

            while (my $rec=$reader->next_seq)

            to

            while (my $seq=$reader->next_seq)

            • #7
              Not that it should make a difference, because it is unlikely to match in the middle of a sequence, but the match operator should be bounded to the start and the end of the sequence read. Honestly, though, I doubt it makes any difference at all.

              So this
              Code:
              if ($seq->seq=~/(ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC)/i)
                 {
                 $rec->seq($1);  
                 $writer->write_seq($rec);
                }
              Should then be
              Code:
              if ($seq->seq=~/(^ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC$)/i)
                 {
                 $rec->seq($1);  
                 $writer->write_seq($rec);
                }
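Before switching to the anchored pattern, it may be worth checking how ^ and $ behave on reads that still carry extra bases outside the two motifs, which is the situation described in the original question. The sketch below is only an illustration: the helper trim_with and the test read are made up for this example and are not part of anyone's script in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the captured region,
# or undef when the pattern does not match the sequence.
sub trim_with {
    my ($sequence, $pattern) = @_;
    return $sequence =~ $pattern ? $1 : undef;
}

my $head = 'ATAGCCGGCACCCTGGT';
my $tail = 'GGCCATATGAGTGGGCC';

my $unanchored = qr/($head.*$tail)/i;     # pattern from the original script
my $anchored   = qr/^($head.*$tail)$/i;   # pattern bounded to start and end

# Made-up read with 3 extra bases on each side of the region of interest
my $read = 'TTT' . $head . 'ACGTACGT' . $tail . 'AAA';

for my $case (['unanchored', $unanchored], ['anchored', $anchored]) {
    my ($label, $pattern) = @$case;
    my $trimmed = trim_with($read, $pattern);
    print "$label: ", (defined $trimmed ? $trimmed : 'no match'), "\n";
}
```

On this read the unanchored pattern returns the region between the motifs (motifs included), while the anchored one reports no match, because ^ and $ require the motifs to sit at the very ends of the sequence. So the anchored form seems suited to checking sequences that are already trimmed rather than to the trimming itself.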

              • #8
                finally working

                Thanks krobison for writing this script, and kmcarr for pointing out the error. After incorporating all your suggestions and help from my friend Robert, finally it is working.

                Thank you all

                baika

                Code:
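                #!/usr/bin/perl
                # Usage: trim_fasta.pl YOUR_FASTA_FILE.fasta OUT_FILE_TRIMMED.FASTA

                use strict;
                use warnings;
                use Bio::SeqIO;

                # Stop early with the usage line if either file name is missing
                die "Usage: $0 YOUR_FASTA_FILE.fasta OUT_FILE_TRIMMED.FASTA\n" unless @ARGV == 2;

                my $reader = Bio::SeqIO->new(-format => 'fasta', -file => $ARGV[0]);
                my $writer = Bio::SeqIO->new(-format => 'fasta', -file => ">$ARGV[1]");

                while (my $seq = $reader->next_seq)
                {
                    ## works only on forward strand & 0 mismatches!!!!
                    ## .* is greedy: the match runs from the first head motif to the last tail motif
                    if ($seq->seq =~ /(CCAGTATTTGGTA.*AGTTGATAACTGGGAA)/i)
                    {
                        # keep only the matched region, head and tail motifs included
                        $seq->seq($1);
                        $writer->write_seq($seq);
                    }
                    else
                    {
                        print STDERR "Could not find head and/or tail sequences for ", $seq->id, "\n";
                    }
                }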
                Last edited by baika; 03-05-2013, 11:04 AM. Reason: spelling mistake
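The comment in the script notes that it works only on the forward strand. If some sequences in the file might be in the reverse orientation, one possible extension is to retry the match on the reverse complement. This is a minimal core-Perl sketch, not part of the script above: revcomp and trim_either_strand are hypothetical helper names, the test read is made up, and mismatches are still not tolerated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse-complement a DNA string (plain A/C/G/T only; other characters,
# such as N or ambiguity codes, are reversed but left uncomplemented).
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Hypothetical helper: try the sequence as given, then its reverse complement.
# Returns the trimmed region in forward orientation, or undef if neither matches.
sub trim_either_strand {
    my ($dna, $head, $tail) = @_;
    for my $candidate ($dna, revcomp($dna)) {
        return $1 if $candidate =~ /(\Q$head\E.*\Q$tail\E)/i;
    }
    return;
}

my $head = 'ATAGCCGGCACCCTGGT';
my $tail = 'GGCCATATGAGTGGGCC';

# Made-up example: a flanked amplicon that was stored in reverse orientation
my $amplicon      = $head . 'ACGT' . $tail;
my $reversed_read = revcomp('TT' . $amplicon . 'AA');

my $result = trim_either_strand($reversed_read, $head, $tail);
print defined $result ? "$result\n" : "no match\n";
```

Inside the BioPerl loop, the same idea could be applied to the string returned by $seq->seq before calling $seq->seq($1). Note that a sequence rescued this way is written out reverse-complemented relative to the input, which may or may not be what you want.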
