SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Trimming WG BAM file for exome tahamasoodi Bioinformatics 0 11-16-2012 11:58 PM
Convert WIG file into Fasta file kumardeep Bioinformatics 3 08-23-2012 04:56 AM
454 fasta qual description format NGS QC toolkit Quality Trimming pepperoni Bioinformatics 1 02-24-2012 08:55 AM
Whitespace in FASTA file mnkyboy Bioinformatics 2 11-13-2011 07:49 PM
Tophat - fasta file ytmnd85 Bioinformatics 0 01-19-2010 12:38 PM

Reply
 
Thread Tools
Old 03-04-2013, 12:35 PM   #1
baika
Member
 
Location: USA

Join Date: Apr 2012
Posts: 12
Default trimming FASTA file

Hi All
I have a multi-fasta file with approximately 2000 sequences of varying length that have different start and end regions. I need to trim all my sequences in a way that all the sequences start with “ATAGCCGGCACCCTGGT” and ends with “GGCCATATGAGTGGGCC”. Any script would be really helpful to remove bases upstream of ATAGCCGGCACCCTGGT and downstream of GGCCATATGAGTGGGCC so that all the sequences become of equal length, and have the same start and end.

Thanks for your help

Baika
baika is offline   Reply With Quote
Old 03-04-2013, 01:05 PM   #2
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

Code:
#!/usr/bin/perl
use strict;
use Bio::SeqIO;

## debugging/tuning left as exercise for student :-)

my $reader=new Bio::SeqIO(-format=>'fasta',-file=>$ARGV[0]);
my $writer=new Bio::SeqIO(-format=>'fasta',-file=>$ARGV[0]."trimmed");
while (my $rec=$reader->next_seq)
{
  ## works only on forward strand & 0 mismatches!!!!
   if ($seq->seq=~/(ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC)/i)
   {
   $rec->seq($1);  
   $writer->write_seq($rec);
  } 
  else
  {
     print STDERR "Could not find head and/or tail sequences for ",$seq->id,"\n";
   }
}
krobison is offline   Reply With Quote
Old 03-04-2013, 01:34 PM   #3
baika
Member
 
Location: USA

Join Date: Apr 2012
Posts: 12
Default

Thanks Krobison for the perl script. It gives an error-
Global symbol "$seq" requires explicit package name at ../../Scripts/trim_fastaseq.pl line 12.
Global symbol "$seq" requires explicit package name at ../../Scripts/trim_fastaseq.pl line 19.

Baika

Last edited by baika; 03-04-2013 at 01:35 PM. Reason: wrong ID
baika is offline   Reply With Quote
Old 03-04-2013, 06:25 PM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Sorry baika; I was going to answer until I saw this line in Keith's code:

Quote:
## debugging/tuning left as exercise for student :-)
Keith, did you stick the bug in there on purpose?
kmcarr is offline   Reply With Quote
Old 03-04-2013, 08:12 PM   #5
A_Morozov
Member
 
Location: Russia, Irkutsk

Join Date: Feb 2011
Posts: 40
Default

No fun, guys. Baika might be (and claims to be in introduction section) a non-bioniformatitian stuck with what is not his/her area of expertise. If you went to genomics/bioinformatics lab, you should at least learn some Perl/Python, but for now - seems like it should be $rec, not $seq under regexp (the scary thingy with // and lots of uppercase).
A_Morozov is offline   Reply With Quote
Old 03-05-2013, 02:47 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by A_Morozov View Post
No fun, guys. Baika might be (and claims to be in introduction section) a non-bioniformatitian stuck with what is not his/her area of expertise. If you went to genomics/bioinformatics lab, you should at least learn some Perl/Python, but for now - seems like it should be $rec, not $seq under regexp (the scary thingy with // and lots of uppercase).
Oh, it was late and I was feeling a tad impish; I wasn't going to leave baika hanging long. Anyway the better solution is to change line #9, naming the first object $seq.

Code:
Change

while (my $rec=$reader->next_seq)

to

while (my $seq=$reader->next_seq)
kmcarr is offline   Reply With Quote
Old 03-05-2013, 04:57 AM   #7
d1antho
Member
 
Location: Ireland

Join Date: Mar 2012
Posts: 15
Default

Not that it should make a difference, because it is unlikely to match in the middle of a sequence, but the the match operator should be bounded to the start and the end of the sequence read. But honestly I doubt it should make any difference at all

So this
Code:
if ($seq->seq=~/(ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC)/i)
   {
   $rec->seq($1);  
   $writer->write_seq($rec);
  }
Should then be
Code:
if ($seq->seq=~/(^ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC$)/i)
   {
   $rec->seq($1);  
   $writer->write_seq($rec);
  }
d1antho is offline   Reply With Quote
Old 03-05-2013, 10:03 AM   #8
baika
Member
 
Location: USA

Join Date: Apr 2012
Posts: 12
Default finally working

Thanks krobison for writing this script, and kmcarr for pointing out the error. After incorporating all your suggestions and help from my friend Robert, finally it is working.

Thank you all

baika

Code:
#!/usr/bin/perl -w

#Usage: trim_fasta.pl YOUR_FASTA_FILE.fasta OUT_FILE_TRIMMED.FASTA

use strict;
use Bio::SeqIO;

my $reader=new Bio::SeqIO(-format=>'fasta',-file=>$ARGV[0]);
my $writer=new Bio::SeqIO(-format=>'fasta',-file=>">$ARGV[1]");
while (my $seq=$reader->next_seq)
{
  ## works only on forward strand & 0 mismatches!!!!
   if ($seq->seq=~/(CCAGTATTTGGTA.*AGTTGATAACTGGGAA)/i)
   {
   $seq->seq($1);  
   $writer->write_seq($seq);
  } 
  else
  {
     print STDERR "Could not find head and/or tail sequences for ",$seq->id,"\n";
   }
}

Last edited by baika; 03-05-2013 at 10:04 AM. Reason: spelling mistake
baika is offline   Reply With Quote
Reply

Tags
fasta trimming

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:14 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO