![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Trimming WG BAM file for exome | tahamasoodi | Bioinformatics | 0 | 11-17-2012 12:58 AM |
Convert WIG file into Fasta file | kumardeep | Bioinformatics | 3 | 08-23-2012 05:56 AM |
454 fasta qual description format NGS QC toolkit Quality Trimming | pepperoni | Bioinformatics | 1 | 02-24-2012 09:55 AM |
Whitespace in FASTA file | mnkyboy | Bioinformatics | 2 | 11-13-2011 08:49 PM |
Tophat - fasta file | ytmnd85 | Bioinformatics | 0 | 01-19-2010 01:38 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: USA Join Date: Apr 2012
Posts: 12
|
![]()
Hi All
I have a multi-fasta file with approximately 2000 sequences of varying length that have different start and end regions. I need to trim all my sequences in a way that all the sequences start with “ATAGCCGGCACCCTGGT” and ends with “GGCCATATGAGTGGGCC”. Any script would be really helpful to remove bases upstream of ATAGCCGGCACCCTGGT and downstream of GGCCATATGAGTGGGCC so that all the sequences become of equal length, and have the same start and end. Thanks for your help Baika |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Boston area Join Date: Nov 2007
Posts: 747
|
![]() Code:
#!/usr/bin/perl use strict; use Bio::SeqIO; ## debugging/tuning left as exercise for student :-) my $reader=new Bio::SeqIO(-format=>'fasta',-file=>$ARGV[0]); my $writer=new Bio::SeqIO(-format=>'fasta',-file=>$ARGV[0]."trimmed"); while (my $rec=$reader->next_seq) { ## works only on forward strand & 0 mismatches!!!! if ($seq->seq=~/(ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC)/i) { $rec->seq($1); $writer->write_seq($rec); } else { print STDERR "Could not find head and/or tail sequences for ",$seq->id,"\n"; } } |
![]() |
![]() |
![]() |
#3 |
Member
Location: USA Join Date: Apr 2012
Posts: 12
|
![]()
Thanks Krobison for the perl script. It gives an error-
Global symbol "$seq" requires explicit package name at ../../Scripts/trim_fastaseq.pl line 12. Global symbol "$seq" requires explicit package name at ../../Scripts/trim_fastaseq.pl line 19. Baika Last edited by baika; 03-04-2013 at 02:35 PM. Reason: wrong ID |
![]() |
![]() |
![]() |
#4 | |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
![]()
Sorry baika; I was going to answer until I saw this line in Keith's code:
Quote:
![]() |
|
![]() |
![]() |
![]() |
#5 |
Member
Location: Russia, Irkutsk Join Date: Feb 2011
Posts: 40
|
![]()
No fun, guys. Baika might be (and claims to be in introduction section) a non-bioniformatitian stuck with what is not his/her area of expertise. If you went to genomics/bioinformatics lab, you should at least learn some Perl/Python, but for now - seems like it should be $rec, not $seq under regexp (the scary thingy with // and lots of uppercase).
|
![]() |
![]() |
![]() |
#6 | |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
![]() Quote:
Code:
Change while (my $rec=$reader->next_seq) to while (my $seq=$reader->next_seq) |
|
![]() |
![]() |
![]() |
#7 |
Member
Location: Ireland Join Date: Mar 2012
Posts: 15
|
![]()
Not that it should make a difference, because it is unlikely to match in the middle of a sequence, but the the match operator should be bounded to the start and the end of the sequence read. But honestly I doubt it should make any difference at all
So this Code:
if ($seq->seq=~/(ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC)/i) { $rec->seq($1); $writer->write_seq($rec); } Code:
if ($seq->seq=~/(^ATAGCCGGCACCCTGGT.*GGCCATATGAGTGGGCC$)/i) { $rec->seq($1); $writer->write_seq($rec); } |
![]() |
![]() |
![]() |
#8 |
Member
Location: USA Join Date: Apr 2012
Posts: 12
|
![]()
Thanks krobison for writing this script, and kmcarr for pointing out the error. After incorporating all your suggestions and help from my friend Robert, finally it is working.
Thank you all baika Code:
#!/usr/bin/perl -w #Usage: trim_fasta.pl YOUR_FASTA_FILE.fasta OUT_FILE_TRIMMED.FASTA use strict; use Bio::SeqIO; my $reader=new Bio::SeqIO(-format=>'fasta',-file=>$ARGV[0]); my $writer=new Bio::SeqIO(-format=>'fasta',-file=>">$ARGV[1]"); while (my $seq=$reader->next_seq) { ## works only on forward strand & 0 mismatches!!!! if ($seq->seq=~/(CCAGTATTTGGTA.*AGTTGATAACTGGGAA)/i) { $seq->seq($1); $writer->write_seq($seq); } else { print STDERR "Could not find head and/or tail sequences for ",$seq->id,"\n"; } } Last edited by baika; 03-05-2013 at 11:04 AM. Reason: spelling mistake |
![]() |
![]() |
![]() |
Tags |
fasta trimming |
Thread Tools | |
|
|