SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
extract data from fasta-files with perl?? anna_ Bioinformatics 20 02-17-2016 08:29 AM
extract full fasta file for local blast hits Oyster Bioinformatics 8 02-16-2016 01:34 PM
perl : Remove redundant feature in fasta file StephaniePi83 Bioinformatics 9 12-15-2012 07:01 PM
Perl: get specific base from FASTA file. njh_TO Bioinformatics 6 02-02-2012 06:34 AM
Help with glimmer multi-extract sbberes Bioinformatics 2 03-19-2010 02:35 PM

Reply
 
Thread Tools
Old 02-16-2011, 09:28 AM   #1
andreitudor
Member
 
Location: Quebec

Join Date: Feb 2011
Posts: 21
Default Extract sequence from multi fasta file with PERL

Hello,

I want to make a script that will extract a sequence from a multifasta file. I have made a file with 19 genomes from NCBI and I want to be able to extract a genome that I need from it. I have played a little with Bio::SeqIO and learned how to create a object that has the hole file in it and then right the whole information in another file.
What i do not know how to do is how to select only the fasta sequence from the genome that I want and write it onto a file.
I think I have to create a third object that would contain only that genome, but I still don't know how to read that part from my sequence object...

Thanks in advance.
andreitudor is offline   Reply With Quote
Old 02-16-2011, 10:15 AM   #2
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Hi,

I've attached a script which can do this. If i understand it correctly you have a file like;

>chr1
AGCTGATGATAGT...
>chr2
ACAAAATAGTCGAT....
>chr3
....

And your perl script would be something like;

perl extractSequence.pl genomefile.fa chr1

where 'chr1' corresponds to a sequence named chr1 (indicated by chr1)?

Say you have a more complicated file like;

>chr1_coverage1000_length100
AGATGTATGTTAGA

You can do something like;

perl extractSequence.pl genomefile.fa chr1_.

which will extract all the sequences containing the header chr1_

To store the results, do;

perl extractSequence.pl genomefile.fa chr1 > filename.txt

If this is what you want, you can use my script.

Boetsie
Attached Files
File Type: pl extractSequence.pl (1.0 KB, 2383 views)
boetsie is offline   Reply With Quote
Old 02-16-2011, 11:13 AM   #3
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

This can be done with Biopieces (www.biopieces.org) like this:

Code:
read_fasta -i ncbi_genomes.fna | grab -p my_genome | write_fasta -o my_genome.fna -x

Cheers


Martin
maasha is offline   Reply With Quote
Old 02-16-2011, 11:38 AM   #4
NM_010117
Member
 
Location: US

Join Date: Oct 2010
Posts: 14
Default

This script might be a little easier to wrap your head around (there's less code anyway) and does the same thing as boetsie's:

Code:
#!/usr/bin/perl -w
use strict;
my $file = shift;
my $pattern = shift;
my $data = `cat $file`;
my ($fa) = $data =~ /(>$pattern[^>]+)/s;
print $fa;
Mind you boetsie's implementation has the benefit of not having to read in the entire file every time (i.e. her implementation "slurps" in the file to avoid needing to use a large amount of memory). However, this of course isn't an issue with small files (e.g. bacterial genomes).
NM_010117 is offline   Reply With Quote
Old 02-16-2011, 12:14 PM   #5
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 43
Default

another way (if you're already set up for local BLAST searches) -- use 'fastacmd' to extract whatever records you need from a fasta db you've created using 'formatdb'

(this refers to the old NCBI BLAST -- the newer NCBI 'blast+' uses different but analogous toolset:

http://www.ncbi.nlm.nih.gov/staff/ta.../pc_setup.html)
ssully is offline   Reply With Quote
Old 02-16-2011, 12:32 PM   #6
apc2010
Junior Member
 
Location: California

Join Date: Jan 2010
Posts: 1
Default

If you need sequences extracted from a multi-FASTA and are open to using a pre-existing tool, I would also suggest either the faSomeRecords or faOneRecord command line utilities from UCSC.

They have versions of this tool for OSX and Linux. Here is a link to the executable downloads:

http://hgdownload.cse.ucsc.edu/admin/exe/

The difference between the two: faOneRecord takes the sequence name to extract from the command line, faSomeRecords reads in a file of 1 or more sequence names to extract from the multi-FASTA.

Usage:
Code:
================================================================
========   faOneRecord   ====================================
================================================================
faOneRecord - Extract a single record from a .FA file
usage:
   faOneRecord in.fa recordName

================================================================
========   faSomeRecords   ====================================
================================================================
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.
apc2010 is offline   Reply With Quote
Old 02-16-2011, 12:53 PM   #7
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

There was a previous thread which explained how to do this with the BLAST tools (old and new)

http://seqanswers.com/forums/showthr...light=fastacmd

The problem with using Bio::SeqIO (or a simple perl script in general) is that there is no indexing of the file. This means that to pull out a specific sequence you have to scan through the file from the beginning until you find the sequence you are interested in. If your input file isn't too huge and you only need to do it for a small number of sequences or very infrequently that would work. But if the input file is ginormous (e.g. if your 19 genomes from NCBI are all big, mammalian genomes) and you want to repeatedly extract individual sequences then an indexed method like the BLAST tools is advisable.

Last edited by kmcarr; 02-16-2011 at 01:02 PM.
kmcarr is offline   Reply With Quote
Old 02-16-2011, 09:17 PM   #8
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

There's also Bio:B::Fasta module for Perl. One additional advantage to this tool is that you can easily get a specific range of a sequence as a subsequence.

I'm guilty of frequently doing this as a one-liner -- slow on a large database and a pain for a long list of sequences (or if you get the regex wrong), but sometimes it's easier than remembering a more specific tool.

Code:
perl -n -e '$on=(/^>(idA|idB|idC)/) if (/^>/); print $_ if ($on);' sequences.fa
krobison is offline   Reply With Quote
Old 02-17-2011, 08:01 AM   #9
andreitudor
Member
 
Location: Quebec

Join Date: Feb 2011
Posts: 21
Default

Hey guys,

Thanks for your answers. I do not want to use the tools provided from blast+ because I want to do it on my own. I looked into BIO:: DB:Fasta and I decided to use it, since it has a function that indexes your fasta file and then you can search by ID.
Now I have one problem, when I create the indexed file BIO:: DB:Fasta gives me this error :

indexing was interrupted, so unlinking reference.fna.index at /usr/share/perl5/Bio/DB/Fasta.pm line 1053.
Was not able to open files!!
Full error is


------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Each line of the fasta entry must be the same length except the last.
Line above #46357 '
..' is 69 != 71 chars.
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:368
STACK: Bio:B::Fasta::calculate_offsets /usr/share/perl5/Bio/DB/Fasta.pm:770
STACK: Bio:B::Fasta::index_file /usr/share/perl5/Bio/DB/Fasta.pm:680
STACK: Bio:B::Fasta::new /usr/share/perl5/Bio/DB/Fasta.pm:491
STACK: extract.pl:22
-----------------------------------------------------------

From this I understand that the lines in my fasta file are not equal. The only problem is that the line that is not the same length is the ID line. What can I do about this??

Andrei
andreitudor is offline   Reply With Quote
Old 02-17-2011, 08:38 AM   #10
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

Check the line lengths; the program is complaining that there is a sequence line of different length; it could well be some stray spaces; take a look at line #46357 with a text editor

This little program will find the numbers for all your "illegal" lines
[CODE]
#!/usr/bin/perl
use strict;
## find & print line number for all "illegal" FASTA lines
my %lengths=();
my $lastId=undef;
my $lineCount=0;
my $len=undef;
while ($_ = <>)
{
$lineCount++;
chomp;
if (/^>(\S+)/)
{
$lastId=$1;
if (defined $len) # last line of entry allowed to be ragged
{
delete $lengths{$len}->{$lineCount-1};
}
}
else
{
$len=length($_);
$lengths{$len}={} unless (ref $lengths{$len});
$lengths{$len}->{$lineCount}=$lastId;
}
}

my @lengths=sort {scalar(keys %{$lengths{$b}})<=>scalar(keys %{$lengths{$a}})} keys %lengths;
my $modalLength=shift(@lengths);
print "# modalLength=$modalLength ",scalar(keys %{$lengths{$modalLength}}),"\n";
foreach my $len(@lengths)
{
foreach my $lineCount(sort {$a<=>$b} keys %{$lengths{$len}})
{
print join("\t",$len,$lineCount,$lengths{$len}->{$lineCount}),"\n";
}
}
[/CODE
krobison is offline   Reply With Quote
Old 08-05-2011, 09:43 AM   #11
mghita
Member
 
Location: Cambridge

Join Date: Aug 2011
Posts: 10
Default Hi

Hi,

I might sound dumb to people around here , but how do I install/open/use the command? I tried faSomeRecords in.fa listFile out.fa with my files and it says "-bash: faSomeRecords: command not found", I added it to my path too. I'm new in the area and have no clue

Madalina



Quote:
Originally Posted by apc2010 View Post
If you need sequences extracted from a multi-FASTA and are open to using a pre-existing tool, I would also suggest either the faSomeRecords or faOneRecord command line utilities from UCSC.

They have versions of this tool for OSX and Linux. Here is a link to the executable downloads:

http://hgdownload.cse.ucsc.edu/admin/exe/

The difference between the two: faOneRecord takes the sequence name to extract from the command line, faSomeRecords reads in a file of 1 or more sequence names to extract from the multi-FASTA.

Usage:
Code:
================================================================
========   faOneRecord   ====================================
================================================================
faOneRecord - Extract a single record from a .FA file
usage:
   faOneRecord in.fa recordName

================================================================
========   faSomeRecords   ====================================
================================================================
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.
mghita is offline   Reply With Quote
Old 08-08-2011, 08:55 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,583
Default

Quote:
Originally Posted by mghita View Post
Hi,

I might sound dumb to people around here , but how do I install/open/use the command? I tried faSomeRecords in.fa listFile out.fa with my files and it says "-bash: faSomeRecords: command not found", I added it to my path too. I'm new in the area and have no clue

Madalina
Madalina,

I will assume that you have downloaded the correct version of the compiled program for your operating system and saved it in a folder (or home dir).

You may need to make this file executable by adding the correct unix permission: chmod u+x /path_to/faSomeRecords (if you are in the same directory where this program is then use chmod u+x ./faSomeRecords).

Now you should be able to execute the file by following the example form earlier post.
e.g. /path_to/faSomeRecords your_data_file
GenoMax is offline   Reply With Quote
Old 08-08-2011, 09:20 AM   #13
mghita
Member
 
Location: Cambridge

Join Date: Aug 2011
Posts: 10
Default

I have added the program to my path and I set the permission right, but now I have another issue:
"You need the Rosetta software to run faSomeRecords. The Rosetta installer is in Optional Installs on your Mac OS X installation disc."

and I don't have Rosetta installed, or the CD for installation, so I don't know how to handle this problem. Any suggestions?


Thanks,
Madalina


Quote:
Originally Posted by GenoMax View Post
Madalina,

I will assume that you have downloaded the correct version of the compiled program for your operating system and saved it in a folder (or home dir).

You may need to make this file executable by adding the correct unix permission: chmod u+x /path_to/faSomeRecords (if you are in the same directory where this program is then use chmod u+x ./faSomeRecords).

Now you should be able to execute the file by following the example form earlier post.
e.g. /path_to/faSomeRecords your_data_file
mghita is offline   Reply With Quote
Old 08-08-2011, 09:40 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,583
Default

Quote:
Originally Posted by mghita View Post
I have added the program to my path and I set the permission right, but now I have another issue:
"You need the Rosetta software to run faSomeRecords. The Rosetta installer is in Optional Installs on your Mac OS X installation disc."

and I don't have Rosetta installed, or the CD for installation, so I don't know how to handle this problem. Any suggestions?


Thanks,
Madalina
Madalina,

If you are connected to the internet you should automatically be offered the option to download rosetta and install it.

Do you have a PowerPC- or an intel-based Mac? What OS are you running?
GenoMax is offline   Reply With Quote
Old 08-08-2011, 09:46 AM   #15
mghita
Member
 
Location: Cambridge

Join Date: Aug 2011
Posts: 10
Default

Quote:
Originally Posted by GenoMax View Post
Madalina,

If you are connected to the internet you should automatically be offered the option to download rosetta and install it.

Do you have a PowerPC- or an intel-based Mac? What OS are you running?


I have Mac OS X 10.6.8, 3.06 GHz. I just get that message in bash, I don't get any install option. I tried to download it, but it doesn't work.
mghita is offline   Reply With Quote
Old 08-08-2011, 01:18 PM   #16
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

Quote:
Originally Posted by mghita View Post
I have added the program to my path and I set the permission right, but now I have another issue:
"You need the Rosetta software to run faSomeRecords. The Rosetta installer is in Optional Installs on your Mac OS X installation disc."

and I don't have Rosetta installed, or the CD for installation, so I don't know how to handle this problem. Any suggestions?


Thanks,
Madalina
Quote:
Originally Posted by GenoMax View Post
Madalina,

If you are connected to the internet you should automatically be offered the option to download rosetta and install it.

Do you have a PowerPC- or an intel-based Mac? What OS are you running?
Quote:
Originally Posted by mghita View Post
I have Mac OS X 10.6.8, 3.06 GHz. I just get that message in bash, I don't get any install option. I tried to download it, but it doesn't work.
Madalina,

Your Mac has an Intel CPU but the version of faSomeRecords which you are trying to run is compiled for PowerPC based Macs. You could try to intall Rosetta (Rosetta is a compatibility layer which allows PPC code to run on Intel Macs) but the easier course of action would be to install a proper version of the binary for your computer.

If you go back to the download site (http://hgdownload.cse.ucsc.edu/admin/exe/) you will see that there are two directories for macOSX software, one for PowerPC (macOSX.ppc) and one for Intel (macOSX.i386). Make sure to download and install the program from the macOSX.i386 directory.
kmcarr is offline   Reply With Quote
Old 08-09-2011, 01:28 AM   #17
mghita
Member
 
Location: Cambridge

Join Date: Aug 2011
Posts: 10
Default

Hi,

Yes, that seems to work, but the command itself doesn't. The reads in my fasta file (file.fas) are named @Frag_1, @Frag_2 .... @Frag_20000. I want to extract some of them - I have their names in a text file (diff.txt) saved like this

@Frag_93
@Frag_530
@Frag_2183
@Frag_3988
@Frag_7733

I used:

faSomeRecord file.fas diff.txt output.fas

and output.fas is empty. Any idea why this happens?


Thanks
Madalina
mghita is offline   Reply With Quote
Old 08-09-2011, 05:34 AM   #18
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,583
Default

Quote:
Originally Posted by mghita View Post
Hi,

Yes, that seems to work, but the command itself doesn't. The reads in my fasta file (file.fas) are named @Frag_1, @Frag_2 .... @Frag_20000. I want to extract some of them - I have their names in a text file (diff.txt) saved like this

@Frag_93
@Frag_530
@Frag_2183
@Frag_3988
@Frag_7733

I used:

faSomeRecord file.fas diff.txt output.fas

and output.fas is empty. Any idea why this happens?


Thanks
Madalina
NOTE: Please use new names for the files as shown below on the command lines. This would preserve your original files as they are.

Madalina,

The program is expecting the fasta identifiers to start with ">" rather than "@". You can do the replacement with a program called "sed" that should be there in MacOS (do not have a Mac handy to check that out).

Do this on the command line (note single quotes):

sed 's/@/>/g' original_fasta_file > new_file.fas

The "new_file.fas" should have all "@" replaced by ">".

Remember you need fasta id's (without the ">") in the file you supply for extraction. You can use the same "sed" program to strip the "@" signs from your fasta identifiers like this,

sed 's/@//g' diff.txt new_diff.txt

Now you can use the two new files you created to get the output.

faSomeRecord new_file.fas new_diff.txt output.fas

Last edited by GenoMax; 08-09-2011 at 05:36 AM. Reason: adding_info_to_clarify
GenoMax is offline   Reply With Quote
Old 08-09-2011, 05:58 AM   #19
mghita
Member
 
Location: Cambridge

Join Date: Aug 2011
Posts: 10
Default

I have given up. I replaced the @ with > and still didn't work. I have combined a little awk and R and does my job just fine. Thanks a lot for the effort!

Madalina
mghita is offline   Reply With Quote
Old 08-22-2011, 02:22 PM   #20
scopak
Junior Member
 
Location: Ohio

Join Date: Dec 2009
Posts: 1
Default

krobison, I too like Perl one-liners.

In the example below, sed bookends are used to add and remove blank lines for the regex search.

sed 's/^>.*/\n&/' <in.fasta | perl -e ' while(<>){ print if(/^>chr1/.../^\n/); }' | sed '/^$/d' >patterns.fasta

Sed is used to add a blank line above each fasta record beginning with '>.*' in the file in.fasta. The stdout is then piped to a Perl range finder that searches for lines that begin with >chr1 and all sequence lines to the next blank line (^\n).
Finally, blank lines are removed with sed and the matching records are saved to the outfile, patterns.fasta.

Hope that helps
scopak is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:29 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO