08-13-2013, 10:14 AM   #3
kmcarr

Here is a script I wrote a while back that does almost what you want. It takes a FASTA file, a text file listing sequence IDs (one per line), and a mode argument that says whether to include or exclude the listed IDs from the output. You could simply run the script twice, once in each mode, to get the two complementary outputs, or, if you feel like it, modify the code to write two output files. As it stands, output goes to STDOUT, so each run produces only one output, which you capture by redirecting STDOUT to a file.

Code:
Usage:

% subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]

Example:

% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta
If you do not specify the -m argument, the script defaults to 'include' mode (i).
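
If you would rather get both files from one pass instead of running the script twice, the idea can be sketched roughly as below. This is Python, not the attached Perl script, and the names split_fasta and read_id_list are made up for illustration. It assumes IDs are matched on the first whitespace-delimited token of the defline and that any lines before the first defline can be dropped.

```python
import io

def read_id_list(list_lines):
    """Collect one ID per line, ignoring blank lines and surrounding whitespace."""
    return {line.strip() for line in list_lines if line.strip()}

def split_fasta(fasta_lines, wanted_ids, in_list_out, not_in_list_out):
    """Write records whose ID is in wanted_ids to in_list_out, all others to not_in_list_out."""
    out = None  # no destination until the first defline is seen
    for line in fasta_lines:
        if line.startswith(">"):
            # ID = first whitespace-delimited token after the '>' (assumed matching rule)
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            out = in_list_out if seq_id in wanted_ids else not_in_list_out
        if out is not None:
            out.write(line)  # defline and sequence lines go to the same destination

# Small in-memory demo; with real files, pass open file handles instead.
fasta = io.StringIO(">seq1 first\nACGT\nGGCC\n>seq2 second\nTTTT\n>seq3\nAAAA\n")
ids = read_id_list(io.StringIO("seq1\nseq3\n"))
in_list, not_in_list = io.StringIO(), io.StringIO()
split_fasta(fasta, ids, in_list, not_in_list)
```

In the demo, seq1 and seq3 end up in in_list and seq2 in not_in_list, which corresponds to the inList.fasta and notInList.fasta outputs in the example above.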

A note about ID matching: the script matches on the first whitespace-delimited token of the defline, i.e. the text between the '>' and the first whitespace. If your defline is:

Code:
>sequenceID sequence description follows
the script will only attempt to match 'sequenceID', so make sure that is the text in your list file.
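
To check what your list file needs to contain, the matching rule can be mimicked with a few lines of Python. This is only a sketch of the rule as described here, not the code from the attached script; the function name defline_id is hypothetical, and the second defline is an invented example.

```python
def defline_id(defline):
    """Return the first whitespace-delimited token after '>', or None if not a defline."""
    if not defline.startswith(">"):
        return None
    fields = defline[1:].split()  # split() treats spaces and tabs alike
    return fields[0] if fields else ""

first = defline_id(">sequenceID sequence description follows")
second = defline_id(">contig7\tlength=100")  # tab-separated description
third = defline_id("ACGTACGT")               # a sequence line, not a defline
print(first, second, third)
```

Running every defline of your FASTA file through a function like this and comparing the results with your list is a quick way to spot IDs that will never match.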
Attached Files
File Type: pl subSetFasta.pl (2.4 KB, 60 views)

Last edited by kmcarr; 08-13-2013 at 10:16 AM. Reason: Add note about default mode.