Here is a script I wrote a while back that does almost what you want. It takes as input a FASTA file, a text file with a list of sequence IDs (one per line), and a mode argument to include or exclude the IDs in your list from the output. You could simply run the script twice, once in each mode, to get the two complementary outputs, or, if you feel like it, modify the code to generate two output files. As it works now, output is written to STDOUT, so you can capture only one output per run by redirecting STDOUT to a file.
Code:
Usage:
% subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]
Example:
% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta
If you do not specify the mode argument (-m), the script defaults to 'include' mode.
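In case it helps to see the include/exclude logic spelled out, here is a minimal Python sketch of the same idea. This is not the Perl script itself; the function name `subset_fasta` and its arguments are made up for illustration, and I am assuming the 'i' and 'e' modes behave as described above, with 'i' as the default.

```python
def subset_fasta(fasta_lines, wanted_ids, mode="i"):
    """Yield the FASTA lines of the records kept under the given mode.

    mode "i" keeps records whose ID is in wanted_ids,
    mode "e" keeps every other record.
    (Hypothetical helper, not part of subSetFasta.pl.)
    """
    if mode not in ("i", "e"):
        raise ValueError("mode must be 'i' or 'e'")
    keep = False  # lines before the first defline are dropped
    for line in fasta_lines:
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep = (seq_id in wanted_ids) == (mode == "i")
        if keep:
            yield line


# Small made-up example: two records, one ID in the list.
fasta = [">seq1 first\n", "ACGT\n", ">seq2 second\n", "GGCC\n", "TTAA\n"]
ids = {"seq2"}
included = "".join(subset_fasta(fasta, ids))            # default mode is "i"
excluded = "".join(subset_fasta(fasta, ids, mode="e"))
```

Calling the function once per mode mirrors running the script twice; to get both outputs in a single pass you would write each record to one of two open file handles instead of yielding it.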
A note about ID matching: the script bases a match on the first whitespace-delimited token on the defline (the text right after the '>'). If your defline is:
Code:
>sequenceID sequence description follows
The script will only attempt to match 'sequenceID', so make sure that is the text in your list file.
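To make the matching rule concrete, here is a rough Python sketch of how the ID could be pulled from a defline and how a list file could be read so the two compare cleanly. The helper names `defline_id` and `read_id_list` are hypothetical, and stripping whitespace and skipping blank lines in the list file are my assumptions, not something I am claiming the Perl script does.

```python
def defline_id(defline):
    """Return the first whitespace-delimited token after the leading '>'."""
    text = defline[1:] if defline.startswith(">") else defline
    fields = text.split()
    return fields[0] if fields else ""


def read_id_list(lines):
    """Collect one ID per line, ignoring blank lines and stray whitespace.

    (Assumed behaviour for illustration only.)
    """
    return {line.strip() for line in lines if line.strip()}


seq_id = defline_id(">sequenceID sequence description follows")
ids = read_id_list(["sequenceID\n", "\n", "  otherID  \n"])
print(seq_id)         # sequenceID
print(seq_id in ids)  # True
```

The practical upshot is that an entry like 'sequenceID sequence description follows' in the list file would not match, because only the first token of the defline is compared.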