SEQanswers

-   -   how to split a fasta file according to a list of gene ID (http://seqanswers.com/forums/showthread.php?t=32750)

lran2008 08-13-2013 08:27 AM

how to split a fasta file according to a list of gene ID
 
Hi ALL,

I have a FASTA file and I want to split it into two FASTA files according to a list of sequence names in a text file (one sequence name per line). Sequences whose names match an entry in the list should go to one FASTA file, and all the others to a second file.

Could anybody provide me a script or some programs to perform this work? There are some online tools, but it would take a large amount of time to upload my file.

Thanks.

rhinoceros 08-13-2013 08:53 AM

If your sequences aren't split to multiple lines you can do this with grep. I think:

grep -A 1 -f yourSeqIDFile.txt yourFastaFile.fasta > SeqsFromIDList.fasta
grep -A 1 -v -f yourSeqIDFile.txt yourFastaFile.fasta > TheOtherSeqs.fasta

I might be remembering this wrong, though.


If you have QIIME, you can do this with filter_fasta.py.
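
For wrapped (multi-line) FASTA, or when one ID is contained in another, a small script is safer than grep: `grep -f` matches each pattern anywhere in the line, so `gene1` would also hit `gene10`, and GNU grep writes `--` separator lines between non-adjacent matches when `-A` is used. Below is a minimal Python sketch of the split, not tested on real data. The function names (`read_fasta`, `split_fasta`, `write_fasta`) are made up for this example, the file names are the ones from the grep commands above, and the ID is assumed to be the first whitespace-delimited token after `>`.

```python
import sys


def read_fasta(lines):
    """Yield (header, sequence_lines) records; sequences may span several lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def split_fasta(lines, wanted_ids):
    """Return (matched, unmatched) lists of records, using exact ID matches."""
    matched, unmatched = [], []
    for header, seq in read_fasta(lines):
        tokens = header[1:].split()
        seq_id = tokens[0] if tokens else ""
        (matched if seq_id in wanted_ids else unmatched).append((header, seq))
    return matched, unmatched


def write_fasta(records, path):
    """Write records back out, one header line followed by its sequence lines."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(header + "\n")
            for chunk in seq:
                out.write(chunk + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python split_fasta.py yourSeqIDFile.txt yourFastaFile.fasta
    with open(sys.argv[1]) as fh:
        ids = {line.strip() for line in fh if line.strip()}
    with open(sys.argv[2]) as fh:
        hit, rest = split_fasta(fh, ids)
    write_fasta(hit, "SeqsFromIDList.fasta")
    write_fasta(rest, "TheOtherSeqs.fasta")
```

This version holds all records in memory before writing, which keeps the sketch short; for a very large file you would write each record to its output file as it is read.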

kmcarr 08-13-2013 10:14 AM

1 Attachment(s)
Here is a script I wrote a while back that does almost what you want. It takes as input a FASTA file, a text file with a list of sequence IDs (one per line), and a mode argument to include or exclude the IDs in your list from the output. You could simply run the script twice, once in each mode, to get the two complementary outputs or, if you feel like it, modify the code to generate two output files. As it works now, output is written to STDOUT, so each run produces only one output, which you capture by redirecting STDOUT to a file.

Code:

Usage:

% subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]

Example:

% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta

If you do not specify a -mode argument the script defaults to the 'include' mode.

A note about ID matching: the script bases a match on the first whitespace-delimited token on the defline, i.e. the text between '>' and the first whitespace. If your defline is:

Code:

>sequenceID sequence description follows

The script will only attempt to match 'sequenceID', so make sure that is the text in your list file.
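
To make the matching rule concrete, here is a short Python illustration of it. This is not the code of subSetFasta.pl, only a sketch of the rule as described above; the function names `defline_id` and `load_id_list` are invented for the example, and accepting a leading '>' on list entries is my own assumption, not something the script is stated to do.

```python
def defline_id(defline):
    """Return the first whitespace-delimited token after the leading '>'."""
    text = defline[1:] if defline.startswith(">") else defline
    tokens = text.split()
    return tokens[0] if tokens else ""


def load_id_list(lines):
    """Build a set of IDs from list-file lines, ignoring blanks and stray whitespace."""
    ids = set()
    for line in lines:
        entry = defline_id(line.strip())  # assumption: tolerate a pasted '>' prefix
        if entry:
            ids.add(entry)
    return ids


if __name__ == "__main__":
    print(defline_id(">sequenceID sequence description follows"))
    print(sorted(load_id_list(["sequenceID\n", "\n", ">otherID extra text\n"])))
```

A common reason for getting no matches is a list file that contains the whole defline, or IDs with trailing spaces or Windows line endings, so normalising both sides the same way is worth the extra few lines.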

JohnN 08-13-2013 10:16 AM

Quote:

Originally Posted by lran2008 (Post 113367)

Try this: https://code.google.com/p/nash-bioin...ta.pl&can=2&q=

Hopefully it will do the job you need.

J

lran2008 08-13-2013 12:43 PM

Quote:

Originally Posted by rhinoceros (Post 113370)

Thanks. The second command didn't work.
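
A likely reason `grep -A 1 -v -f yourSeqIDFile.txt yourFastaFile.fasta` does not give the complement: with `-v`, grep selects every line that does not match an ID, which includes all the sequence lines, and `-A 1` then prints the line after each selected line, so the listed headers come back as trailing context. The Python sketch below emulates that line selection in a simplified way (plain substring matching instead of grep's regular expressions, no `--` separators) and contrasts it with a record-aware exclude. Both function names and the toy data are made up for illustration.

```python
def grep_v_after1(lines, patterns):
    """Emulate the line selection of `grep -v -A 1` (substring matching only)."""
    out = []
    carry = False  # True when the previous line was selected, so this one is context
    for line in lines:
        selected = not any(p in line for p in patterns)
        if selected or carry:
            out.append(line)
        carry = selected
    return out


def exclude_records(lines, ids):
    """Drop whole records (header plus sequence lines) whose ID is in ids."""
    out = []
    keep = True
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = not (tokens and tokens[0] in ids)
        if keep:
            out.append(line)
    return out


if __name__ == "__main__":
    fasta = [">seq1", "AAAA", ">seq2", "CCCC", ">seq3", "GGGG"]
    print(grep_v_after1(fasta, ["seq2"]))   # what the grep approach keeps
    print(exclude_records(fasta, {"seq2"}))  # what was actually wanted
```

In the emulation, the header of the listed sequence is printed because it follows a selected sequence line, and its own sequence line is selected because it does not contain the ID, so nothing is filtered out.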

lran2008 08-13-2013 12:49 PM

Quote:

Originally Posted by kmcarr (Post 113387)

Thanks very much. The script works perfectly!

JamieHeather 08-14-2013 03:00 AM

In case anyone needs more alternatives, you can also use fastq_select.tcl, which is bundled with MIRA. This was also discussed in an earlier thread, which might be useful.

maubp 08-14-2013 06:10 AM

If you want a Galaxy solution, try this:
http://toolshed.g2.bx.psu.edu/view/p...q_filter_by_id

Or this related but subtly different tool, which pulls out the reads in the order the IDs are given:
http://toolshed.g2.bx.psu.edu/view/p...q_select_by_id
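
The difference between the two behaviours (filter in file order versus select in ID-list order) can be sketched in a few lines of Python. This is not the code of either Galaxy tool, and how those tools treat missing or repeated IDs is not stated here; in this sketch missing IDs are skipped and the first record wins when an ID repeats. The function names are hypothetical.

```python
def filter_in_file_order(records, id_list):
    """Keep records whose ID is listed, in the order they occur in the file."""
    wanted = set(id_list)
    return [(seq_id, seq) for seq_id, seq in records if seq_id in wanted]


def select_in_list_order(records, id_list):
    """Return records in the order the IDs appear in id_list."""
    by_id = {}
    for seq_id, seq in records:
        by_id.setdefault(seq_id, seq)  # keep the first record if an ID repeats
    return [(seq_id, by_id[seq_id]) for seq_id in id_list if seq_id in by_id]


if __name__ == "__main__":
    records = [("seq1", "AAAA"), ("seq2", "CCCC"), ("seq3", "GGGG")]
    ids = ["seq3", "seq1", "missing"]
    print(filter_in_file_order(records, ids))
    print(select_in_list_order(records, ids))
```

Selecting in list order needs random access to the records, here an in-memory dictionary, whereas filtering can stream through the file once.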

lran2008 08-14-2013 09:21 AM

Quote:

Originally Posted by maubp (Post 113454)

This should work. I haven't tried it, so I don't know whether it can also output a FASTA file for the unmatched sequences.

maubp 08-15-2013 12:54 AM

Quote:

Originally Posted by lran2008 (Post 113483)

Yes, my sequence filter tool can produce a FASTA file with matched IDs, a FASTA file with non-matching IDs, or both (two FASTA files):
http://toolshed.g2.bx.psu.edu/view/p...q_filter_by_id

There is a preview/mockup of the tool available to view within the Tool Shed which should help explain this.


All times are GMT -8.