SEQanswers

08-13-2013, 09:27 AM   #1
lran2008
How to split a FASTA file according to a list of gene IDs

Hi ALL,

I have a FASTA file and I want to split it into two FASTA files according to a list of sequence names in a text file (one sequence name per line). Sequences whose names match an entry in the list should go to one FASTA file, and all the others to a second file.

Could anybody provide a script or recommend a program to do this? There are some online tools, but uploading my file would take a long time.

Thanks.

08-13-2013, 09:53 AM   #2
rhinoceros

If your sequences aren't wrapped across multiple lines, you can do this with grep, I think:

grep -A 1 -f yourSeqIDFile.txt yourFastaFile.fasta > SeqsFromIDList.fasta
grep -A 1 -v -f yourSeqIDFile.txt yourFastaFile.fasta > TheOtherSeqs.fasta

I might be remembering this wrong.


If you have QIIME, you can do this with filter_fasta.py.

Last edited by rhinoceros; 08-13-2013 at 09:56 AM.

08-13-2013, 11:14 AM   #3
kmcarr

Here is a script I wrote a while back that does almost what you want. It takes as input a FASTA file, a text file with a list of sequence IDs (one per line), and a mode argument to include or exclude the IDs in your list from the output. You could simply run the script twice, once in each mode, to get the two complementary outputs, or, if you feel like it, modify the code to generate two output files. As it stands, output is written to STDOUT, so each run produces only one output, which you capture by redirecting STDOUT to a file.

Code:
Usage:

% subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]

Example:

% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta
If you do not specify a mode (-m) argument, the script defaults to 'include' mode.

A note about ID matching: the script matches on the first whitespace-delimited token of the defline. If your defline is:

Code:
>sequenceID sequence description follows
The script will only attempt to match 'sequenceID', so make sure that is the text in your list file.
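
For readers who would rather get both outputs in a single pass, the awk sketch below applies the same matching rule (the first whitespace-delimited token after '>'). It is not kmcarr's script. The demo file contents are invented, and the output names simply reuse those from the example above. It assumes the ID list is not empty, because the NR == FNR idiom cannot tell the two input files apart when the first one has no lines.

```shell
# Hypothetical demo input: multi-line FASTA records plus an ID list.
cat > demo_mySeqs.fasta <<'EOF'
>geneA some description
ACGTAC
GTAC
>geneB
TTTT
>geneC another one
GGGG
CC
EOF
printf '%s\n' geneA geneC > demo_myList.txt

awk -v matched=inList.fasta -v unmatched=notInList.fasta '
    # First file (the ID list): remember every non-empty ID.
    NR == FNR { if (NF) ids[$1] = 1; next }
    # Header line of the FASTA file: choose the output for this record.
    /^>/ { id = substr($1, 2); out = (id in ids) ? matched : unmatched }
    # Every line, header or sequence, goes to the current output file.
    out != "" { print > out }
' demo_myList.txt demo_mySeqs.fasta
```

Because the output file is chosen only when a header line is seen, wrapped sequence lines follow their header automatically. Note that a file is only created if at least one record is routed to it.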
Attached Files
File Type: pl subSetFasta.pl (2.4 KB, 60 views)

Last edited by kmcarr; 08-13-2013 at 11:16 AM. Reason: Add note about default mode.

08-13-2013, 11:16 AM   #4
JohnN

In reply to lran2008 (post #1):
Try this: https://code.google.com/p/nash-bioin...ta.pl&can=2&q=

Hopefully it will do the job you need.

J

Last edited by JohnN; 08-13-2013 at 11:19 AM. Reason: Wrong URL

08-13-2013, 01:43 PM   #5
lran2008

In reply to rhinoceros (post #2):
Thanks. The second command didn't work.
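
A likely reason the second command fails: grep -v inverts the match line by line, so it drops only the header lines of the listed IDs and still prints their sequence lines. One possible workaround for two-line FASTA records is to join each header with its sequence before filtering, so that whole records are kept or dropped together. The sketch below assumes POSIX paste and tr, a grep that supports -w, an ID list without blank lines, and headers that contain no tab characters. The demo_* file names and sample records are made up for illustration.

```shell
# Hypothetical demo input: three two-line FASTA records and an ID list.
cat > demo_seqs.fasta <<'EOF'
>seq1 first
ACGT
>seq2 second
GGCC
>seq3 third
TTAA
EOF
printf '%s\n' seq1 seq3 > demo_ids.txt

# The original complement command: -v removes only the two matching header
# lines, so the sequences of seq1 and seq3 still end up in the output.
grep -A 1 -v -f demo_ids.txt demo_seqs.fasta > demo_broken.fasta

# Workaround: put header and sequence on one tab-separated line, filter
# whole records, then restore the two-line layout.
# -F treats the IDs as fixed strings, -w requires a whole-word match.
paste - - < demo_seqs.fasta | grep -w -F -f demo_ids.txt | tr '\t' '\n' > demo_listed.fasta
paste - - < demo_seqs.fasta | grep -v -w -F -f demo_ids.txt | tr '\t' '\n' > demo_others.fasta
```

Two caveats: -w treats punctuation as a word boundary, so the ID seq1 would also match a header such as >seq1.2, and an ID that appears as a word in a description would match as well. Joining the records first also avoids the "--" separator lines that GNU grep prints between non-adjacent -A groups.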

08-13-2013, 01:49 PM   #6
lran2008

In reply to kmcarr (post #3):
Thanks very much. The script works perfectly!

08-14-2013, 04:00 AM   #7
JamieHeather

In case anyone needs more alternatives, you can also use fastq_select.tcl, which is bundled with MIRA. This was also discussed in an earlier thread, which might be useful.

08-14-2013, 07:10 AM   #8
maubp
Peter (Biopython etc)

If you want a Galaxy solution, try this:
http://toolshed.g2.bx.psu.edu/view/p...q_filter_by_id

Or this related but subtly different tool, which pulls out the reads in the ID order given:
http://toolshed.g2.bx.psu.edu/view/p...q_select_by_id
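
The difference between filtering and selecting can be sketched in awk: filtering keeps the order of the FASTA file, while selecting emits records in the order of the ID list. This is only an illustration of the idea and not the Galaxy tool's code. It holds every record in memory, assumes a non-empty FASTA file, and uses invented demo_* files.

```shell
# Hypothetical demo input: a small FASTA file with multi-line records.
cat > demo_pool.fasta <<'EOF'
>geneA some description
ACGTAC
GTAC
>geneB
TTTT
>geneC another one
GGGG
CC
EOF
# IDs in the order the output should follow.
printf '%s\n' geneC geneA > demo_order.txt

awk '
    # First file (the FASTA): store each record under its ID.
    NR == FNR {
        if (/^>/) id = substr($1, 2)
        if (id != "") rec[id] = rec[id] $0 "\n"
        next
    }
    # Second file (the ID list): print stored records in list order.
    NF && ($1 in rec) { printf "%s", rec[$1] }
' demo_pool.fasta demo_order.txt > demo_ordered.fasta
```

IDs that are absent from the FASTA file are skipped silently, and records sharing an ID are concatenated under that ID.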

08-14-2013, 10:21 AM   #9
lran2008

In reply to maubp (post #8):
This should work. I haven't tried it, so I don't know whether it can also output a FASTA file for the unmatched sequences.

08-15-2013, 01:54 AM   #10
maubp

In reply to lran2008 (post #9):
Yes, my sequence filter tool can produce a FASTA file with matched IDs, a FASTA file with non-matching IDs, or both (two FASTA files):
http://toolshed.g2.bx.psu.edu/view/p...q_filter_by_id

There is a preview/mockup of the tool available to view within the Tool Shed which should help explain this.

All times are GMT -8.