
**Monika_bioinf** · 02-07-2014, 12:25 PM

Use seqtk subseq, https://github.com/lh3/seqtk

**GenoMax** · 02-07-2014, 12:34 PM

Just in case subseq does not work with fasta files, here are some alternatives:

Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences

http://www.biostars.org/p/2822/

A perl one liner:

http://edwards.sdsu.edu/labsite/index.php/robert/381-perl-one-liner-to-extract-sequences-by-their-identifer-from-a-fasta-file

**JackieBadger** · 02-07-2014, 12:44 PM

Out of curiosity, GenoMax, how does the second Perl one-liner handle searching the sample name file?
I wrote something similar in bash on Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the fasta file, it did not remove that ID from the file containing the query IDs, and then restarted the search from the beginning. Either way, my scripting abilities produced something that seemed infeasible to run on a large query ID file and a large fasta file.

Cheers,

J
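
A single pass avoids the rescanning described above: load the query IDs into a set once, then stream the fasta file and decide at each header line whether to keep the record. The following is a minimal Python sketch under stated assumptions, not the Perl one-liner itself. The function name `filter_fasta` is hypothetical, and it assumes a record's ID is the first whitespace-delimited token after `>`.

```python
def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every record whose ID is in wanted_ids (one pass)."""
    wanted = set(wanted_ids)  # set membership is a constant-time lookup
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumed convention: ID = first token after ">"
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


# Tiny in-memory demo; a real run would pass an open file handle instead.
demo = [">c1 len=4\n", "ACGT\n", ">c2\n", "GGGG\n", "CCCC\n"]
for kept_line in filter_fasta(demo, ["c2"]):
    print(kept_line, end="")
```

Only the ID set is held in memory; the fasta file itself is read line by line, so the size of the assembly should not matter much.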

**lac302** · 02-07-2014, 12:54 PM

Thanks all!

**JohnN** · 02-07-2014, 03:47 PM

I wrote this a few years back.


https://code.google.com/p/nash-bioinformatics-codelets/downloads/detail?name=subset_fasta.pl&can=2&q=

**Richard Finney** · 02-07-2014, 07:31 PM

awk 'BEGIN{while((getline x<ARGV[1])>0){a[i++]=x;}while((getline y<ARGV[2])>0){if(substr(y,1,1)==">"){m=0;for(j=0;j<i;j++){if(y==a[j])m=1;}}if(m==1)print y;}}' "$1" "$2"

$1 is the match file (one full header line per entry, including the leading ">")
$2 is the fasta file

**GenoMax** · 02-08-2014, 05:55 PM

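@JackieBadger: The second Perl one-liner uses the -n and -e switches. -e supplies the program on the command line, and -n wraps an implicit while (<>) { ... } loop around it, so each input line is placed in $_ in turn. (The related -p switch does the same and also prints $_ after each iteration.)

A nice example that illustrates this (equivalent to the Unix 'cat' command):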

Code:

$ perl -ne 'print $_' filename

or

Code:

$ perl -ne 'print' filename
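
For readers more at home in Python, the implicit loop that -n adds can be spelled out by hand. This is a rough analogue rather than an exact equivalent of Perl's behaviour, and `cat_lines` is a hypothetical helper name.

```python
import fileinput


def cat_lines(paths):
    """Yield every line of the named files in order, like perl -ne 'print'."""
    # fileinput.input() plays the role of Perl's implicit while (<>) loop:
    # it iterates over the lines of each named file in turn.
    with fileinput.input(files=paths) as stream:
        for line in stream:  # 'line' corresponds to Perl's $_
            yield line


# Hypothetical usage:
#   import sys
#   sys.stdout.writelines(cat_lines(["filename"]))
```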

**Birdman** · 02-09-2014, 11:48 AM

This little BioPython script will nicely do the job:

Code:

from Bio import SeqIO
import sys

# Usage: filter_fasta_per_ids.py input.fasta filter_ids.txt output.fasta

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

# Collect the wanted IDs (first whitespace-delimited token per line), skipping blank lines
with open(id_file) as handle:
    wanted = set(line.split(None, 1)[0] for line in handle if line.strip())
print("Found %i unique identifiers in %s" % (len(wanted), id_file))

# Generator expression: records are streamed to the output, not held in memory
records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))
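
The warning above reports only how many IDs were missed, and the count comparison can be thrown off if the fasta file contains the same ID more than once. Here is a small follow-up sketch that lists the missing IDs themselves. It reads header lines directly rather than going through Biopython, and assumes (as `r.id` does for fasta input) that the ID is the first whitespace-delimited token after `>`. The name `missing_ids` is hypothetical.

```python
def missing_ids(fasta_lines, wanted_ids):
    """Return the wanted IDs that never appear as a fasta header, sorted."""
    remaining = set(wanted_ids)
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # discard() is a no-op for IDs that are unwanted or repeated
                remaining.discard(fields[0])
    return sorted(remaining)


# contig1 appears twice, contig7 never does
example = [">contig1 len=10\n", "ACGT\n", ">contig1\n", "ACGT\n"]
print(missing_ids(example, {"contig1", "contig7"}))  # prints ['contig7']
```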

looking for a simple script to pull a subset of contigs from an assembly

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News