SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Simple scaffolding problem jmartin Bioinformatics 3 01-09-2014 04:32 AM
Simple stats question jvanleuven Bioinformatics 4 05-14-2012 06:53 AM
Simple Bowtie question nilmot13 Bioinformatics 5 03-07-2012 02:03 AM
simple questions rdu Bioinformatics 2 01-06-2011 11:20 AM

Reply
 
Thread Tools
Old 02-07-2014, 11:03 AM   #1
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default looking for a simple script to pull a subset of contigs from an assembly

i'm sure this is a simple enough task but i'm just an end user no scripting experience at all.

looking for a script to pull contigs listed in a .txt file from assembly.fa and output the results to a new .fa file

any help would be much appreciated. thanks.
lac302 is offline   Reply With Quote
Old 02-07-2014, 11:25 AM   #2
Monika_bioinf
Junior Member
 
Location: Pennsylvania, US

Join Date: Sep 2011
Posts: 7
Default

Use seqtk subseq, https://github.com/lh3/seqtk
Monika_bioinf is offline   Reply With Quote
Old 02-07-2014, 11:34 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,075
Default

Just in case subseq does not work with fasta files here are other alternatives:

http://www.biostars.org/p/2822/

A perl one liner:

http://edwards.sdsu.edu/labsite/inde...m-a-fasta-file
GenoMax is offline   Reply With Quote
Old 02-07-2014, 11:44 AM   #4
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381
Default

Out of curiosity, GenoMax, how does second perl function handle the searching of the sample name file?
I wrote something similar under linux bash and it went bonkers with the RAM. I think the issue was that when ever it found an ID in the fasta file, it would then not remove this ID from the file containing the query IDs, and then start the search again from the start. Either way, my scripting abilities produced something that seemed unfeasible to execute on a large query ID file and large fasta file.

Cheers,

J
JackieBadger is offline   Reply With Quote
Old 02-07-2014, 11:54 AM   #5
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default

Thanks all!
lac302 is offline   Reply With Quote
Old 02-07-2014, 02:47 PM   #6
JohnN
Member
 
Location: Toronto

Join Date: Jan 2011
Posts: 30
Default

I wrote this a few years back.

https://code.google.com/p/nash-bioin...ta.pl&can=2&q=
JohnN is offline   Reply With Quote
Old 02-07-2014, 06:31 PM   #7
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

awk 'BEGIN{while((getline x<ARGV[1])>0){a[i++]=x;}while((getline y<ARGV[2])>0){if(substr(y,0,1)==">"){m=0;for(j=0;j<i;j++){if(y==a[j])m=1;}}if(m==1)print y;}}' $1 $2


$1 is match file
$2 is fasta file
Richard Finney is offline   Reply With Quote
Old 02-08-2014, 04:55 PM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,075
Default

Quote:
Originally Posted by JackieBadger View Post
Out of curiosity, GenoMax, how does second perl function handle the searching of the sample name file?
I wrote something similar under linux bash and it went bonkers with the RAM. I think the issue was that when ever it found an ID in the fasta file, it would then not remove this ID from the file containing the query IDs, and then start the search again from the start. Either way, my scripting abilities produced something that seemed unfeasible to execute on a large query ID file and large fasta file.

Cheers,

J
@JackieBadger: Second perl function is using -n and -e switches. -n wraps a while loop around the program while -p feeds the program value of $_ each time.

A nice example that illustrates this (equivalent to unix 'cat' command)

Code:
$ perl -ne 'print $_' filename
or
Code:
$ perl -ne 'print' filename

Last edited by GenoMax; 02-08-2014 at 05:05 PM.
GenoMax is offline   Reply With Quote
Old 02-09-2014, 10:48 AM   #9
Birdman
Member
 
Location: Montreal

Join Date: Jan 2014
Posts: 21
Default

This little BioPython script will nicely do the job:

Code:
from Bio import SeqIO
import sys

#Usage: filter_fasta_per_ids.py input.fasta filter_ids.txt output.fasta

input_file =sys.argv[1]
id_file =sys.argv[2]
output_file =sys.argv[3]
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))
Birdman is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:55 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO