SEQanswers

extract fasta sequences from multifasta file using partial or gene names

04-24-2017, 05:55 AM   #1
SeqTroubles

Hi All,

I am working with PROKKA v1.12 files. I have a list of gene names such as:

sacX
arcB
metB
sprT
adrB_2
fadD

and my FASTA file looks like this:

>BOKHJPML_00001 hypothetical protein
ATGC
>BOKHJPML_00002 hypothetical protein
ATGC
>BOKHJPML_00003 Protease HtpX
ATGC
>BOKHJPML_00006 ATP-dependent Clp protease ATP-binding subunit ClpC
ATGC
>BOKHJPML_00016 Inner membrane protein YfdC
ATGC

I want to extract the FASTA sequences for the genes in the list. I have tried following previous suggestions using faidx (https://www.biostars.org/p/126204/) and Biopython (https://www.biostars.org/p/2822/), with no success. The faidx example is the closest I have come to success, but I get a string of warnings:

warning: sacX not found in file
warning: arcB not found in file
warning: metB not found in file
warning: sprT not found in file
warning: adrB_2 not found in file
warning: fadD not found in file

Thanks in advance

04-24-2017, 09:08 AM   #2
GenoMax

One way would be to extract the full sequence headers from the sequence file using your IDs.
Code:
while read -r id; do [ -n "$id" ] && grep -i -- "$id" sequence.fa >> ID_in_sequence_file; done < ./id_file
Then use one of the methods you have found, or the faSomeRecords utility from Jim Kent, to extract the sequences.
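
As a rough sketch of that two-step idea (not something posted in the thread), the helper below collects the headers that mention any name from the list and reduces them to bare sequence IDs. The function name matching_ids is hypothetical, and the sketch assumes the gene names appear somewhere in the FASTA header lines.

```shell
# Hypothetical helper: print the IDs of records whose header mentions any
# name from the list. $1 = names file (one per line), $2 = FASTA file.
matching_ids() {
    grep -v '^[[:space:]]*$' "$1" |   # drop blank lines, which would match every line
        grep -i -F -f - "$2" |        # case-insensitive, fixed-string match on any name
        grep '^>' |                   # keep header lines only
        sed 's/^>//' |                # strip the leading ">"
        awk '{ print $1 }' |          # first word = sequence ID (the locus tag here)
        sort -u
}
```

For example, `matching_ids id_file sequence.fa > ids.txt`. The output holds bare IDs such as BOKHJPML_00003, which is what ID-based extractors like samtools faidx look up. faidx matches on the ID before the first space in a header, not on the description, which would explain the "not found" warnings for gene names.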

04-24-2017, 11:13 AM   #3
Brian Bushnell

You can also use BBMap's filterbyname.sh tool, particularly if you have a long list of names:

Code:
filterbyname.sh in=file.fa out=filtered.fa include=t names=names.txt substring
The "substring" flag allows partial matches.
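
To make the effect of substring matching concrete, here is a small awk sketch of the same idea: keep a record if its header contains any listed name as a substring. This is only an illustration under that assumption, not BBMap's implementation, and filter_by_substring is a made-up name.

```shell
# Hypothetical helper: print every FASTA record whose header line contains
# any listed name as a case-sensitive substring.
# $1 = names file (one per line), $2 = FASTA file.
filter_by_substring() {
    awk '
        FILENAME == ARGV[1] {            # names file: remember each non-empty line
            if ($0 != "") names[$0] = 1
            next
        }
        /^>/ {                           # header: decide whether to keep this record
            keep = 0
            for (n in names)
                if (index($0, n) > 0) { keep = 1; break }
        }
        keep                             # print header and sequence lines of kept records
    ' "$1" "$2"
}
```

Called as `filter_by_substring names.txt file.fa > filtered.fa`, this keeps multi-line sequences intact because the keep flag only changes at header lines.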

04-25-2017, 01:33 AM   #4
SeqTroubles

Thanks so much, Brian, this is very straightforward. Just a quick question about the tool. My gene name list contains some ambiguous names such as group_XXXX, as it is an output of Roary. Would setting substring=t or substring=name cause it to partially match FASTA headers from the PROKKA output via the locus tag? If so, is there a way to prevent this? I have been using the following command:

Code:
filterbyname.sh in=seqs.ffn out=test.fasta include=t names=list.txt substring=name casesensitive=f
There are some sequences in my output that I feel should not be present, although I suspect the casesensitive flag is the most likely culprit.
Thanks.

04-25-2017, 01:04 PM   #5
Brian Bushnell

"substring=name" will consider a sequence to be a match if the sequence name contains any line in list.txt as a substring; in this case, it is also ignoring case. I suggest not ignoring case unless it's essential. Note that if you have any really short names in your file, like "A", they might match just about everything.
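
If partial hits on the locus tag or on longer names (for example a hypothetical group_10 matching group_1001) are the concern, one alternative is to require an exact, case-sensitive match against whole words of the header description and to skip the first word (the locus tag). The sketch below assumes each gene name appears as its own whitespace-separated word in the header; filter_by_word is a hypothetical helper, not part of BBMap.

```shell
# Hypothetical helper: print every FASTA record whose header description
# contains a listed name as a whole, case-sensitive word.
# $1 = names file (one per line), $2 = FASTA file.
filter_by_word() {
    awk '
        FILENAME == ARGV[1] {            # names file: one name per line
            sub(/\r$/, "")               # tolerate Windows line endings
            if (NF) names[$1] = 1
            next
        }
        /^>/ {                           # header: decide whether to keep this record
            keep = 0
            for (i = 2; i <= NF; i++)    # start at 2 to skip the locus tag
                if ($i in names) { keep = 1; break }
        }
        keep                             # print header and sequence lines of kept records
    ' "$1" "$2"
}
```

With this stricter rule a short name such as "A" only matches a header that contains the standalone word "A", and a name that occurs only inside the locus tag is ignored.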