**GenoMax** · 04-24-2017, 08:08 AM

One way would be to extract the full read headers from the sequence file using your IDs.

Code:

# Read one ID per line; quoting keeps IDs with spaces or special characters intact.
while read -r i; do grep -i -- "$i" sequence.fa >> ID_in_sequence_file; done < ./id_file

Then use one of the methods you have found, or the faSomeRecords utility from Jim Kent, to extract the sequences.
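
If you would rather do both steps in one pass, here is a minimal awk sketch. It is not faSomeRecords and not the grep loop above, just an alternative under some assumptions: the ID file has one ID per line, every FASTA header starts with `>`, and matching is a case-insensitive substring test against the whole header line. The function name `extract_by_substring` is made up for this example.

```shell
# Hypothetical helper: print every FASTA record whose header contains
# any line of the ID file as a case-insensitive substring.
# Usage: extract_by_substring id_file sequence.fa > subset.fa
extract_by_substring() {
    ids_file=$1
    fasta_file=$2
    awk -v ids="$ids_file" '
        # Load the IDs once, lowercased, skipping empty lines.
        BEGIN {
            while ((getline line < ids) > 0)
                if (line != "") want[tolower(line)] = 1
        }
        # On each header, decide whether the whole record is wanted.
        /^>/ {
            keep = 0
            header = tolower($0)
            for (id in want)
                if (index(header, id) > 0) { keep = 1; break }
        }
        # Print the header and all sequence lines of wanted records.
        keep { print }
    ' "$fasta_file"
}
```

Because the test only looks at header lines, an ID that happens to look like sequence (for example "ACG") will not pull in records just because it appears in the bases.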

**Brian Bushnell** · 04-24-2017, 10:13 AM

You can also use BBMap's filterbyname.sh tool, particularly if you have a long list of names:

Code:

filterbyname.sh in=file.fa out=filtered.fa include=t names=names.txt substring

The "substring" flag allows partial matches.

**SeqTroubles** · 04-25-2017, 12:33 AM

Thanks so much Brian, this is very straightforward. Just a quick question about the tool. My gene name list contains some ambiguous names such as group_XXXX, as it is an output of roary. Would setting substring=t or substring=names cause it to partially match fasta headers from prokka output via the locus tag? If so, is there a way to prevent this? I have been using the following command:

Code:

filterbyname.sh in=seqs.ffn out=test.fasta include=t names=list.txt substring=name casesensitive=f

There are some sequences in my output which I feel should not be present, although I suspect the casesensitive flag is the most likely culprit.
Thanks.

**Brian Bushnell** · 04-25-2017, 12:04 PM

"substring=names" will consider a sequence to be a match if the sequence name contains any line in list.txt as a substring; and in this case, it's ignoring case. I suggest not ignoring case unless it's essential. Note that if you have any really short names in your file, like "A", it might match just about everything...
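
To make the short-name problem concrete: with substring matching, a name like group_1 also hits group_10 and group_100. If that is what is polluting the output, one way around it is whole-name matching. Below is a minimal awk sketch of that idea. It does not describe how filterbyname.sh works internally; it assumes the name you want to match is the first whitespace-delimited token of the header (right after `>`), which may not hold for your files, and the comparison is case-sensitive. The function name `extract_by_exact_id` is made up for this example.

```shell
# Hypothetical helper: keep only records whose first header token
# (the text between ">" and the first whitespace) equals a listed name exactly.
# Usage: extract_by_exact_id list.txt seqs.ffn > exact.fasta
extract_by_exact_id() {
    ids_file=$1
    fasta_file=$2
    awk -v ids="$ids_file" '
        # Load the names once, as-is (case-sensitive), skipping empty lines.
        BEGIN {
            while ((getline line < ids) > 0)
                if (line != "") want[line] = 1
        }
        # On each header, compare the first token (minus ">") to the list.
        /^>/ {
            name = substr($1, 2)
            keep = (name in want)
        }
        # Print the header and all sequence lines of wanted records.
        keep { print }
    ' "$fasta_file"
}
```

With a list containing only group_1, this keeps the group_1 record and skips group_10 and group_100, whereas a substring match would return all three.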

extract fasta sequences from multifasta file using partial or gene names