
**maubp** · 01-11-2010, 04:02 AM

Are you starting with a ~400MB FASTA file, containing ~1000 sequences, and you just want the list of sequence identifiers ("gene names")?

Try something like this at the Unix command line:

grep "^>" my_database.fasta

The string "^>" is a regular expression that matches any line starting ("^") with the greater-than symbol (">"), which is how every header line in a FASTA file begins.
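If you would rather do the same thing from a script, here is a minimal Python sketch of what the grep command does. The function name `list_fasta_headers` is made up for illustration, and it assumes a plain-text (uncompressed) FASTA file.

```python
def list_fasta_headers(path):
    """Return every FASTA header line, without the leading '>'."""
    headers = []
    with open(path) as handle:
        # Read line by line so a large file is never held in memory at once.
        for line in handle:
            if line.startswith(">"):  # same test as the "^>" regular expression
                headers.append(line[1:].rstrip("\r\n"))
    return headers
```

Calling `list_fasta_headers("my_database.fasta")` would give one string per sequence in the file.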

**johnsequence** · 01-11-2010, 01:15 PM

Thanks for the reply. I actually need the IDs (headers) and sequences in FASTA format.

**maubp** · 01-12-2010, 02:49 AM

Originally posted by johnsequence:

Thanks for the reply. I actually need the IDs (headers) and sequences in FASTA format.

So you have a large FASTA file, and a list of entries you want to extract from it, to give a smaller subset as a new FASTA file? This is a very general problem, and not specific to MAQ at all.

How is your list of identifiers stored? e.g. a text file with one id per line?

I would suggest you write a simple script, e.g. using Perl (perhaps with BioPerl) or Python (perhaps with Biopython), or your preferred script language.

http://bioperl.org/wiki/HOWTO:SeqIO

Introduction to SeqIO · Biopython

http://biopython.org/wiki/SeqIO
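As a rough idea of what such a script could look like, here is a sketch in plain Python with no BioPerl or Biopython dependency (the SeqIO libraries linked above do the parsing for you and are more robust). It assumes the wanted identifiers are stored one per line in a text file, and that a record's identifier is the first word after the ">" on its header line. The function names `read_ids` and `extract_fasta_subset` are made up for illustration.

```python
def read_ids(id_path):
    """Read the wanted identifiers, one per line, skipping blank lines."""
    with open(id_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def extract_fasta_subset(fasta_path, id_path, out_path):
    """Copy the records named in id_path from fasta_path into out_path.

    Assumes the record ID is the first whitespace-delimited word after '>'.
    Returns the number of records written.
    """
    wanted = read_ids(id_path)
    written = 0
    keep = False  # True while we are inside a record we want to keep
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                keep = bool(fields) and fields[0] in wanted
                if keep:
                    written += 1
            if keep:
                dst.write(line)  # header and sequence lines are copied unchanged
    return written
```

Because the input is streamed line by line, a 400MB file should not be a problem. Comparing the returned count against the number of IDs you asked for is a quick check that none were missed.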

Or, if you are happier just working at the command line, you can probably do this with EMBOSS seqret.

EMBOSS: seqret

http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/seqret.html

**kmcarr** · 01-12-2010, 05:55 AM

You can use a couple of the utilities in the BLAST package from NCBI. Take your large FASTA file and create a BLAST database from it using formatdb. Then retrieve just the sequences you want from the BLASTdb using the fastacmd tool.

Code:

%> formatdb -i <your.FASTA.file> -p F -o T -n <your.blast.db>

%> fastacmd -d <your.blast.db> -i <your.ID.file> > <output.file>

The first command takes your FASTA file and creates your.blast.db. The -p F option tells formatdb that this is a nucleotide database, and -o T tells it to parse the sequence IDs and build the index that fastacmd needs in order to look entries up by ID. The second command extracts the FASTA-formatted sequences from the BLAST db for the IDs listed in your.ID.file (one ID per line).
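If you want to drive these two steps from a script, a sketch along these lines might do. It assumes the legacy NCBI formatdb and fastacmd programs are installed and on your PATH. The function names and the `extra_formatdb_args` parameter are made up for illustration; only the command-line options themselves come from the commands above.

```python
import subprocess


def build_commands(fasta_path, db_name, id_path, extra_formatdb_args=()):
    """Build the formatdb and fastacmd argument lists without running them."""
    formatdb = ["formatdb", "-i", fasta_path, "-p", "F", "-n", db_name,
                *extra_formatdb_args]
    fastacmd = ["fastacmd", "-d", db_name, "-i", id_path]
    return formatdb, fastacmd


def run_extraction(fasta_path, db_name, id_path, out_path,
                   extra_formatdb_args=()):
    """Create the BLAST database, then write the selected records to out_path."""
    formatdb, fastacmd = build_commands(
        fasta_path, db_name, id_path, extra_formatdb_args)
    subprocess.run(formatdb, check=True)  # raises if formatdb fails
    with open(out_path, "w") as out:
        # Plays the role of the shell's "> <output.file>" redirection.
        subprocess.run(fastacmd, stdout=out, check=True)
```

Any further formatdb options can be passed through `extra_formatdb_args`, for example `("-o", "T")` to have formatdb parse and index the sequence IDs.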


EXTRACT MAQ sequences?
