**maubp** · 01-11-2010, 04:02 AM

Are you starting with a ~400MB FASTA file, containing ~1000 sequences, and you just want the list of sequence identifiers ("gene names")?

Try something like this at the Unix command line:

grep "^>" my_database.fasta

That string "^>" is a regular expression meaning "match any line that starts ("^") with the greater-than symbol (">")", which is how every FASTA header line begins.
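If you want just the identifier (the first word after the ">") rather than the whole header line, the same idea can be sketched in a few lines of Python. This is only an illustration: the function name `fasta_ids` and the sample records are made up for the example.

```python
import io


def fasta_ids(lines):
    """Yield the identifier (first whitespace-delimited word) of each FASTA header."""
    for line in lines:
        if line.startswith(">"):  # same test as the "^>" regular expression
            fields = line[1:].split()
            yield fields[0] if fields else ""


# Made-up sample standing in for a real FASTA file.
sample = io.StringIO(">gene1 first example\nACGT\n>gene2\nGGCC\n")
for seq_id in fasta_ids(sample):
    print(seq_id)
```

For a real file you would pass an open file handle instead of the in-memory sample, e.g. `fasta_ids(open("my_database.fasta"))`.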

**johnsequence** · 01-11-2010, 01:15 PM

Thanks for the reply. I actually need the IDs (headers) and sequences in FASTA format.

**maubp** · 01-12-2010, 02:49 AM

Originally posted by johnsequence:

> Thanks for the reply. I actually need the IDs (headers) and sequences in FASTA format.

So you have a large FASTA file, and a list of entries you want to extract from it, to give a smaller subset as a new FASTA file? This is a very general problem, and not specific to MAQ at all.

How is your list of identifiers stored? e.g. a text file with one id per line?

I would suggest you write a simple script, e.g. using Perl (perhaps with BioPerl) or Python (perhaps with Biopython), or your preferred scripting language.

http://bioperl.org/wiki/HOWTO:SeqIO

Introduction to SeqIO · Biopython

http://biopython.org/wiki/SeqIO

Or, if you are happier just working at the command line, you can probably do this with EMBOSS seqret.

EMBOSS: seqret

http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/seqret.html
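As a rough idea of what such a simple script could look like, here is a minimal sketch in plain Python (no Biopython needed). It assumes the wanted identifiers are stored one per line and that each one matches the first word of a FASTA header; the function names and the sample data are invented for the example.

```python
import io


def read_ids(handle):
    """Return the set of identifiers listed one per line, skipping blank lines."""
    return {line.strip() for line in handle if line.strip()}


def filter_fasta(fasta_lines, wanted):
    """Yield every line of each record whose identifier is in `wanted`."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # A header starts a new record; decide whether to keep it.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


# Made-up data standing in for the ID list and the large FASTA file.
ids = read_ids(io.StringIO("gene2\n\n"))
fasta = io.StringIO(">gene1 first\nACGT\n>gene2 second\nGGCC\nTTAA\n")
print("".join(filter_fasta(fasta, ids)), end="")
```

The file is read one line at a time, so a ~400MB FASTA file never has to fit in memory. With real files you would open the ID list and the FASTA file and write the yielded lines to a new output file.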

**kmcarr** · 01-12-2010, 05:55 AM

You can use a couple of the utilities in the BLAST package from NCBI. Take your large FASTA file and create a BLAST database from it using formatdb. Then retrieve just the sequences you want from the BLASTdb using the fastacmd tool.

Code:

%> formatdb -i <your.FASTA.file> -p F -o T -n <your.blast.db>

%> fastacmd -d <your.blast.db> -i <your.ID.file> > <output.file>

The first command takes your FASTA file and creates your.blast.db. The -p F tells formatdb that this is a nucleotide database, and -o T tells it to parse and index the sequence IDs, which fastacmd needs in order to look entries up by ID. The second command extracts the FASTA-formatted sequences from the BLAST db based on the list of IDs in your.ID.file.
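The underlying idea (index the file once, then jump straight to the entries you want) can also be illustrated in plain Python. To be clear, this is not how formatdb or fastacmd work internally; it is only a sketch of the same index-then-fetch pattern, and the names `build_index` and `fetch` and the sample data are invented for the example.

```python
import io


def build_index(handle):
    """Map each identifier to the byte offset of its header line."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode()] = offset
    return index


def fetch(handle, index, seq_id):
    """Return one record (header plus sequence lines) as text."""
    handle.seek(index[seq_id])
    record = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break  # end of file or start of the next record
        record.append(line)
    return b"".join(record).decode()


# Made-up data standing in for a FASTA file opened in binary mode.
demo = io.BytesIO(b">gene1 first\nACGT\nTTGA\n>gene2\nGGCC\n")
index = build_index(demo)
print(fetch(demo, index, "gene2"), end="")
```

This only pays off if you fetch from the same file repeatedly; for a one-off extraction a single pass over the file is simpler.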

EXTRACT MAQ sequences?
