
**GenoMax** · 05-11-2015, 09:50 AM

If you run "blastx -help" on the command line you will get all of the options for blastx. One of the sections covers output formats. The default format is "0" (pairwise).

Code:

 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1) 
    12 = JSON Seqalign output 
   
   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
            qseqid means Query Seq-id
               qgi means Query GI
              qacc means Query accesion
           qaccver means Query accesion.version
              qlen means Query sequence length
            sseqid means Subject Seq-id
         sallseqid means All subject Seq-id(s), separated by a ';'
               sgi means Subject GI
            sallgi means All subject GIs
              sacc means Subject accession
           saccver means Subject accession.version
           sallacc means All subject accessions
              slen means Subject sequence length
            qstart means Start of alignment in query
              qend means End of alignment in query
            sstart means Start of alignment in subject
              send means End of alignment in subject
              qseq means Aligned part of query sequence
              sseq means Aligned part of subject sequence
            evalue means Expect value
          bitscore means Bit score
             score means Raw score
            length means Alignment length
            pident means Percentage of identical matches
            nident means Number of identical matches
          mismatch means Number of mismatches
          positive means Number of positive-scoring matches
           gapopen means Number of gap openings
              gaps means Total number of gaps
              ppos means Percentage of positive-scoring matches
            frames means Query and subject frames separated by a '/'
            qframe means Query frame
            sframe means Subject frame
              btop means Blast traceback operations (BTOP)
           staxids means unique Subject Taxonomy ID(s), separated by a ';'
                         (in numerical order)
         sscinames means unique Subject Scientific Name(s), separated by a ';'
         scomnames means unique Subject Common Name(s), separated by a ';'
        sblastnames means unique Subject Blast Name(s), separated by a ';'
                         (in alphabetical order)
        sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                         (in alphabetical order) 
            stitle means Subject Title
        salltitles means All Subject Title(s), separated by a '<>'
           sstrand means Subject Strand
             qcovs means Query Coverage Per Subject
           qcovhsp means Query Coverage Per HSP
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'

When you use a multi-fasta input file, each query sequence produces its own output block, which starts with a "Query=" line and ends with the "Effective search space…" line.
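If you do end up having to work with the pairwise output, the "Query=" marker described above is enough to cut a report into per-query blocks. The following is only a minimal sketch: `split_pairwise_report` is a made-up helper name and the sample report text is invented for illustration. Real reports contain more header and trailer text than this.

```python
def split_pairwise_report(text):
    """Split a pairwise (-outfmt 0) report into one block per query.

    Assumes each block starts at a line beginning with "Query=".
    Anything before the first "Query=" line (program header) is dropped.
    """
    blocks = []
    current = None
    for line in text.splitlines():
        if line.startswith("Query="):
            # A new query starts: store the previous block, if any.
            if current is not None:
                blocks.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        blocks.append("\n".join(current))
    return blocks


# Invented, heavily shortened stand-in for a real report.
report = """BLASTX header text
Query= seq1
some alignment text
Effective search space used: 100
Query= seq2
***** No hits found *****
Effective search space used: 200
"""

blocks = split_pairwise_report(report)
print(len(blocks))
```

Each element of `blocks` can then be handed to whatever per-query parsing you need.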

**kmcarr** · 05-11-2015, 12:07 PM

Originally posted by hlyates

Fantastic. Many salutations and thanks. Can you please point me to the NCBI docs which talk about this? Dr. Google didn't report a hit back for me. Would like to learn more about the formatting.

It was so long ago that I don't recall whether I ever read any formal documentation on the Pairwise (default, -outfmt 0) format or just figured it out through trial and error.

As for parsing these plain-text reports, I have in the past used both BioPerl's Bio::SearchIO and hand-rolled code; there are pluses and minuses to each. As a general recommendation, though, if you foresee needing to parse BLAST reports regularly, I would avoid the default "Pairwise" output format altogether. That format wasn't really designed for automated parsing, so parsing code breaks easily. If you plan to do a lot of parsing, the two better choices are tabular and XML.

If your needs are simple (e.g. hit IDs, e-values, start and end locations), tabular is the way to go: the output is small and very simple to parse. If you have more complex needs (e.g. capturing query/target alignments), have your BLAST job output XML and parse it using the available modules from BioPerl, BioPython, etc. Because XML is such a structured format, automated parsing is more robust. Also, since the XML format retains all of the information present in the Pairwise format, it is possible to convert the interesting bits of the XML output into human-readable form (again using the 'Bio' modules).
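To show how little code the tabular route needs, here is a rough sketch that reads default `-outfmt 6` output using only the Python standard library. The column names are the 'std' defaults listed in the help text above. The helper name `parse_tabular`, the two sample rows, and the 1e-5 cutoff are invented for illustration.

```python
import csv
import io

# Default column order for -outfmt 6 / 7 / 10 (the 'std' keyword).
STD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_tabular(handle, columns=STD_COLUMNS):
    """Yield one dict per hit from tab-separated BLAST output.

    Blank lines and '#' comment lines (as written by -outfmt 7) are skipped.
    All values are returned as strings.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(columns, row))


# Invented sample rows standing in for a real results file.
sample = (
    "q1\tsubjA\t98.50\t200\t3\t0\t1\t600\t5\t204\t1e-50\t350\n"
    "q1\tsubjB\t45.00\t80\t44\t2\t10\t249\t1\t80\t0.002\t40.1\n"
)

# Keep only hits at or below an (arbitrary) e-value cutoff.
hits = [h for h in parse_tabular(io.StringIO(sample)) if float(h["evalue"]) <= 1e-5]
for hit in hits:
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

For a real file, replace `io.StringIO(sample)` with an open file handle.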

**hlyates** · 05-11-2015, 12:13 PM

Thank you GenoMax and kmcarr. I think I may run my job again with tabular or XML output so that I can more easily apply a script to it. If I understand you both, it seems Biopython can parse tabular output quite easily. I might go that route because I just need basic information such as:

hit id
e-value
input sequence (the input sequence that aligned with something in the nt database)
target sequence organism id and name

You pros are great and if I knew you in person I would buy you both some drinks.
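For reference, the fields listed above map onto format specifiers from the help text: `qseqid`, `sseqid`, `evalue`, `staxids` and `sscinames`. Below is a minimal sketch of assembling such a custom tabular command. The helper name, the file names, the e-value and the choice of blastn against nt are assumptions for illustration only. Note also that the taxonomy columns are only filled in when the database carries taxonomy information.

```python
# Specifiers taken from the -outfmt help text; this selection is one
# possible mapping of "hit id, e-value, input sequence, organism id and name".
FIELDS = ["qseqid", "sseqid", "evalue", "staxids", "sscinames"]


def build_blast_command(program, query, db, out, fields=FIELDS, evalue="1e-5"):
    """Return an argument list for a BLAST+ search with custom tabular output.

    The custom format is "6" followed by space-delimited specifiers,
    passed as a single argument to -outfmt.
    """
    outfmt = "6 " + " ".join(fields)
    return [
        program,
        "-query", query,
        "-db", db,
        "-evalue", evalue,
        "-outfmt", outfmt,
        "-out", out,
    ]


# Hypothetical file names; adjust to your own data.
cmd = build_blast_command("blastn", "reads.fasta", "nt", "hits.tsv")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and the paths are real.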

**GenoMax** · 05-11-2015, 01:50 PM

Provided you have access to a cluster, and if you are going to do this over, I suggest you break up your original file into multiple smaller ones and run the blast jobs in parallel. It would speed things up significantly.
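Splitting the input can be done in a few lines. This is just a sketch: the function names and the tiny example sequences are made up, and records are dealt out round-robin so the chunks end up with roughly equal numbers of sequences (not necessarily equal total length).

```python
def read_fasta_records(lines):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:
        yield "".join(record)


def split_fasta(lines, n_chunks):
    """Distribute FASTA records round-robin into n_chunks strings."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(read_fasta_records(lines)):
        chunks[i % n_chunks].append(record)
    return ["".join(chunk) for chunk in chunks]


# Invented three-sequence example.
fasta = ">s1\nACGT\n>s2\nGGCC\nTTAA\n>s3\nAAAA\n"
chunks = split_fasta(fasta.splitlines(keepends=True), 2)
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---")
    print(chunk, end="")
```

Each chunk would then be written to its own file and submitted as a separate blast job, with the tabular outputs concatenated afterwards.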

**hlyates** · 05-12-2015, 08:52 AM

Originally posted by GenoMax

Order of operations:

1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
2. Make blast database using the fasta file.
3. Run blastx with your sequences using the parameters you want (e-value cutoff etc.). I would just choose the tabular output format since you can grab the sequence IDs that show a hit from this table.
4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.
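Steps 3 and 4 can also be sketched in plain Python if faFilter is not at hand. This is not how faFilter itself works, just a stand-in under stated assumptions: the tabular output has the query ID in the first column, FASTA IDs are the first word after ">", and the helper names and sample data are invented.

```python
def query_ids_with_hits(tabular_lines):
    """Collect query IDs (first column) from tabular BLAST output lines."""
    ids = set()
    for line in tabular_lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t", 1)[0])
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Return only the FASTA records whose ID is in keep_ids."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = seq_id in keep_ids
        if keep:
            kept.append(line)
    return "".join(kept)


# Invented sample data: q1 and q3 have hits, q2 does not.
tabular = [
    "q1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-40\t200\n",
    "q3\tprotB\t85.0\t50\t7\t0\t4\t153\t1\t50\t1e-12\t80\n",
]
fasta = ">q1 first read\nACGT\n>q2\nGGGG\n>q3\nTTTT\nCCCC\n"

hit_ids = query_ids_with_hits(tabular)
subset = filter_fasta(fasta.splitlines(keepends=True), hit_ids)
print(subset, end="")
```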

Since I am downloading a fasta file of proteins, when I make the database should the -dbtype option be:

1. nucl? (I think this is nucleotide)
2. prot (this is for proteins and hence what I should choose)?

I feel stupid for not having thought of that question in advance. Only ran into it as I actually started reading the

Code:

makeblastdb -help

docs.
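For what it's worth, a FASTA file of protein sequences goes with `-dbtype prot`, which is also what blastx needs, since it translates the nucleotide queries and searches them against a protein database. A small sketch of assembling the command follows. The helper name, the file name and the database name are placeholders, not anything from this thread.

```python
def build_makeblastdb_command(fasta_path, db_name, dbtype="prot"):
    """Return an argument list for makeblastdb.

    dbtype must be "prot" for protein FASTA or "nucl" for nucleotide FASTA.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


# Placeholder names for the downloaded protein file and the new database.
db_cmd = build_makeblastdb_command("drosophila_proteins.fasta", "drosophila_prot")
print(" ".join(db_cmd))
```

The resulting database name is what you would then pass to blastx via `-db`.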

**hlyates** · 05-12-2015, 09:19 AM

Thank you so much for your patience while I learn. As I indicated, I am a one dog show, so this is the only true outlet I have to learn. I am very humbled by your assistance.
