Usage:
blastprotein sequence
[ database sequence-database ]
[ version 1 | 2 | 3 ]
[ matrix similarity-matrix ]
[ cutoff evalue ]
[ maxSeqs M ]
[ log true | false ]
[ name N ]
[ showResultsTable true | false ]
[ loadStructures true | false ]
[ showSequenceAlignment true | false ]
[ onlyBest true | false ]
The blastprotein command runs a protein sequence similarity search
using a BLAST web service
hosted by the
UCSF
Resource for Biocomputing, Visualization, and Informatics (RBVI).
It is the command implementation of the
Blast Protein tool.
One use is to search with a target sequence of unknown structure
to find templates for comparative modeling.
The related tool Foldseek
(Similar Structures) can also search with BLAST and other methods,
but only using a structure chain as the query; it facilitates
exploring large sets of similar structures by efficiently
showing them in 3D as backbone traces and in 2D as sequence alignment
schematics or scatter plots based on conformation.
See also:
similarstructures blast,
similarstructures fromblast,
alphafold search,
esmfold search
The query sequence can be given as any of the following:
The protein sequence-database to search can be:
- pdb (default)
– experimentally determined structures in the
Protein Data Bank (PDB)
- nr
– NCBI “non-redundant” database containing GenBank CDS translations + PDB + SwissProt + PIR + PRF,
excluding environmental samples from whole-genome sequencing; this database
is much larger than pdb alone and takes much longer to search
- alphafold
– artificial-intelligence-predicted structures in the
AlphaFold Database
- the version option applies only to the alphafold database
- esmfold – artificial-intelligence-predicted structures in the
ESM
Metagenomic Atlas
- uniref100 –
UniProt Reference Cluster at 100% identity
(identical sequences and subfragments collapsed into a single entry)
- uniref90 – based on uniref100, but omitting
sequences shorter than 11 residues and clustering at 90% identity
- uniref50 – based on uniref100, but omitting
sequences shorter than 11 residues and clustering at 50% identity
The matrix option indicates which amino acid similarity-matrix
to use for alignment scoring (uppercase or lowercase can be used):
- BLOSUM45
- BLOSUM50
- BLOSUM62 (default)
- BLOSUM80
- BLOSUM90
- PAM30
- PAM70
- PAM250
- IDENTITY
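As a rough illustration of how the database and matrix options combine on the command line, the following Python sketch assembles a blastprotein command string and checks the two option values against the lists above. The helper name build_blastprotein_command is hypothetical and not part of ChimeraX, and the query "#1/A" in the usage note is an assumed chain specifier.

```python
# Illustrative helper only (not a ChimeraX API): builds a blastprotein
# command string from the database and matrix values listed above.
DATABASES = {"pdb", "nr", "alphafold", "esmfold",
             "uniref100", "uniref90", "uniref50"}
MATRICES = {"BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
            "PAM30", "PAM70", "PAM250", "IDENTITY"}


def build_blastprotein_command(query, database="pdb", matrix="BLOSUM62"):
    """Return a command string, rejecting values not in the documented lists."""
    if database not in DATABASES:
        raise ValueError(f"unknown database: {database!r}")
    # Matrix names may be given in uppercase or lowercase; normalize here.
    if matrix.upper() not in MATRICES:
        raise ValueError(f"unknown matrix: {matrix!r}")
    return f"blastprotein {query} database {database} matrix {matrix.upper()}"
```

For example, build_blastprotein_command("#1/A", database="nr", matrix="pam250") yields a command that searches nr with the PAM250 matrix; the resulting string would then be entered on the ChimeraX command line.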
The cutoff evalue is the maximum (least significant)
E-value
that a hit may have and still qualify (default 1e-3).
Results can also be limited with the maxSeqs option
(default 100), which is the
maximum number of unique sequences to return. More hits than this number
may be obtained because multiple structures or other sequence-database entries
may have the same sequence.
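The interplay of cutoff and maxSeqs can be sketched as follows. This is a minimal Python illustration of the behavior described above, not the web service's actual implementation; the function filter_hits and the (entry, sequence, E-value) tuple layout are assumptions made for the example, and the hits are assumed to arrive sorted best-first.

```python
def filter_hits(hits, cutoff=1e-3, max_seqs=100):
    """Keep hits with E-value <= cutoff, limited to max_seqs unique sequences.

    hits: list of (entry_id, sequence, evalue) tuples, assumed sorted best-first.
    Entries sharing an already accepted sequence are all kept, so the
    number of returned hits can exceed max_seqs.
    """
    accepted_sequences = set()
    result = []
    for entry_id, sequence, evalue in hits:
        if evalue > cutoff:
            continue  # less significant than the cutoff
        if sequence not in accepted_sequences:
            if len(accepted_sequences) >= max_seqs:
                continue  # unique-sequence limit already reached
            accepted_sequences.add(sequence)
        result.append((entry_id, sequence, evalue))
    return result
```

With max_seqs set to 1, two database entries that share the same sequence would both be returned, which is why the hit count may be larger than the maxSeqs value.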
The remaining options control what happens when the search completes:
- The showResultsTable option indicates whether the results should be
shown in an interactive table in a
separate window
(default true). In this table,
many columns of information can be shown and used to sort the hits:
alignment scores, structure resolution, ligand residue names, etc.
The name option allows supplying a name for a specific set of
Blast Protein results,
which may be useful when several sets of results are shown at the same time.
The name appears in the title bar of the results panel.
In the results panel, double-clicking a row with
an associated structure (from searching PDB, AlphaFold, ESMFold)
fetches the structure,
and if a structure chain was used as the query,
automatically superimposes the hit and query structures with
matchmaker.
If the query was sequence-only (not a structure chain), the first
structure opened from the results will serve as the reference for
superimposing the others.
AlphaFold-predicted structures are
colored by confidence 0-100.
ESMFold-predicted structures are
colored by confidence 0-1.
One or more hits can be chosen (highlighted) in the list, and
the panel's context menu
can then be used to fetch and superimpose all of the corresponding structures, or to
show their multiple
sequence alignment with the query
in the Sequence Viewer.
- The log option indicates whether to also list the results in the
Log (default false).
- The loadStructures option indicates whether, for each hit with an
associated structure (from searching PDB, AlphaFold, ESMFold),
the structure should be fetched and superimposed
as described above (default false).
Setting this option to true automatically opens a potentially large
number of structures. If onlyBest is false (default), this
could include multiple copies of the same PDB entry (e.g., a
homotetramer structure could be opened four times, each corresponding to
a match to one of the monomers).
If onlyBest is true, a given PDB entry will only be opened once
and matched on the chain with the best sequence alignment score.
To open only specific structures of interest, use the results table instead.
- The showSequenceAlignment option indicates whether to show the
multiple sequence alignment
of all hits with the query
in the Sequence Viewer
(default false). Similar to loading all structures, setting this option
to true may give an alignment with a very large number of sequences.
If onlyBest is false (default), this may include multiple chains
from the same PDB entry (e.g., the four chains of a homotetramer).
If onlyBest is true, only the chain with the best alignment
score will be included for a given PDB entry.
To include only specific sequences of interest, use the results table instead.
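The effect of onlyBest on hits from multi-chain PDB entries can be illustrated with a short Python sketch. It is only a model of the behavior described above, under the assumption that each hit is a (PDB ID, chain ID, alignment score) tuple and that a higher score is better; best_chain_per_entry is a hypothetical name, not a ChimeraX function.

```python
def best_chain_per_entry(hits):
    """Return one hit per PDB entry: the chain with the highest score.

    hits: list of (pdb_id, chain_id, score) tuples; higher score = better.
    On a tie, the chain encountered first is kept.
    """
    best = {}
    for pdb_id, chain_id, score in hits:
        if pdb_id not in best or score > best[pdb_id][2]:
            best[pdb_id] = (pdb_id, chain_id, score)
    # dict preserves insertion order, so entries come back in first-seen order
    return list(best.values())
```

For a homotetramer whose four chains all match the query, this reduces four hits to one, corresponding to onlyBest true; leaving the list unreduced corresponds to onlyBest false.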
UCSF Resource for Biocomputing, Visualization, and Informatics /
November 2024