Command: blastprotein

Usage:
blastprotein sequence [ database sequence-database ] [ version 1 | 2 | 3 ] [ matrix similarity-matrix ] [ cutoff evalue ] [ maxSeqs M ] [ log true | false ] [ name N ] [ showResultsTable true | false ] [ loadStructures true | false ] [ showSequenceAlignment true | false ] [ onlyBest true | false ]

The blastprotein command runs a protein sequence similarity search using a BLAST web service hosted by the UCSF Resource for Biocomputing, Visualization, and Informatics (RBVI). It is the command implementation of the Blast Protein tool. One use is to search with a target sequence of unknown structure to find templates for comparative modeling.

The related tool Foldseek (Similar Structures) can also search with BLAST and other methods, but only using a structure chain as the query; it facilitates exploring large sets of similar structures by efficiently showing them in 3D as backbone traces and in 2D as sequence alignment schematics or scatter plots based on conformation.

See also: similarstructures blast, similarstructures fromblast, alphafold search, esmfold search

The query sequence can be given as any of the following:

a chain-spec corresponding to a single chain in an atomic structure open in ChimeraX
the sequence-spec of a sequence in the Sequence Viewer
a UniProt name or accession number
plain text pasted directly into the command line

The protein sequence-database to search can be:

pdb (default) – experimentally determined structures in the Protein Data Bank (PDB)
nr – NCBI “non-redundant” database containing GenBank CDS translations + PDB + SwissProt + PIR + PRF excluding environmental samples from whole-genome sequencing; this database is much larger than pdb alone and takes much longer to search
alphafold – artificial-intelligence-predicted structures in the AlphaFold Database
- the version option applies only to the alphafold database
esmfold – artificial-intelligence-predicted structures in the ESM Metagenomic Atlas
uniref100 – UniProt Reference Cluster at 100% identity (identical sequences and subfragments collapsed into a single entry)
uniref90 – based on uniref100, but omitting sequences shorter than 11 residues and clustering at 90% identity
uniref50 – based on uniref100, but omitting sequences shorter than 11 residues and clustering at 50% identity

The matrix option indicates which amino acid similarity-matrix to use for alignment scoring (uppercase or lowercase can be used):

BLOSUM45
BLOSUM50
BLOSUM62 (default)
BLOSUM80
BLOSUM90
PAM30
PAM70
PAM250
IDENTITY
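
As a rough sketch of how the database, version, and matrix choices above fit together, the following Python helper assembles a blastprotein command string and rejects values that are not listed in this document. The function name build_blastprotein and its behavior are hypothetical and not part of ChimeraX. Treating database names as case-sensitive is an assumption; only matrix names are stated to accept either case.

```python
# Hypothetical helper (not part of ChimeraX): assembles a blastprotein
# command string from the option values listed in this document.
DATABASES = {"pdb", "nr", "alphafold", "esmfold",
             "uniref100", "uniref90", "uniref50"}
MATRICES = {"BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
            "PAM30", "PAM70", "PAM250", "IDENTITY"}


def build_blastprotein(sequence, database="pdb", matrix="BLOSUM62", version=None):
    """Return a blastprotein command string, rejecting unlisted values."""
    # Assumption: database names must be written exactly as listed.
    if database not in DATABASES:
        raise ValueError(f"unknown database: {database!r}")
    # Matrix names may be given in uppercase or lowercase.
    if matrix.upper() not in MATRICES:
        raise ValueError(f"unknown matrix: {matrix!r}")
    parts = ["blastprotein", sequence, "database", database, "matrix", matrix]
    if version is not None:
        # The version option applies only to the alphafold database.
        if database != "alphafold":
            raise ValueError("version applies only to the alphafold database")
        if version not in (1, 2, 3):
            raise ValueError(f"version must be 1, 2, or 3, not {version!r}")
        parts += ["version", str(version)]
    return " ".join(parts)


# Query with chain A of model #1 against the AlphaFold Database.
print(build_blastprotein("#1/A", database="alphafold", matrix="blosum80"))
```

The resulting string could then be typed or pasted into the ChimeraX command line.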

The cutoff evalue is the maximum (least significant) E-value that a match may have to qualify as a hit (default 1e-3). Results can also be limited with the maxSeqs option (default 100), which sets the maximum number of unique sequences to return. More hits than this number may be obtained because multiple structures or other sequence-database entries may have the same sequence.
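
The interplay between the E-value cutoff and the cap on unique sequences can be illustrated with a toy Python sketch. This is not the web service's actual algorithm: the function select_hits, the entry names, and the sequences are all hypothetical, and counting an E-value exactly equal to the cutoff as a hit is an assumption. It only shows why the number of hits can exceed maxSeqs when several entries share a sequence.

```python
# Toy illustration only: not the actual search implementation.
def select_hits(hits, cutoff=1e-3, max_seqs=100):
    """Keep hits with E-value <= cutoff for at most max_seqs unique sequences.

    hits: iterable of (entry_name, sequence, evalue) tuples.
    """
    # Discard hits less significant than the cutoff; most significant first.
    passing = sorted((h for h in hits if h[2] <= cutoff), key=lambda h: h[2])
    # Collect the first max_seqs distinct sequences.
    kept = []
    for _, seq, _ in passing:
        if seq not in kept and len(kept) < max_seqs:
            kept.append(seq)
    # Every entry sharing a kept sequence is returned as its own hit.
    return [h for h in passing if h[1] in kept]


# Hypothetical data: entryA and entryB have identical sequences.
hits = [
    ("entryA", "MKTAYIAK", 1e-30),
    ("entryB", "MKTAYIAK", 1e-30),
    ("entryC", "MKTAYLAK", 1e-12),
    ("entryD", "MRTAYLGK", 1e-5),
    ("entryE", "MQSAFLGR", 0.5),
]
selected = select_hits(hits, cutoff=1e-3, max_seqs=2)
print([name for name, _, _ in selected])
```

With max_seqs set to 2, three hits are returned here because two of the entries share one of the two retained sequences.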

The remaining options control what happens when the search completes:

The showResultsTable option indicates whether the results should be shown in an interactive table in a separate window (default true). In this table, many columns of information can be shown and used to sort the hits: alignment scores, structure resolution, ligand residue names, etc.
The name option supplies a name for a specific set of Blast Protein results, which may be useful when several sets of results are shown at the same time. The name appears in the title bar of the results panel.
In the results panel, double-clicking a row with an associated structure (from searching PDB, AlphaFold, ESMFold) fetches the structure, and if a structure chain was used as the query, automatically superimposes the hit and query structures with matchmaker. If the query was sequence-only (not a structure chain), the first structure opened from the results will serve as the reference for superimposing the others. AlphaFold-predicted structures are colored by confidence value on a 0-100 scale. ESMFold-predicted structures are colored by confidence value on a 0-1 scale. One or more hits can be chosen (highlighted) in the list and the panel's context menu used to fetch and superimpose all of the corresponding structures, or to show their multiple sequence alignment with the query in the Sequence Viewer.
The log option indicates whether to also list the results in the Log (default false).
The loadStructures option indicates whether, for each hit with an associated structure (from searching PDB, AlphaFold, ESMFold), the structure should be fetched and superimposed as described above (default false). Setting this option to true automatically opens a potentially large number of structures. If onlyBest is false (default), this could include multiple copies of the same PDB entry (e.g., a homotetramer structure could be opened four times, each corresponding to a match to one of the monomers). If onlyBest is true, a given PDB entry will only be opened once and matched on the chain with the best sequence alignment score. To open only specific structures of interest, use the results table instead.
The showSequenceAlignment option indicates whether to show the multiple sequence alignment of all hits with the query in the Sequence Viewer (default false). Similar to loading all structures, setting this option to true may give an alignment with a very large number of sequences. If onlyBest is false (default), this may include multiple chains from the same PDB entry (e.g., the four chains of a homotetramer). If onlyBest is true, only the chain with the best alignment score will be included for a given PDB entry. To include only specific sequences of interest, use the results table instead.
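
The effect of onlyBest described above can be sketched in Python as keeping, for each PDB entry, only the chain with the best alignment score. This is a minimal illustration under stated assumptions, not ChimeraX's implementation: the function best_chain_per_entry and the entry identifiers are hypothetical, a higher score is assumed to be better, and keeping the first chain seen when scores tie is an arbitrary choice made here.

```python
# Minimal sketch of the onlyBest idea; not ChimeraX's actual code.
def best_chain_per_entry(hits):
    """Return one (entry_id, chain_id, score) tuple per entry.

    hits: iterable of (entry_id, chain_id, score); higher score is better.
    On ties, the first chain encountered is kept (an arbitrary choice here).
    """
    best = {}
    for entry_id, chain_id, score in hits:
        if entry_id not in best or score > best[entry_id][2]:
            best[entry_id] = (entry_id, chain_id, score)
    return list(best.values())


# Hypothetical hits: "entry1" is a homotetramer matched on all four chains.
hits = [
    ("entry1", "A", 250.0),
    ("entry1", "B", 250.0),
    ("entry1", "C", 248.0),
    ("entry1", "D", 248.0),
    ("entry2", "A", 180.0),
]
print(best_chain_per_entry(hits))
```

Without this reduction (onlyBest false), all five chain-level hits in the example would be loaded or aligned individually.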

UCSF Resource for Biocomputing, Visualization, and Informatics / November 2024