Command: esmfold

ESMFold (Evolutionary Scale Modeling) is an artificial intelligence method for predicting protein structures. The method is described in:

Evolutionary-scale prediction of atomic-level protein structure with a language model. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, Dos Santos Costa A, Fazel-Zarandi M, Sercu T, Candido S, Rives A. Science. 2023 Mar 17;379(6637):1123-1130.

The esmfold command:

- finds and retrieves existing models from the ESM Metagenomic Atlas, which contains over 600 million predicted protein structures
- runs new ESMFold predictions on the prediction server provided by the ESM Metagenomic Atlas
- plots residue-residue alignment errors for ESMFold structures and shows them with colored pseudobonds

ESMFold-predicted structures vary in confidence levels (see coloring) and should be interpreted with caution. The esmfold command is also implemented as the tools ESMFold and ESMFold Error Plot. See also: the ChimeraX ESMFold example, alphafold, blastprotein, modeller, swapaa


Getting Models from the ESM Metagenomic Atlas

Usage: esmfold fetch mgnify-id [ alignTo  chain-spec [ trim  true | false ]] [ colorConfidence  true | false ] [ ignoreCache  true | false ] [ pae  true | false ] [ version  N ]
Usage: esmfold match sequence [ trim  true | false ] [ colorConfidence  true | false ] [ ignoreCache  true | false ] [ pae  true | false ]
Usage: esmfold search sequence [ matrix  similarity-matrix ] [ cutoff  evalue ] [ maxSequences  M ] [ version  N ]

The esmfold fetch command retrieves the model (if available) for a sequence specified by its MGnify identifier. This identifier can be obtained by sequence search at the ESM Metagenomic Atlas. Example:
esmfold fetch MGYP000542242899
The esmfold match command retrieves models for sequences the same as or similar to those of experimentally determined protein structures already open in ChimeraX, or other sequences independent of structure. Giving the model number of an atomic structure already open in ChimeraX specifies all of its protein chains. Examples with sequence given as a chain-spec:
esmfold match #1
esmfold match #3/B,D trim false

Alternatively, the sequence can be given as any of the following:
- the sequence-spec of a sequence in the Sequence Viewer, in the form: alignment-ID:sequence-ID (details...)
- a UniProt name or accession number
- plain text pasted directly into the command line
For a specified structure chain, a model is obtained for the single top hit (closest sequence match) identified by K-mer search of the ESM Metagenomic Atlas. This type of search is fast but low-sensitivity, requiring high % identity for a hit to be found.
For each model with a corresponding structure chain from the esmfold match command or the alignTo option of esmfold fetch:
1. the chain ID of the predicted structure is made the same as the corresponding chain of the existing model
2. the predicted structure is superimposed onto the existing chain using matchmaker, and the following are reported in a table in the Log:
  - Chain – chain ID in ChimeraX
  - MGnify Id – sequence ID used by the ESM Metagenomic Atlas
  - RMSD – Cα root-mean-square deviation between the predicted and experimental structures, over all residues of the latter
  - Length – number of residues in the predicted structure
  - Seen – number of residues with atomic coordinates in the experimental structure
  - % Id – percent identity in the sequence alignment generated by matchmaker for superposition; the number of positions with identical residues divided by the length of the shorter sequence
3. the following attributes are assigned to the residues of the predicted structure:
  - c_alpha_distance – Cα distance between corresponding positions of the predicted and existing chains after their superposition (step 2 above)
  - missing_structure – positions missing from the coordinates of the existing chain
  - same_sequence – positions with the same residue type as the existing chain
  These attributes can be used for coloring and other purposes.
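The % Id and Cα-distance quantities described above can be illustrated with a minimal Python sketch. It is not ChimeraX code: the function names are hypothetical, the inputs are assumed to be two gapped sequences of equal length (with "-" as the gap character) and paired, already-superimposed Cα coordinates, and the RMSD here is taken over whatever pairs are supplied.

```python
import math


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical aligned positions divided by the shorter ungapped length, as a percentage."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both sequences have the same residue (gaps never match).
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    shorter = min(len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, "")))
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * identical / shorter


def ca_distances(coords_a, coords_b):
    """Per-position distance between paired (x, y, z) coordinates after superposition."""
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]


def rmsd(distances):
    """Root-mean-square of a list of pairwise distances."""
    if not distances:
        raise ValueError("at least one distance is required")
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

For example, percent_identity("ACDE-G", "ACDFHG") counts four identical positions over a shorter sequence of five residues.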
The esmfold search command uses a BLAST web service hosted by the UCSF RBVI to search the ESM Metagenomic Atlas. It differs from esmfold match in that it uses BLAST instead of fast (but low-sensitivity) K-mer searching, accepts only a single chain or sequence as input, and returns a list of hits for the user to inspect, rather than fetching the single top hit per chain automatically. The query sequence can be given as any of the following:
- a chain-spec corresponding to a single chain in an atomic structure open in ChimeraX
- the sequence-spec of a sequence in the Sequence Viewer
- a UniProt name or accession number
- plain text pasted directly into the command line
The matrix option indicates which amino acid similarity-matrix to use for scoring the hits (uppercase or lowercase can be used): BLOSUM45, BLOSUM50, BLOSUM62 (default), BLOSUM80, BLOSUM90, PAM30, PAM70, PAM250, or IDENTITY. The cutoff evalue is the maximum or least significant expectation value needed to qualify as a hit (default 1e-3). Results can also be limited with the maxSequences option (default 100); this is the maximum number of unique sequences to return.
When results are returned, the hits are listed in a Blast Protein window. Double-clicking a hit uses esmfold fetch to retrieve the model, or multiple chosen hits can be retrieved at once by using the results panel context menu or Load Structures button (details...).


Options

alignTo chain-spec
Superimpose the predicted structure from esmfold fetch onto a single chain in an already-open structure, and make its chain ID the same as that chain's. See also the trim option.

colorConfidence true | false
Whether to color the predicted structures by the pLDDT confidence measure (same as for AlphaFold except mapped to 0-1 instead of 0-100) in the B-factor field (default true):

- 1.0 to 0.9 – high accuracy expected
- 0.9 to 0.7 – backbone expected to be modeled well
- 0.7 to 0.5 – low confidence, caution
- 0.5 to 0.0 – should not be interpreted, may be disordered

...in other words, using

color bfactor palette esmfold

The Color Key graphical interface or a command can be used to draw a corresponding color key, for example:

key red:low orange: yellow: cornflowerblue: blue:high [other-key-options]
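The confidence ranges listed for colorConfidence can be expressed as a small lookup, sketched here in Python. The function name is hypothetical, and treating each lower bound as inclusive (so that exactly 0.9 counts as high accuracy) is an assumption; the documentation above does not say which range a boundary value belongs to.

```python
def plddt_category(plddt):
    """Map an ESMFold pLDDT value on the 0-1 scale to its confidence description."""
    if not 0.0 <= plddt <= 1.0:
        raise ValueError("ESMFold pLDDT is expected on a 0-1 scale")
    # Assumption: lower bounds are inclusive.
    if plddt >= 0.9:
        return "high accuracy expected"
    if plddt >= 0.7:
        return "backbone expected to be modeled well"
    if plddt >= 0.5:
        return "low confidence, caution"
    return "should not be interpreted, may be disordered"
```

An AlphaFold-style value on the 0-100 scale would need to be divided by 100 before being passed in.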

ignoreCache true | false
The fetched models are stored locally in ~/Downloads/ChimeraX/ESMFold/, where ~ indicates a user's home directory. If a file specified for opening is not found in this local cache or ignoreCache is set to true, the file will be fetched and cached.

trim true | false
Whether to trim a predicted protein structure to the same residue range as the corresponding experimental structure given with the esmfold match command or the alignTo option of esmfold fetch. With trim true (default):

- Predictions with a UniProt identifier determined by esmfold match from the experimental structure's input file will be trimmed to the same residue ranges as used in the experiment. These ranges are given in DBREF records in PDB format and in struct_ref and struct_ref_seq tables in mmCIF.
- Predictions retrieved with esmfold fetch or found by esmfold match searching for similar sequences in the ESM Atlas will be trimmed to start and end with the first and last aligned positions in the sequence alignment calculated by matchmaker as part of the superposition step.

Using trim false indicates retaining the full-length models of the sequences, which could be longer.
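Trimming to the first and last aligned positions can be sketched as follows. This is an illustration in Python, not the ChimeraX implementation: the function name is hypothetical, and the input is assumed to be a pairwise alignment given as two gapped strings of equal length with "-" as the gap character.

```python
def trim_range(aligned_pred, aligned_ref, gap="-"):
    """Return (first, last) 1-based positions in the predicted sequence
    that are aligned to a residue of the reference sequence."""
    if len(aligned_pred) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    position = 0  # number of predicted residues seen so far
    first = last = None
    for p, r in zip(aligned_pred, aligned_ref):
        if p == gap:
            continue  # a gap in the prediction does not consume a residue
        position += 1
        if r != gap:
            if first is None:
                first = position
            last = position
    if first is None:
        raise ValueError("no aligned positions found")
    return first, last
```

With trim false, the equivalent result would simply be the full range of the predicted sequence.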

version N
Which version of the ESM Metagenomic Atlas to use with esmfold fetch, esmfold pae, and esmfold search (as well as blastprotein with database esmfold). The default is the most recent version found at the website (currently 0). The esmfold match command always uses the current version and does not have this option.


Running an ESMFold Prediction

The esmfold predict command runs a calculation on the prediction server provided by the ESM Metagenomic Atlas.

Usage: esmfold predict sequence [ subsequence start,end ] [ residueRange start,end ] [ chunk N ] [ overlap M ]

The protein sequence to predict can be given as any of the following:

- a chain-spec corresponding to a single chain in an atomic structure open in ChimeraX
- the sequence-spec of a sequence in the Sequence Viewer, in the form: alignment-ID:sequence-ID (details...)
- a UniProt name or accession number
- plain text pasted directly into the command line

The server has a maximum sequence length of 400 residues. The first three methods above specify an entire sequence, but a subsequence can be:

- pasted directly as the sequence
- given with subsequence start,end (integers separated by a comma only; subsequence start and end positions relative to the entire sequence starting at 1)
- given with residueRange start,end (integers separated by a comma only; subsequence start and end positions in the structure residue numbering, when the sequence is specified as a structure chain)

Alternatively, a single command can be used to predict a long sequence as a series of shorter chunks with overlaps (to allow their superposition). Example:

esmfold predict #1 chunk 400 overlap 20
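How a long sequence might be divided into overlapping chunks can be sketched in Python. The exact way ChimeraX places chunk boundaries is not stated here, so this is an assumption: each chunk holds at most chunk residues, and each chunk after the first begins overlap residues before the previous one ends. The function name is hypothetical.

```python
def chunk_ranges(length, chunk, overlap):
    """Return 1-based inclusive (start, end) ranges covering a sequence
    of the given length, with consecutive ranges sharing `overlap` residues."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if chunk < 1 or not 0 <= overlap < chunk:
        raise ValueError("need chunk >= 1 and 0 <= overlap < chunk")
    ranges = []
    start = 1
    while True:
        end = min(start + chunk - 1, length)
        ranges.append((start, end))
        if end == length:
            break
        # Assumption: the next chunk re-predicts the last `overlap` residues.
        start = end - overlap + 1
    return ranges
```

Under this assumption, a 1000-residue sequence with chunk 400 and overlap 20 would be covered by three predictions, each within the server's 400-residue limit.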

The predicted structure will be opened automatically and colored by confidence value. The model for a sequence that was specified by structure chain will be superimposed on that chain and assigned structure-comparison attributes for further analysis (details...).

Caveats

- ESMFold is faster but often less accurate than AlphaFold.
- The prediction server has a maximum sequence length of 400 residues. (The atlas contains longer predictions, up to 1024 residues.)
- No multimer prediction. Only single-chain structures are predicted, although the individual protein chains of a complex can be predicted separately.
- PAE not available from server. Although ESMFold computes predicted aligned error (PAE), the prediction server does not provide it. The PAE is available for entries fetched from the atlas, however.
- The server developers request that users run only one prediction at a time due to capacity limitations.
- The server may time out during a prediction.


ESMFold Predicted Aligned Error (PAE)

Besides per-residue confidence values, ESMFold gives for each pair of residues (X,Y) the expected position error at residue X if the predicted and true structures were aligned on residue Y. These predicted aligned error or PAE values are not provided by the prediction server or esmfold predict, but are available for structures already in the ESM Metagenomic Atlas and can be shown as a 2D plot by using esmfold fetch or esmfold match with the option pae true, or the command esmfold pae:

Usage: esmfold pae [ model-spec ] ( mgnifyId mgnify-id | file filename ) [ palette palette ] [ range low,high | full ] [ plot true | false ] [ colorDomains true | false ] [ minSize M ] [ connectMaxPae N ] [ cluster resolution ] [ version N ]

With esmfold pae, the matrix of PAE values can be either:

- fetched from the ESM Metagenomic Atlas with the mgnifyId option, where mgnify-id can be obtained by sequence search at the ESM Metagenomic Atlas
– or –
- read from a JSON file from ESMFold (e.g., obtained by URL such as https://api.esmatlas.com/fetchConfidencePrediction/MGYP002537940442) specified with the file option. The filename is generally a pathname to a local file, either absolute or relative to the current working directory as reported by pwd. Substituting the word browse for filename brings up a file browser window for choosing the name and location interactively.

The corresponding ESMFold structure (already open) can be given as a model-spec in the esmfold pae command to associate it with the plot. This association allows coloring by domain as described below, and allows selections on the plot to highlight the corresponding parts of the structure.

By default, the PAE plot is drawn only when domain coloring is not requested: plot defaults to true when colorDomains is false, and to false when colorDomains is true.

Setting colorDomains to true clusters the residues into coherent domains (sets of residues with relatively low pairwise PAE values) and uses randomly chosen colors to distinguish these domains in the structure. The residues are assigned an integer domain identifier (starting with 1) as an attribute named pae_domain that can be used to specify them in commands (for example, to recolor or select specific domains). Residues not grouped into any domain are assigned a pae_domain value of None. The clustering uses the NetworkX greedy_modularity_communities algorithm with parameters:

- minSize (default 10 residues) – minimum number of residues allowed in a domain
- connectMaxPae (default 5.0 Å) – the maximum PAE value allowed between residues for them to be clustered into the same domain. Larger values give larger domains and generally increase the time to compute the clustering, which is ~5 seconds for 1000 residues when the default of 5.0 is used.
- cluster (default 0.5, typical range 0.5–5.0) – graph resolution; larger values give smaller domains
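The roles of connectMaxPae and minSize can be illustrated with a deliberately simplified Python sketch. It does not use the NetworkX greedy_modularity_communities algorithm named above; plain connected components are swapped in as a stand-in, so there is no counterpart to the cluster resolution parameter. The function name, the requirement that PAE be below the cutoff in both directions, and the numbering of domains by decreasing size are all assumptions made for illustration.

```python
from collections import deque


def pae_domains(pae, connect_max_pae=5.0, min_size=10):
    """Label residues with a domain number (from 1) or None.

    `pae` is a square list of lists. Residues i and j are linked when both
    pae[i][j] and pae[j][i] are at or below connect_max_pae (an assumption).
    Linked groups smaller than min_size are left unlabeled (None).
    """
    n = len(pae)
    neighbors = [
        [j for j in range(n)
         if j != i and pae[i][j] <= connect_max_pae and pae[j][i] <= connect_max_pae]
        for i in range(n)
    ]
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search collects one connected component.
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            i = queue.popleft()
            component.append(i)
            for j in neighbors[i]:
                if not seen[j]:
                    seen[j] = True
                    queue.append(j)
        components.append(component)
    domains = sorted((c for c in components if len(c) >= min_size), key=len, reverse=True)
    labels = [None] * n
    for domain_id, component in enumerate(domains, start=1):
        for i in component:
            labels[i] = domain_id
    return labels
```

Connected components tend to merge regions joined by even a single low-PAE link, which is one reason a modularity-based method with a tunable resolution gives more useful domains in practice.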

The default palette for coloring the PAE plot is pae. Another palette with a value range suitable for PAE plots is paegreen. Both palettes span PAE values from 0 to 30.


ESMFold Predicted Aligned Error Plot

Besides per-residue confidence values, ESMFold gives for each pair of residues (X,Y) the expected position error at residue X if the predicted and true structures were aligned on residue Y. These predicted aligned error or PAE values are not provided by the prediction server or esmfold predict, but are available for structures already in the ESM Metagenomic Atlas and can be shown as a 2D plot by using esmfold fetch with the option pae true, or the command esmfold pae.

The plot window has buttons for coloring the associated structure:

- Color PAE Domains applies coloring by PAE cluster as described above.
- Color pLDDT returns the structure to the default confidence coloring.

The plot's context menu includes:

- Dragging box colors structure (initial default checked on) – whether dragging a box on the plot highlights the corresponding parts of the 3D structure with bright colors and makes everything else gray; if this option is unchecked, highlighting will be done with selection instead of coloring
- Color plot from structure – color the plot to match the 3D structure where the pair of residues represented by an X,Y point have the same ribbon color; show the rest of the plot in shades of gray
- Color plot rainbow – use the pae palette (default) to color the plot, spanning values 0 to 30
- Color plot green – use the paegreen palette to color the plot, spanning values 0 to 30
- Show chain divider lines (initial default checked on) – for multimer predictions, draw lines on the plot demarcating the end of one chain and the start of another; the lines may obscure a few chain-terminal residues in the plot, and can be hidden if this is problematic
- Save image – save the plot as a PNG file

The Color Key graphical interface or a command can be used to draw (in the main graphics window) a color key for the PAE plot. For example, to make a color key that matches the pae or paegreen scheme, respectively:

key pae :0 : : :15 : : :30 showTool true
key paegreen :0 : : :15 : : :30 showTool true

A title for the color key (e.g., “Predicted Aligned Error (Å)”) would need to be created separately with 2dlabels.


Pseudobonds Colored by PAE

Residue-residue PAE values can also be shown with colored pseudobonds in the predicted structure:

Usage: esmfold contacts res-spec1 [ toResidues res-spec2 ] [ flip true | false ] [ distance d ] [ palette palette ] [ range low,high | full ] [ radius r ] [ dashes N ] [ name model-name ] [ replace true | false ] [ outputFile pae-file ]

A PAE plot containing the specified residues must already be shown. The PAE matrix is not symmetrical. The first specification res-spec1 gives the aligned residues, whereas toResidues res-spec2 gives the residues whose error values are reported, except that using flip true swaps the meaning of res-spec1 and res-spec2. If one set of residues is higher-confidence (higher in pLDDT) than the other, it is usually best to specify them as the aligned residues so that the coloring will show the error values of the lower-confidence set.

Omitting the toResidues option defines res-spec2 as all residues covered by the PAE plot except for those in res-spec1; however, if toResidues is omitted and res-spec1 includes all residues in the plot, res-spec2 will also be defined as all residues in the plot.

The distance option allows limiting the number of pseudobonds by only drawing them between pairs of residues with any interresidue distance ≤ d Å (default 3.0). These pseudobonds are drawn between α-carbons regardless of which atoms were within the distance cutoff.
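The interplay of the aligned set, the reported set, flip, and the distance cutoff can be sketched in Python. This is an illustration only: the function name is hypothetical, residues are plain integer indices, the PAE matrix is assumed to be indexed as pae[aligned][reported], and closest_distance is assumed to hold the smallest interatomic distance for each residue pair.

```python
def pae_contacts(pae, closest_distance, aligned, reported, distance=3.0, flip=False):
    """Return (aligned_residue, reported_residue, pae_value) for each pair
    whose closest interatomic distance is within the cutoff."""
    if flip:
        # flip swaps which set is treated as aligned and which is reported.
        aligned, reported = reported, aligned
    contacts = []
    for a in aligned:
        for r in reported:
            if a != r and closest_distance[a][r] <= distance:
                # Assumed index order; the matrix is not symmetrical,
                # so pae[a][r] and pae[r][a] generally differ.
                contacts.append((a, r, pae[a][r]))
    return contacts
```

Calling the function twice on the same residue sets, once with flip=True, shows how the reported value for a given contact changes with the direction.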

The default palette for coloring the pseudobonds by PAE value is paecontacts.

Although this palette includes value-color pairs, it may be helpful to give a value range if a colors-only palette is used instead. A range can also be used to override the values in a value-color palette, instead spacing the colors evenly across the specified range.
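Spacing colors evenly across a range amounts to simple linear interpolation of the anchor values, as in this Python sketch. The function name is hypothetical, and the sketch only computes the value assigned to each color; it does not reproduce how ChimeraX blends between them.

```python
def spread_colors(colors, low, high):
    """Pair each color with a value so the colors are evenly spaced from low to high."""
    if len(colors) < 2:
        raise ValueError("at least two colors are needed to span a range")
    step = (high - low) / (len(colors) - 1)
    return [(low + i * step, color) for i, color in enumerate(colors)]
```

For a two-color palette such as blue:red with range 1,5, this places blue at 1 and red at 5.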

The pseudobond stick radius (default 0.2 Å) and number of dashes (default 1, meaning a solid stick) can also be specified.

The name option allows specifying the pseudobond model-name (default PAE Contacts). If a model by that name already exists, any pre-existing pseudobonds will be removed from that model and replaced by the new ones (replace true, default) unless replace false is used.

The outputFile option allows saving a list of the residue pairs (those meeting the distance criterion) and their PAE values to a plain text file. The pae-file argument is the output file pathname, enclosed in quotation marks if it includes spaces, or the word browse to specify it interactively in a file browser window.

Examples:

esmfold contacts #1
esmfold contacts /A to /B distance 8
esmfold contacts sel palette blue:red range 1,5

The following would select all pseudobonds and label them with the names and numbers of the residues that they connect:

sel pbonds
label sel pseudobonds text "{0.atoms[0].residue.name} {0.atoms[0].residue.number} to {0.atoms[1].residue.name} {0.atoms[1].residue.number}"

UCSF Resource for Biocomputing, Visualization, and Informatics / March 2023