Tom Goddard
goddard@sonic.net
December 18, 2024
Can we run AlphaFold 3 with a protein target against all compounds in FDA approved drugs to search for ones that might bind to the target? Will AlphaFold 3 give useful scores to evaluate small-molecule binding, how long will it take, and how do you actually run it? First we try 3 small test cases running predictions for anti-viral compounds (52) against three virus proteins. Then we run all compounds (1508) in FDA approved drugs against the SARS-CoV-2 main protease.
First we try a few small test cases running 52 anti-viral compounds against a few virus targets: Nipah virus G protein (the spike that mediates cell entry), influenza neuraminidase (enables new virus release from cells), and hiv protease. In the latter two tests there are known drugs for these targets and AlphaFold 3 predictions score those protein + drug complexes near the top when sorted by confidence scores. These complete in 50 minutes, 22 minutes, and 10 minutes on one Nvidia 4090 GPU, the times depending on the protein sequence lengths 602, 390, and 198 (dimer with monomer sequence length 99). The run times scale approximately as the square of the number of protein residues.
Next we try running all 1508 compounds in FDA approved drugs against a protein target, the main protease of SARS-CoV-2, sequence length 306. To speed up the calculation we predict only 2 structures per compound instead of the usual 5. This took 7 hours on a single Nvidia 4090 GPU.
We run a separate AlphaFold 3 prediction of a protein target with each compound. We start with the protein-only input file nipah_g_with_msa.json and a file containing compound names and SMILES strings describing the compound chemical structure virus_drugs_smiles.txt. How to create these two files is described below. The Python script af3_drug_json_input.py then makes an AlphaFold 3 input file with name matching the compound name (e.g. abacavir_sulfate.json) for each compound + protein prediction.
$ mkdir runs $ cd runs $ python3 ../af3_drug_json_input.py ../nipah_g_with_msa.json
Running AlphaFold 3 with the input_dir option makes predictions for every input file in the specified directory sequentially. The "run_data_pipeline false" option says to use the precomputed MSA that is already in the input files.
$ af3_run --input_dir . --run_data_pipeline false >& runs.out
I am running a locally installed AlphaFold 3 (source from github 4 December, 2024) and use a shell script af3_run that I wrote to set the Python environment and then call a Python script I wrote run_alphafold3.py which sets default database and weights locations and other default options and then calls the standard AlphaFold 3 run_alphafold.py script. I prefer to run local non-containerized AlphaFold 3 on a desktop computer (Ubuntu 24.04) because it allows very easy modifications to the AlphaFold 3 code and also offers faster predictions than our local high performance cluster (UCSF Wynton cluster) due to the Nvidia 4090 being faster than Nvidia A40 and NVMe drive being faster than a nework file system (BeeGFS backed by ZFS on spinning drives).
AlphaFold 3 produces confidence scores ipTM (interface predicted template model score), pLDDT (predicted local distance difference test), and PAE (predicted aligned error) that might be useful in judging if a ligand realy binds to the target. The ipTM score is in the *_summary_confidences.json files. The pLDDT score is in the bfactor column of the predicted structure .cif file. The PAE data is the _confidences.json files. To output scores for all of the ligand predictions I wrote ChimeraX script drug_scores.py. Use ChimeraX command to run the script specifying the prediction structure files (.cif)
run drug_scores.py */*_model.cif
The ChimeraX Log panel shows the output. The last 3 columns summarize the number of ligand atoms, protein residues and interaction pairs with low predicted aligned error (PAE <= 3 Angstroms) and distance less than 4 Angstroms (ligand atom to residue atom).
Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair darunavir 0.94 97 32 23 60 tipranavir 0.90 92 38 24 68 ritonavir 0.89 92 47 24 78 ...
The scores for the Nipah G protein predictions from drug_scores.py are nipah_g_scores.txt ranked from highest to lowest ipTM score. We have no information on whether any of these bind to the protein. Many have very high confidence scores and it seems unlikely that they bind with any affinity.
Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair elvitegravir 0.97 92 26 14 44 ganciclovir_sodium 0.94 87 15 9 34 fostemsavir_tromethamine 0.94 78 37 14 57 valganciclovir_hydrochloride 0.93 79 17 12 29 tenofovir_alafenamide 0.93 77 24 10 33 valacyclovir 0.92 77 21 7 30 tenofovir_alafenamide_fumarate 0.92 73 23 10 31 acyclovir_sodium 0.92 85 13 9 31 adefovir_dipivoxil 0.91 69 25 11 43 ...
The scores for the HIV protease predictions from drug_scores.py are neuraminidase_scores.txt ranked from highest to lowest ipTM score. The 5 drugs classed as Neuraminidase Inhibitor (labeled with "*") are in the top 8 ipTM scoring predictions. (Note oseltamivir phosphate and oseltamavir phosphate are synonyms that I missed.)
Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair *zanamivir 0.98 97 19 16 50 *peramivir 0.97 95 21 16 48 *oseltamivir_acid 0.97 96 18 14 37 *oseltamivir 0.96 92 20 14 42 valganciclovir 0.95 87 22 15 53 entecavir 0.95 89 18 13 41 valganciclovir_hydrochloride 0.94 86 24 15 54 *oseltamivir_phosphate 0.94 84 20 13 36 *oseltamavir_phosphate 0.94 84 20 13 36 penciclovir 0.93 83 16 12 35 ...
The scores for the HIV protease predictions from drug_scores.py are hiv_protease_scores.txt ranked from highest to lowest ipTM score. The 10 drugs classed as HIV Protease Inhibitor (labeled with "*") are in the top 12 ipTM scoring predictions.
Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair *darunavir 0.94 97 32 23 60 *tipranavir 0.90 92 38 24 68 *ritonavir 0.89 92 47 24 78 *nelfinavir_mesylate 0.89 91 41 25 78 *lopinavir 0.89 93 40 26 69 *atazanavir 0.77 76 19 15 29 *atazanavir_sulfate 0.73 69 13 11 24 *peramivir 0.72 71 14 7 18 cidofovir_anhydrous 0.72 67 1 1 1 abacavir_sulfate 0.72 54 11 5 11 *tenofovir_alafenamide 0.70 56 6 4 11 *fosamprenavir_calcium 0.69 65 9 8 21 ...
The scores for the SARS-CoV-2 main protease predictions from drug_scores.py are sars_cov_2_mpro_scores.txt ranked from highest to lowest ipTM score. The only FDA approved SARS-CoV-2 main protease inhibitor nirmatrelvir (part of paxlovid) does not list active ingredients in the FDA National Drug Code (strange omission).
TODO: Add list of top scoring compounds.
To speed up the predictions you can compute just 1 structure instead of the default 5 per ligand using
$ af3_run --input_dir . --run_data_pipeline false --num_diffusion_samples 1 >& runs.out
The HIV protease test run took 5 minutes for 52 predictions with 1 structure per prediction (one compound 6 seconds) versus 10 minutes with 5 structures (one compound 12 seconds). At this rate of about 10 predictions per minute on one Nvidia 4090 GPU we could run 600 predictions per hour, so predicting for all 1500 compounds in FDA approved drugs would be feasible in a few hours.
The speed of prediction is highly dependent on the protein sequence length. The HIV protease dimer has only 198 amino acids. For the SARS-CoV-2 main protease length 306 monomer sequence predictions took about 16.5 seconds per compound for 2 structures allowing 1500 compounds to be run in 7 hours.
Most of the time in an AlphaFold 3 prediction is spent computing a multiple sequence alignment (MSA). For instance for the 602 amino acid Nipah virus G-protein sequence the MSA calculation took about 5 minutes while predicting 5 structures took about 50 seconds. To save time predicting the same protein with 20,000 different FDA approved drugs we want to compute the MSA one time and reuse it. This is easy to do by running the protein target prediction one time with no ligands.
$ af3_run --json_path nipah_g.json >& nipah_g.out
I made the AlphaFold 3 json input file nipah_g.json in a text editor. The run will produce a new input file nipah_g_data.json that includes the MSA in the results directory nipah_g which we rename to nipah_g_with_msa.json for subsequent AlphaFold runs with ligands.
The MSAs can be large 10-100 Mbytes and we will duplicate this file for each ligand. To save space the MSAs can be place in a separate .a3m files and the .json file can reference the external file using the input file "unpairedMsaPath", "pairedMsaPath" and "mmcifPath" attributes. The last attribute is from mmCIF structure templates and the file suffix should be .cif. I wrote a script af3_extract_msa.py to extract the MSA and mmCIF templates to from an AlphaFold 3 json input into separate files. Unfortunately running AlphaFold 3 always creates in the results directory a new input file {name}_model.json that includes the MSA and mmCIF reducing the space savings.
I created a list of 52 anti-viral compounds virus_drugs.txt found in the FDA National Drug Code Directory. The list virus_drugs_classes.txt also lists the pharmacological classes for the drugs the compound was found in.
I chose a set of anti-viral drugs by taking the FDA National Drug Code Directory drug-ndc-0001-of-0001.json file from the OpenFDA downloads web site (updated weekly, I used the Dec 11, 2024 drug-ndc-0001-of-0001.json file 238 Mbytes). The FDA National Drug Code (NDC) Directory contains information about finished drug products, unfinished drugs and compounded drug products.
I wrote a Python script filter_fda_drugs2.py to scan this JSON file for entries where product_type is "HUMAN PRESCRIPTION DRUG" with active ingredients ending in "vir" or containing a word ending in "vir". Almost all the active ingredients "vir" are anti-viral compounds. The script outputs a list of compound names and takes a few seconds to output 52 compounds.
$ python3 filter_fda_drugs2.py drug-ndc-0001-of-0001.json abacavir sulfate acyclovir acyclovir sodium adefovir dipivoxil ...
Adding the word "class" as an option when running the filter script outputs the pharm_class (pharmacologic class) field which gives an idea of what the drug does, for example, "Neuraminidase Inhibitors [MoA]". The suffix [MoA] means "method of action" and "[EPC]" means "established pharmacologic action".
$ python3 filter_fda_drugs2.py drug-ndc-0001-of-0001.json class abacavir sulfate Cytochrome P450 1A1 Inhibitors [MoA], ... acyclovir Herpes Zoster Virus Nucleoside Analog DNA Polymerase Inhibitor [EPC], ... ...
AlphaFold 3 can include small molecules in predictions using SMILES strings. I got the SMILES strings for our anti-viral compounds from PubChem using Python script pubchem_smiles2.py. I takes 20 seconds to run and fetches the SMILES strings for each of the 52 compounds virus_drugs_smiles.txt.
$ python3 pubchem_smiles2.py < virus_drugs.txt abacavir sulfate, C1CC1NC2=C3C(=NC(=N2)N)N(C=N3)[C@@H]4C[C@@H](C=C4)CO.C1CC1NC2=C3C(=NC(=N2)N)N(C=N3)[C@@H]4C[C@@H](C=C4)CO.OS(=O)(=O)O acyclovir, C1=NC2=C(N1COCCO)N=C(NC2=O)N acyclovir sodium, C1=NC2=C(N1COCCO)N=C(N=C2[O-])N.[Na+] adefovir dipivoxil, CC(C)(C)C(=O)OCOP(=O)(COCCN1C=NC2=C(N=CN=C21)N)OCOC(=O)C(C)(C)C ...