AlphaFold 3 Drug Screening

Tom Goddard
goddard@sonic.net
December 18, 2024

Can we run AlphaFold 3 with a protein target against all compounds in FDA approved drugs to search for ones that might bind to the target? Will AlphaFold 3 give useful scores to evaluate small-molecule binding, how long will it take, and how do you actually run it? First we try 3 small test cases running predictions for anti-viral compounds (52) against three virus proteins. Then we run a larger set of compounds (1508) from FDA approved prescription drugs against the SARS-CoV-2 main protease.

Test of anti-viral compounds against Nipah G protein, influenza neuraminidase and HIV protease

First we try a few small test cases running 52 anti-viral compounds against a few virus targets: Nipah virus G protein (the spike that mediates cell entry), influenza neuraminidase (enables new virus release from cells), and hiv protease. In the latter two tests there are known drugs for these targets and AlphaFold 3 predictions score those protein + drug complexes near the top when sorted by confidence scores. These complete in 50 minutes, 22 minutes, and 10 minutes on one Nvidia 4090 GPU, the times depending on the protein sequence lengths 602, 390, and 198 (dimer with monomer sequence length 99). The run times scale approximately as the square of the number of protein residues.

Test of all compounds in FDA approved drugs against SARS-CoV-2 main protease

Next we try running all 1508 compounds in FDA approved prescription drugs against a protein target, the main protease of SARS-CoV-2, sequence length 306. The main protease is a homodimer but to speed up the calculation we ran predictions for a monomer. The active site is not near the dimer interface. We also predict only 2 structures per compound instead of the usual 5. This took 7 hours on a single Nvidia 4090 GPU (minsky.cgl.ucsf.edu). On an Nvidia A40 GPU (UCSF Wynton cluster) these predictions take about 14 hours.

Running AlphaFold 3 predictions

We run a separate AlphaFold 3 prediction of a protein target with each compound. We start with the protein-only input file nipah_g_with_msa.json and a file containing compound names and SMILES strings describing the compound chemical structure virus_drugs_smiles.txt. How to create these two files is described below. The Python script af3_drug_json_input.py then makes an AlphaFold 3 input file with name matching the compound name (e.g. abacavir_sulfate.json) for each compound + protein prediction.

      $ mkdir runs
      $ cd runs
      $ python3 ../af3_drug_json_input.py ../nipah_g_with_msa.json

Running AlphaFold 3 with the input_dir option makes predictions for every input file in the specified directory sequentially. The "run_data_pipeline false" option says to use the precomputed MSA that is already in the input files.

      $ af3_run --input_dir . --run_data_pipeline false >& runs.out

Computer equipment

I am running a locally installed AlphaFold 3 (source from github 4 December, 2024) and use a shell script af3_run that I wrote to set the Python environment and then call a Python script I wrote run_alphafold3.py which sets default database and weights locations and other default options and then calls the standard AlphaFold 3 run_alphafold.py script. I prefer to run local non-containerized AlphaFold 3 on a desktop computer (Ubuntu 24.04) because it allows very easy modifications to the AlphaFold 3 code and also offers faster predictions than our local high performance cluster (UCSF Wynton cluster) due to the Nvidia 4090 being faster than Nvidia A40 and NVMe drive being faster than a nework file system (BeeGFS backed by ZFS on spinning drives).

Evaluating results

AlphaFold 3 produces confidence scores ipTM (interface predicted template model score), pLDDT (predicted local distance difference test), and PAE (predicted aligned error) that might be useful in judging if a ligand realy binds to the target. The ipTM score is in the *_summary_confidences.json files. The pLDDT score is in the bfactor column of the predicted structure .cif file. The PAE data is the _confidences.json files. To output scores for all of the ligand predictions I wrote ChimeraX script drug_scores.py. Use ChimeraX command to run the script specifying the prediction structure files (.cif)

  run drug_scores.py */*_model.cif

The ChimeraX Log panel shows the output. The last 3 columns summarize the number of ligand atoms, protein residues and interaction pairs with low predicted aligned error (PAE <= 3 Angstroms) and distance less than 4 Angstroms (ligand atom to residue atom).

  Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair
  darunavir 0.94 97 32 23 60
  tipranavir 0.90 92 38 24 68
  ritonavir 0.89 92 47 24 78
  ...

Results for Nipah G protein

The scores for the Nipah G protein predictions from drug_scores.py are nipah_g_scores.txt ranked from highest to lowest ipTM score. We have no information on whether any of these bind to the protein. Many have very high confidence scores and it seems unlikely that they bind with any affinity.

Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair
elvitegravir 0.97 92 26 14 44
ganciclovir_sodium 0.94 87 15 9 34
fostemsavir_tromethamine 0.94 78 37 14 57
valganciclovir_hydrochloride 0.93 79 17 12 29
tenofovir_alafenamide 0.93 77 24 10 33
valacyclovir 0.92 77 21 7 30
tenofovir_alafenamide_fumarate 0.92 73 23 10 31
acyclovir_sodium 0.92 85 13 9 31
adefovir_dipivoxil 0.91 69 25 11 43
...

Results for influenza neuraminidase

The scores for the HIV protease predictions from drug_scores.py are neuraminidase_scores.txt ranked from highest to lowest ipTM score. The 5 drugs classed as Neuraminidase Inhibitor (labeled with "*") are in the top 8 ipTM scoring predictions. (Note oseltamivir phosphate and oseltamavir phosphate are synonyms that I missed.) The neuraminidase sequence used was from PDB 1a4g.

Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair
*zanamivir 0.98 97 19 16 50
*peramivir 0.97 95 21 16 48
*oseltamivir_acid 0.97 96 18 14 37
*oseltamivir 0.96 92 20 14 42
valganciclovir 0.95 87 22 15 53
entecavir 0.95 89 18 13 41
valganciclovir_hydrochloride 0.94 86 24 15 54
*oseltamivir_phosphate 0.94 84 20 13 36
*oseltamavir_phosphate 0.94 84 20 13 36
penciclovir 0.93 83 16 12 35
...

Results for HIV protease

The scores for the HIV protease predictions from drug_scores.py are hiv_protease_scores.txt ranked from highest to lowest ipTM score. The 10 drugs classed as HIV Protease Inhibitor (labeled with "*") are in the top 12 ipTM scoring predictions. The HIV protease sequence used was UniProt O90777.

Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair
*darunavir 0.94 97 32 23 60
*tipranavir 0.90 92 38 24 68
*ritonavir 0.89 92 47 24 78
*nelfinavir_mesylate 0.89 91 41 25 78
*lopinavir 0.89 93 40 26 69
*atazanavir 0.77 76 19 15 29
*atazanavir_sulfate 0.73 69 13 11 24
*peramivir 0.72 71 14 7 18
cidofovir_anhydrous 0.72 67 1 1 1
abacavir_sulfate 0.72 54 11 5 11
*tenofovir_alafenamide 0.70 56 6 4 11
*fosamprenavir_calcium 0.69 65 9 8 21
...

Results for SARS-CoV-2 main protease

The scores for the SARS-CoV-2 main protease monomer predictions computed with script drug_scores.py are sars_cov_2_mpro_scores.txt ranked from highest to lowest ipTM score. The only FDA approved SARS-CoV-2 main protease inhibitor nirmatrelvir (part of paxlovid) does not list active ingredients in the FDA National Drug Code (strange omission). Here is a zip file sars_cov_2_mpro_drugs.zip (234 MBytes) containing all the predictions and the full predicted aligned error scores. I also predicted MPro monomer and nirmatrelvir separately and it has the higher ipTM than the other 1508 ligands. The SARS-CoV-2 MPro sequence used was sars_cov_2_mpro.fasta.

Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair
*nirmatrelvir 0.98 97 26 21 54
omaveloxolone 0.97 95 23 14 34
paroxetine_hydrochloride 0.96 90 17 14 41
fexinidazole 0.96 93 18 15 46
triamcinolone_hexacetonide 0.95 89 19 11 32
rotigotine 0.95 89 17 16 39
prazosin_hydrochloride 0.95 88 23 12 39
paroxetine_mesylate 0.95 88 21 15 49
decitabine 0.95 92 15 12 41
danicopan 0.95 91 26 17 49
pentostatin 0.94 88 18 11 40
luliconazole 0.94 86 16 14 31
imatinib_mesylate 0.94 85 26 19 57
glecaprevir 0.94 87 42 26 87
doravirine 0.94 88 25 19 57
...

Variability of results

To see if the predictions depend on the GPU used I ran the top 150 ligands of the SARS-CoV-2 main protease test on an Nvidia A40 GPU with 48 GB of memory and AMD EPYC 7543P CPU on the UCSF Wynton cluster. It took 1.5 hours so it should take about 15 hours to run all 1500 ligands. Somewhat surprisingly the resulting predicted structures were different despite the requested random seeds in the input files being the same. The differences appear to be small. Here are the scores A40_scores.txt of the A40 GPU predictions which can be compared to the Nvidia 4090 scores sars_cov_2_mpro_scores.txt above.

Name ipTM pLDDT PAE<3A/dist<4A/natom,nres,npair
omaveloxolone 0.97 95 23 14 34
nirmatrelvir 0.97 95 26 20 54
paroxetine_hydrochloride 0.96 92 16 14 35
fexinidazole 0.96 93 18 15 44
doravirine 0.96 94 23 18 57
decitabine 0.96 92 15 12 40
triamcinolone_hexacetonide 0.95 89 19 11 32
rotigotine 0.95 89 17 16 39
...

Speeding it up

To speed up the predictions you can compute just 1 structure instead of the default 5 per ligand using

      $ af3_run --input_dir . --run_data_pipeline false --num_diffusion_samples 1 >& runs.out

The HIV protease test run took 5 minutes for 52 predictions with 1 structure per prediction (one compound 6 seconds) versus 10 minutes with 5 structures (one compound 12 seconds). At this rate of about 10 predictions per minute on one Nvidia 4090 GPU we could run 600 predictions per hour, so predicting for all 1500 compounds in FDA approved drugs would be feasible in a few hours.

The speed of prediction is highly dependent on the protein sequence length. The HIV protease dimer has only 198 amino acids. For the SARS-CoV-2 main protease length 306 monomer sequence predictions took about 16.5 seconds per compound for 2 structures allowing 1500 compounds to be run in 7 hours.

Compute the protein target sequence alignment once

Most of the time in an AlphaFold 3 prediction is spent computing a multiple sequence alignment (MSA). For instance for the 602 amino acid Nipah virus G-protein sequence the MSA calculation took about 5 minutes while predicting 5 structures took about 50 seconds. To save time predicting the same protein with 20,000 different FDA approved drugs we want to compute the MSA one time and reuse it. This is easy to do by running the protein target prediction one time with no ligands.

      $ af3_run --json_path nipah_g.json >& nipah_g.out

I made the AlphaFold 3 json input file nipah_g.json in a text editor. The run will produce a new input file nipah_g_data.json that includes the MSA in the results directory nipah_g which we rename to nipah_g_with_msa.json for subsequent AlphaFold runs with ligands.

Separate MSA and mmCIF templates from prediction input file

The MSAs can be large 10-100 Mbytes and we will duplicate this file for each ligand. To save space the MSAs can be place in a separate .a3m files and the .json file can reference the external file using the input file "unpairedMsaPath", "pairedMsaPath" and "mmcifPath" attributes. The last attribute is from mmCIF structure templates and the file suffix should be .cif. I wrote a script af3_extract_msa.py to extract the MSA and mmCIF templates to from an AlphaFold 3 json input into separate files. Unfortunately running AlphaFold 3 always creates in the results directory a new input file {name}_model.json that includes the MSA and mmCIF reducing the space savings.

Finding FDA approved anti-viral chemical compounds

I created a list of 52 anti-viral compounds virus_drugs.txt found in the FDA National Drug Code Directory. The list virus_drugs_classes.txt also lists the pharmacological classes for the drugs the compound was found in.

I chose a set of anti-viral drugs by taking the FDA National Drug Code Directory drug-ndc-0001-of-0001.json file from the OpenFDA downloads web site (updated weekly, I used the Dec 11, 2024 drug-ndc-0001-of-0001.json file 238 Mbytes). The FDA National Drug Code (NDC) Directory contains information about finished drug products, unfinished drugs and compounded drug products.

I wrote a Python script filter_fda_drugs2.py to scan this JSON file for entries where product_type is "HUMAN PRESCRIPTION DRUG" with active ingredients ending in "vir" or containing a word ending in "vir". Almost all the active ingredients "vir" are anti-viral compounds. The script outputs a list of compound names and takes a few seconds to output 52 compounds.

  $ python3 filter_fda_drugs2.py drug-ndc-0001-of-0001.json
  abacavir sulfate
  acyclovir
  acyclovir sodium
  adefovir dipivoxil
  ...

Adding the word "class" as an option when running the filter script outputs the pharm_class (pharmacologic class) field which gives an idea of what the drug does, for example, "Neuraminidase Inhibitors [MoA]". The suffix [MoA] means "method of action" and "[EPC]" means "established pharmacologic action".

  $ python3 filter_fda_drugs2.py drug-ndc-0001-of-0001.json class
  abacavir sulfate	Cytochrome P450 1A1 Inhibitors [MoA], ...
  acyclovir	Herpes Zoster Virus Nucleoside Analog DNA Polymerase Inhibitor [EPC], ...
  ...

Filtering all FDA approved prescription compounds

The filtering used to produce 1508 compounds for the SARS-CoV-2 main protease predictions dropped the constraint that the compound name ends with "vir". Also compounds were looked up by name in PubChem to find SMILES strings and if no string was found, the compound was excluded. Also there are very large compounds, such as 30 amino-acid peptides that I wanted to exclude since I think predictions of those probably has lower accuracy. I also exclude very small ligands since they are unlikely to have specificity for a protein target. I used a dumb test filter_smiles.py counting the number of C characters in the SMILES string and require the count to be between 4 and 40. Unfortunately this includes chlorine in the counts. Requiring PubChem have the compound and have a SMILES string and the length limits reduced the 2863 FDA active ingredients to 1508.

Finding SMILES string descriptors for compounds

AlphaFold 3 can include small molecules in predictions using SMILES strings. I got the SMILES strings for our anti-viral compounds from PubChem using Python script pubchem_smiles2.py. I takes 20 seconds to run and fetches the SMILES strings for each of the 52 compounds virus_drugs_smiles.txt.

  $ python3 pubchem_smiles2.py < virus_drugs.txt
  abacavir sulfate, C1CC1NC2=C3C(=NC(=N2)N)N(C=N3)[C@@H]4C[C@@H](C=C4)CO.C1CC1NC2=C3C(=NC(=N2)N)N(C=N3)[C@@H]4C[C@@H](C=C4)CO.OS(=O)(=O)O
  acyclovir, C1=NC2=C(N1COCCO)N=C(NC2=O)N
  acyclovir sodium, C1=NC2=C(N1COCCO)N=C(N=C2[O-])N.[Na+]
  adefovir dipivoxil, CC(C)(C)C(=O)OCOP(=O)(COCCN1C=NC2=C(N=CN=C21)N)OCOC(=O)C(C)(C)C
  ...