Speeding up AlphaFold 3 MSA calculation

Tom Goddard
December 12, 2024

AlphaFold 3 predictions run locally from the GitHub source code spend most of their time (> 80%) computing the multiple sequence alignments (MSA) using the program jackhmmer, part of the HMMER package, which runs on the CPU only. This page describes how to compute MSAs as fast as possible when running large numbers of predictions. The jackhmmer sequence search program was written long ago (published 2010) and is not optimized for modern computers. Using mmseqs2 instead would allow running hundreds to millions of predictions much faster, as ColabFold did for AlphaFold 2.

Fastest MSA calculation for large numbers of predictions

Computing the MSAs with jackhmmer for predicting thousands of structures with different sequences is resource-intensive no matter how it is done. Jackhmmer is not well suited to this task, and it would be better to seek solutions that use a faster sequence search program such as mmseqs2.

But if jackhmmer is used, the fastest approach on a cluster to calculate 10000 MSAs with 20 parallel compute jobs would probably be for each job to first copy the 300-400 GB of databases from the cluster network file system to a local NVMe scratch drive, then run 500 MSA calculations from that drive. No GPU is used for the MSAs, so submitting these jobs on CPU nodes avoids leaving GPU resources idle. The GPU computation of the structures can then be done in a separate step. The AlphaFold "--run_inference false" option computes only the MSAs, and the "--run_data_pipeline false" option does only the structure calculation using already computed MSAs.
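A minimal sketch of such a two-stage batch job is shown below. The database and input paths are placeholders, and apart from the two options quoted above the flag names (--json_path, --db_dir, --output_dir, --model_dir) and the *_data.json name of the data pipeline output reflect my reading of run_alphafold.py and should be checked against the installed version.

    # Stage 1 (CPU node): copy databases to local scratch, then compute MSAs only.
    DB_SCRATCH=/scratch/$USER/af3_databases        # assumed local NVMe scratch path
    mkdir -p "$DB_SCRATCH"
    rsync -a /network/shared/af3_databases/ "$DB_SCRATCH/"   # the 300-400 GB copy

    for json in inputs/batch1/*.json; do
        python run_alphafold.py \
            --json_path="$json" \
            --db_dir="$DB_SCRATCH" \
            --output_dir=msa_output \
            --run_inference=false          # data pipeline (MSA) only
    done

    # Stage 2 (GPU node): structure inference from the already computed MSAs.
    for json in msa_output/*/*_data.json; do
        python run_alphafold.py \
            --json_path="$json" \
            --model_dir=models \
            --output_dir=predictions \
            --run_data_pipeline=false      # inference only
    done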

Jackhmmer bottlenecks

Three factors can limit the speed of the jackhmmer sequence alignment calculations: disk bandwidth, single-core CPU speed, and how many threads jackhmmer is allowed to use. The timings below show that on well-balanced hardware where all 3 factors are optimized, the MSA calculation for a 600 residue sequence takes about 6 minutes using databases on an NVMe drive and a fast Intel i9-13900K processor with 24 cores.

AlphaFold 3 default jackhmmer use

A standard AlphaFold 3 run with a single protein sequence launches 4 parallel jackhmmer processes, each with 8 worker threads plus a master thread, so a total of 36 threads. Each process searches one database; the database sizes are 17, 67, 102, and 120 GB (bfd-first_non_consensus_sequences.fasta, uniref90_2022_05.fa, uniprot_all_2021_04.fa, mgy_clusters_2022_05.fa).
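In shell terms the default behavior is roughly equivalent to the following; the jackhmmer options shown (output, alignment, and iteration settings) are illustrative, not the exact settings AlphaFold 3 passes, and the database path is a placeholder.

    DB=/path/to/af3_databases      # assumed database location
    QUERY=query.fasta              # the protein sequence being predicted

    # Four searches run in parallel, each with 8 worker threads plus a master thread.
    for db in bfd-first_non_consensus_sequences.fasta \
              uniref90_2022_05.fa \
              uniprot_all_2021_04.fa \
              mgy_clusters_2022_05.fa; do
        jackhmmer --cpu 8 -N 1 --noali -o /dev/null \
                  -A "msa_${db%.fa*}.sto" "$QUERY" "$DB/$db" &
    done
    wait    # block until all four searches finish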

Optimal jackhmmer speed

With the i9-13900K 24-core + NVMe system, each jackhmmer process reads at about 700 MB/sec, so about 2.8 GB/sec total from the NVMe drive, which is capable of 3 GB/sec. After the smallest search (17 GB) completes, the remaining 3 jackhmmer processes each use about 800% CPU, so 2400% total, in effect all 24 cores each running a thread at full utilization. (Each core can run two threads using hyperthreading.)
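The per-process CPU and disk-read rates quoted here and below can be observed while a prediction runs, for example with pidstat from the sysstat package (assumed to be installed; the 5-second sampling interval is arbitrary):

    # Show %CPU and kB read per second for every running jackhmmer process, sampled every 5 seconds.
    pidstat -u -d -C jackhmmer 5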

Fast single-core CPU speed helps

Running just a single jackhmmer process also reads only 700 MB/sec. It appears to be bottlenecked by the single jackhmmer master thread, which reads the database file, divides up the sequences, and hands them to the 8 worker threads. On another, slower CPU (Xeon Gold 6226R) with half the single-core speed, each jackhmmer process reads only 350 MB/sec and shows only about 400% CPU usage, so for the largest 3 database searches it uses only the equivalent of 12 cores at full utilization.

Searching with longer query sequence benefits from more cores

Running a jackhmmer search on a length 1200 protein sequence still reads the database at 700 MB/sec. If the jackhmmer option "--cpu 16" is used to allocate 16 worker threads, a single jackhmmer process can reach 1600% CPU and completes as fast as a length 600 search, but with double the CPU usage.
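For a full AlphaFold 3 run the worker thread count can, I believe, be raised through a run_alphafold.py flag; the flag name below is my recollection of the current code and should be verified before relying on it.

    # Assumed flag name; check with: python run_alphafold.py --help
    python run_alphafold.py --json_path=long_query.json --output_dir=out \
        --run_inference=false --jackhmmer_n_cpu=16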

Longer query sequences take more GPU time relative to MSA time

A sequence of length 2377 on the i9-13900K NVMe setup took 961 seconds to compute the MSA, 125 seconds to find structure templates, and 528 seconds on an Nvidia 4090 GPU for the structure calculation, so MSA time was 1.8x the structure calculation time. A length 600 sequence took 293 seconds for the MSA and 65 seconds for the structure calculation, so MSA time was 4.5x the structure calculation time. For longer sequences the GPU structure calculation time therefore increases much faster than the CPU MSA calculation time, and optimizing the MSA calculation has diminishing benefit in reducing overall computation time.

AF3 needs more than 64 GB of memory for some sequences

AlphaFold 3 runs four jackhmmer processes in parallel, and the three processes searching the large databases each reach 30-40 GB resident memory on a sequence that has many homologs, for example UniProt ABCG2_HUMAN. On a machine with 64 GB of memory and little swap space (the Ubuntu default is 8 GB of swap), one of the jackhmmer jobs will run out of memory and crash, terminating the AlphaFold prediction. Such sequences need 128 GB of memory, or a small change to the AlphaFold code to run the jackhmmer processes sequentially instead of in parallel.
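As an illustration (this is shell, not the actual AlphaFold Python code), removing the backgrounding from the parallel search sketch above runs the searches one after another, so only one 30-40 GB jackhmmer process is resident at a time:

    # Sequential searches: peak memory is one jackhmmer process instead of four.
    for db in bfd-first_non_consensus_sequences.fasta uniref90_2022_05.fa \
              uniprot_all_2021_04.fa mgy_clusters_2022_05.fa; do
        jackhmmer --cpu 8 -N 1 --noali -o /dev/null \
                  -A "msa_${db%.fa*}.sto" query.fasta "/path/to/af3_databases/$db"
    done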

Example AlphaFold 3 run times

Timing of AlphaFold 3 prediction speed for Nipah virus G protein (PDB 8xps chain A, 602 amino acids) with two zanamivir ligands (chemical component dictionary ZMR). Having the sequence databases on a local NVMe drive (3 GB/sec read speed) is far faster than an SSD using a SATA III connection (0.6 GB/sec) or a network file system (e.g. BeeGFS used on a compute cluster).

Disk | MSA time (seconds) | Nvidia GPU | Inference time (seconds) | Machine | Notes
NVMe | 293 | 4090 | 65 | minsky.cgl.ucsf.edu | CUDA 12.4, i9-13900K (8 performance cores, 16 efficiency cores)
SSD SATA 3 | 757 | 4090 | 73 | minsky |
SSD SATA 3 | 726 | 3090 | 104 | quillian.cgl.ucsf.edu | CUDA 12.6, i9-9900KF (8 cores)
BeeGFS | 1049 | A40 | 124 | Wynton cluster qb3-atgpu30 | CUDA 12.4, AMD EPYC 7543P (32 cores)

RAM disk

I tried using a 500 GB RAM disk (Linux tmpfs) holding the 4 AF3 protein sequence databases (total size 305 GB) and timed the AlphaFold 3 MSA calculation.
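For reference, a RAM disk like this can be created with tmpfs; the mount point and database paths below are examples, not the ones used in the test.

    # Create a 500 GB tmpfs RAM disk and copy the protein sequence databases onto it.
    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=500G tmpfs /mnt/ramdisk
    cp /path/to/af3_databases/*.fa /path/to/af3_databases/*.fasta /mnt/ramdisk/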

Disk | MSA time (seconds) | Machine
RAM disk | 543 | wilkins.cgl.ucsf.edu, dual Intel Xeon Gold 6226R (total of 32 cores)

Surprisingly this is slower than the NVMe drive test above. The RAM disk read speed, tested by copying a 17 GB database file from the RAM disk to /dev/null, was 7 GB/sec, more than twice as fast as the NVMe drive. But the CPU in this RAM disk test is slower by a factor of 2 than the CPU in the NVMe test.
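A read-speed test like the one described can be reproduced by streaming one of the database files to /dev/null; dd reports the throughput when it finishes (file and mount point names are examples):

    # Measure sustained read speed of a 17 GB database file on the RAM disk.
    dd if=/mnt/ramdisk/bfd-first_non_consensus_sequences.fasta of=/dev/null bs=1M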

Operating system memory disk cache

I tried running the same MSA calculation 3 times back-to-back on wilkins (from a single script). The machine has 1 TB of memory and appeared to have no load other than my job. The run times of 1891, 1490, and 1311 seconds suggest that not much file caching occurred. The top command showed the jackhmmer sequence search CPU usage typically under 200% even though it was using 8 worker threads, whereas the minsky NVMe test shows 800% consistently.

Disk | MSA time (seconds) | Machine
BeeGFS run1 | 1891 | wilkins.cgl.ucsf.edu, dual Intel Xeon Gold 6226R (total of 32 cores)
BeeGFS run2 | 1490 | wilkins
BeeGFS run3 | 1311 | wilkins

Times for larger predictions

Timing of AlphaFold 3 prediction speed for the cytosine methyltransferase protein DNMT5 with 2377 residues, and for dimers of DNMT5 with other proteins.

System | Residues | Disk | MSA time (seconds) | Template search (seconds) | Nvidia GPU | GPU memory | Inference time (seconds) | Machine | Notes
DNMT5 | 2377 | NVMe | 961 | 157 | 4090 | 24 GB | 523 | minsky | CUDA 12.4, i9-13900K (8 performance cores, 16 efficiency cores)
DNMT5, AFR92496 | 2808 (2377 + 431) | NVMe | 1273 | 170 | 4090 | 24 GB | Failed | minsky | Failed allocating 26.7 GB on GPU
DNMT5, AFR92703 | 3176 (2377 + 799) | NVMe | 1332 | 182 | 4090 | 24 GB | Failed | minsky | Failed allocating 36 GB on GPU
DNMT5, AFR92496 | 2808 (2377 + 431) | BeeGFS | 2068 | 328 | A40 | 48 GB | 1746 | Wynton qb3-atgpu30 | CUDA 12.4, AMD EPYC 7543P (32 cores)
DNMT5, AFR92703 | 3176 (2377 + 799) | BeeGFS | 2952 | 397 | A40 | 48 GB | 2460 | Wynton qb3-atgpu14 | CUDA 12.4
DNMT5, AFR95020 | 4330 (2377 + 1953) | BeeGFS | 4098 | 403 | A40 | 48 GB | 10900 | Wynton qb3-atgpu2 | CUDA 12.4

Can Jackhmmer code changes speed up database read?

Jackhmmer reads the database in rather small chunks, 4096 bytes at a time. I tried reading 64 KB, 1 MB, and 1 GB chunks at a time, and none of these made any difference in the time to complete a search using a 17 GB database and a length 600 sequence. Jackhmmer reads 1000 sequences at a time to hand to a worker thread. I tried increasing that to 10000 or 100000 and it made no difference in the runtime using the same 17 GB database and length 600 sequence. It appears the jackhmmer code parses the sequences into individual data structures in the disk-reading thread, and that is the bottleneck slowing down the reading of database sequences and limiting the search speed.

Notes

I learned a great deal today about the AlphaFold 3 speed bottleneck in computing the multiple sequence alignment (MSA). AlphaFold 3 uses the sequence search program jackhmmer, a 15-year-old program that is not optimized like the newer sequence search program mmseqs2. AlphaFold 2 also used jackhmmer. It turns out jackhmmer is not limited by the disk I/O read speed of the 100 GB sequence database files; it is bottlenecked by CPU compute speed. For this reason a RAM disk will not speed up jackhmmer. We were pursuing the RAM disk approach for ColabFold (the optimized AlphaFold 2) because it uses mmseqs2, which is bottlenecked by disk speed. I think a future, faster AlphaFold 3 will also use mmseqs2 for the sequence search.

Today's tests showed that our desktop minsky.cgl.ucsf.edu is well balanced between disk speed and compute speed to do AlphaFold 3 predictions with jackhmmer as fast as possible, and it is by far the fastest hardware we have for AF3. AF3's three long-running jackhmmer processes (the ones searching the three large databases) each use 9 threads and get about 700% CPU utilization, so about 2100% total, which matches minsky's 24 cores well. Each jackhmmer process consumes about 700 MB/sec of disk bandwidth, so a total of 2.1 GB/sec, which is close to the peak read speed of the NVMe drive (3 GB/sec).

Wynton is also fairly well balanced between BeeGFS disk speed and compute speed, but on average about 4x slower than minsky because 1) the Wynton CPUs have about half the single-core speed of minsky (wilkins.cgl.ucsf.edu has dual Intel Xeon Gold 6226R CPUs (2x 16 cores) and our Wynton qb3-atgpu30 node with 4 A40 GPUs has an AMD EPYC 7543P (32 cores), versus minsky's Intel i9-13900K (24 cores); PassMark benchmarking ranks minsky's single-thread speed twice as fast as wilkins and 1.7x faster than the A40 box), and 2) with half the single-core speed the jackhmmer main thread can only take in 400 MB/sec of disk I/O. The result is Wynton can only utilize about 400% CPU on each jackhmmer job, so a total of 1200%, or about 12 cores. The Wynton compute nodes are also shared, and although you can request some number of cores there is no enforcement, so other jobs may use all the cores on the machine. I think this is what leads to AF3 run times that vary by a factor of 10, because you might only get to use 1 core on average or you might get 10 cores. Wynton support says the nodes are "oversubscribed", meaning you almost surely will not get 10 cores even if you request that number.

In conclusion, I think we should look at whether the ColabFold developers have made an mmseqs2 sequence search pipeline for AlphaFold 3, in order to enable predictions that could be about 5x faster.