Tom Goddard
goddard@sonic.net
February 5, 2024
Here is how UCSF researchers can run ColabFold 1.5.5 on the UCSF Wynton cluster. You will need a Wynton account.
ColabFold is an optimized version of AlphaFold that runs about 5 to 10 times faster. The quality of the predictions is similar to AlphaFold's, although different sequence databases are used. A single ColabFold run can predict multiple structures, including single proteins and complexes.
We'll predict the structure of a small heterodimer (PDB 8R7A) of a rice protein with a rice pathogen protein.
First we create a FASTA file, sequences.fasta, with the sequences of the two proteins. It looks like this:
>8R7A
MGVLDSLSDMCSLTETKEALKLRKKRPLQTVNIKVKMDCEGCERRVKNAVKSMRGVTSVAVNPKQSRCTVTGYVEASKV
LERVKSTGKAAEMWPYVPYTMTTYPYVGGAYDKKAPAGFVRGNPAAMADPSAPEVRYMTMFSDENVDSCSIM:
MKCNNIILPFALVFFSTTVTAGGGWTNKQFYNDKGEREGSISIRKGSEGDFNYGPSYPGGPDRMVRVHENNGNIRGMPP
GYSLGPDHQEDKSDRQYYNRHGYHVGDGPAEYGNHGGGQWGDGYYGPPGEFTHEHREQREEGCNIM
The name of the complex after the ">" will be used in the output file names, so it is good to keep it short. When predicting a dimer, the two sequences are separated by a ":". For a homodimer the same sequence would be repeated. Multimers can have as many sequences as needed, separated by ":". For a single protein there would be one sequence and no ":". To predict more than one structure, add more ">" lines with more sequences.
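The FASTA conventions above can be sketched in a few lines of Python. This is a minimal illustration, not part of ColabFold; the helper name write_colabfold_fasta and the demo sequences are my own inventions (the demo uses short placeholder chains, not the real 8R7A sequences).

```python
def write_colabfold_fasta(path, name, chains):
    """Write one FASTA entry for ColabFold: a single '>' header line,
    then the chain sequences joined with ':' (one chain -> no ':')."""
    with open(path, "w") as f:
        f.write(">" + name + "\n" + ":".join(chains) + "\n")

# Demo with short placeholder sequences, not the real 8R7A chains.
write_colabfold_fasta("demo.fasta", "demo_dimer", ["MGVLDSL", "MKCNNII"])
print(open("demo.fasta").read())  # -> ">demo_dimer" then "MGVLDSL:MKCNNII"
```

Passing a single-element list writes an ordinary monomer entry, matching the single-protein case described above.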
$ scp sequences.fasta log1.wynton.ucsf.edu:
$ ssh log1.wynton.ucsf.edu
$ mkdir 8r7a
$ mv sequences.fasta 8r7a
$ cd 8r7a
The deep sequence alignments take a minute and are created on a cloud ColabFold server (located in South Korea). You can use the colabfold_batch installed in my home directory, as done here, or install your own copy of localcolabfold using the instructions on its GitHub page.
$ export PATH=/wynton/home/ferrin/goddard/localcolabfold/localcolabfold/colabfold-conda/bin:$PATH
$ colabfold_batch sequences.fasta . --msa-only >& msa.out
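The MSA step writes an .a3m alignment file (8R7A.a3m in this example). As a quick sanity check before submitting the prediction job, you can count how many sequences the alignment contains; a very shallow alignment often means a poor prediction. This is a small sketch of my own, using a tiny made-up alignment rather than the real 8R7A.a3m.

```python
def alignment_depth(a3m_text):
    """Count sequences in a3m-format text: each sequence is
    introduced by a '>' header line."""
    return sum(1 for line in a3m_text.splitlines() if line.startswith(">"))

# Demo with a tiny fabricated alignment instead of the real 8R7A.a3m:
demo = ">query\nMGVLDSL\n>hit1\nMGVLDAL\n>hit2\nMG-LDSL\n"
print(alignment_depth(demo))  # -> 3
```

For a real run you would read the file, e.g. alignment_depth(open("8R7A.a3m").read()).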
Make a copy of the launcher shell script run.sh and then submit it to the Wynton queue. This sets options and runs colabfold_batch as explained below.
$ qsub run.sh
To check if the job has started, use the Wynton qstat command. An "r" in the state column means the job is running; a "qw" means it is still waiting to run. If qstat gives no output, the job has finished.
$ qstat
job-ID   priority  name       user     state  submit/start at      queue              slots  ja-task-ID
--------------------------------------------------------------------------------------------------------
9903379  0.14275   colabfold  goddard  r      02/05/2024 17:37:46  gpu.q@qb3-atgpu10  1
This job took 2.5 minutes to complete (296 residues). Prediction time for a structure of N residues is roughly (N/30)*(N/30) seconds for an Nvidia A40 GPU. Example run times are here.
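The rough timing rule quoted above, (N/30)*(N/30) seconds on an A40, is easy to put into code. This only restates the estimate from the text, not a measurement, and it ignores fixed startup overhead (which is why the actual 8R7A job took a bit longer than the formula predicts).

```python
def estimated_seconds(n_residues):
    """Rough A40 GPU prediction time estimate from the text:
    (N/30)^2 seconds for N residues (excludes startup overhead)."""
    return (n_residues / 30.0) ** 2

print(round(estimated_seconds(296)))   # -> 97 seconds for the 296-residue dimer
```

The same rule suggests a ~4700-residue structure would need roughly (4700/30)^2 seconds, close to 7 hours, which is why the launcher script below requests an 8-hour run-time limit.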
When the prediction completes the directory will contain these files (8r7a.zip). ColabFold makes 5 predictions (files *.pdb) based on 5 differently trained neural networks. The names "model_1, model_2, ..., model_5" refer to those 5 networks, and the file names also contain "rank_001", ..., "rank_005" indicating how well each model scored. The *.json files give predicted aligned error for each model which can be viewed in ChimeraX.
8R7A.a3m
8R7A.done.txt
8R7A.pickle
8R7A_coverage.png
8R7A_env/
8R7A_pae.png
8R7A_pairgreedy/
8R7A_plddt.png
8R7A_predicted_aligned_error_v1.json
8R7A_scores_rank_001_alphafold2_multimer_v3_model_4_seed_000.json
8R7A_scores_rank_002_alphafold2_multimer_v3_model_1_seed_000.json
8R7A_scores_rank_003_alphafold2_multimer_v3_model_2_seed_000.json
8R7A_scores_rank_004_alphafold2_multimer_v3_model_3_seed_000.json
8R7A_scores_rank_005_alphafold2_multimer_v3_model_5_seed_000.json
8R7A_unrelaxed_rank_001_alphafold2_multimer_v3_model_4_seed_000.pdb
8R7A_unrelaxed_rank_002_alphafold2_multimer_v3_model_1_seed_000.pdb
8R7A_unrelaxed_rank_003_alphafold2_multimer_v3_model_2_seed_000.pdb
8R7A_unrelaxed_rank_004_alphafold2_multimer_v3_model_3_seed_000.pdb
8R7A_unrelaxed_rank_005_alphafold2_multimer_v3_model_5_seed_000.pdb
cite.bibtex
colabfold.e9903379
colabfold.o9903379
config.json
log.txt
msa.out
run.sh*
sequences.fasta
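The *_scores_*.json files can also be inspected directly with a few lines of Python. This is a sketch of my own: I am assuming the score keys "plddt" (a per-residue list), "ptm", and "iptm", which ColabFold multimer score files typically contain; check one of your own files if the keys differ. The demo parses a tiny fabricated record rather than a real output file.

```python
import json

def summarize_scores(json_text):
    """Return (mean pLDDT, pTM, ipTM) from ColabFold-style score JSON.
    Assumed keys: "plddt" (list of floats), "ptm", "iptm"."""
    s = json.loads(json_text)
    mean_plddt = sum(s["plddt"]) / len(s["plddt"])
    return mean_plddt, s.get("ptm"), s.get("iptm")

# Demo with a tiny fabricated record instead of a real scores file:
demo = '{"plddt": [90.0, 80.0, 70.0], "ptm": 0.8, "iptm": 0.7}'
print(summarize_scores(demo))  # -> (80.0, 0.8, 0.7)
```

For a real run you would pass the contents of, e.g., the rank_001 scores file to summarize_scores.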
The standard output and error are in files named with the job id and colabfold_batch writes a log.txt file.
colabfold.e9903379 colabfold.o9903379 log.txt
#!/bin/sh
#$ -S /bin/sh
#$ -q gpu.q
#$ -N colabfold
#$ -cwd
#$ -l h_rt=08:00:00
#$ -l mem_free=60G
#$ -l scratch=50G
#$ -l compute_cap=80,gpu_mem=40G

# Specify which GPU to use.
echo "Wynton assigned GPU" $SGE_GPU
export CUDA_VISIBLE_DEVICES=$SGE_GPU

# Add the path to the colabfold_batch executable
export PATH=/wynton/home/ferrin/goddard/localcolabfold/localcolabfold/colabfold-conda/bin:$PATH

exec colabfold_batch --num-recycle 3 sequences.fasta .
The "#$" comments at the top specify options to the qsub command: the interpreter to run this script (/bin/sh), the Wynton queue to use (gpu.q), the job name (colabfold), that the job starts in the current directory, the maximum run time allowed (8 hours), how much memory and scratch disk space to request, and the required GPU compute capability (80) and GPU memory (40 GBytes). These GPU settings usually get an Nvidia A40 GPU with 48 GBytes of memory, capable of predicting structures up to about 4700 amino acids.
The CUDA_VISIBLE_DEVICES environment variable is set so that colabfold_batch only uses the GPU that Wynton has allocated to it.
The directory where colabfold_batch is found is added to the executable search path.
Then colabfold_batch is run on our sequences file, with output files written to ".", the current directory.
The deep sequence alignments are done in an initial step, above, because the Wynton compute nodes do not have internet access while the login node does, and computing the alignments requires contacting the cloud MSA server. Leaving out that step will cause the prediction job to fail.