Benchmarking and using AlphaFold 3 on Leonardo

Author: Leonardo Salicari (CINECA)

Introduction

All the results presented in this document use AlphaFold 3 (AF3, hereafter) version 3.0.1 [1] within a Singularity container, already provided on Leonardo through a module.

Given protein and/or nucleic acid sequences, AlphaFold 3 predicts their corresponding three-dimensional atomic structures. AF3 predictions can be divided into two stages. Following AF3 terminology, the first is the data pipeline, in which Multiple Sequence Alignments (MSAs) are compiled, template searches are performed, and other operations required to generate the model inputs are executed. The subsequent step is the inference pipeline, where these inputs are processed by the Machine Learning model to predict polypeptide or DNA/RNA structures.

From a performance perspective, the data pipeline represents the main bottleneck of the entire prediction workflow. The primary software used for biological sequence analysis is the HMMER suite [2] (jackhmmer and nhmmer), which presents several intrinsic limitations that will be discussed below.
Alternative software tools exist; however, it is not advised to replace the HMMER routines with other tools because this may reduce the correlation between AF3 model confidence and prediction accuracy, since AF3 weights were trained using HMMER-derived inputs [3].

For this reason, the present document reports a set of benchmarks on the data pipeline and provides practical suggestions for efficiently exploiting HPC resources when running AF3 inference. In addition, further insights are presented for AF3 runs performed on the CINECA Leonardo cluster.

 

Data pipeline and Slurm flags

Performance improvements rely on the multi-threading implementation of the HMMER binaries. However, these implementations scale poorly beyond two cores per process, as shown below. MPI versions of both nhmmer and jackhmmer have been developed; however, preliminary tests on Leonardo show no improvements with respect to the shared-memory versions for large inputs.

When requesting resources from Slurm for a data pipeline, two flags must be set and carefully controlled.

The first is ntasks-per-node, which determines the number of processes spawned per node. During the data pipeline, AF3 spawns a set of nhmmer and jackhmmer processes per chain to complete the database searches.

The second flag is cpus-per-task, which controls the number of CPU cores available per process (task). The number of cores reserved for each nhmmer/jackhmmer process is controlled by the AF3 flags nhmmer_n_cpu and jackhmmer_n_cpu, respectively. Therefore, if four cores are reserved with the Slurm directive --cpus-per-task=4, then the AF3 call should include --nhmmer_n_cpu=$SLURM_CPUS_PER_TASK and --jackhmmer_n_cpu=$SLURM_CPUS_PER_TASK, exploiting the corresponding Slurm environment variable.
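For example, with --cpus-per-task=2 the pairing looks as follows (a fragment only; mandatory flags such as --json_path and --output_dir are omitted here and appear in the full script later in this document):

#SBATCH --cpus-per-task=2

# forward the same core count to the HMMER binaries spawned by AF3
python3 /app/alphafold/run_alphafold.py \
    --jackhmmer_n_cpu=$SLURM_CPUS_PER_TASK \
    --nhmmer_n_cpu=$SLURM_CPUS_PER_TASK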

Data pipeline benchmark results on Leonardo Booster

For these tests, the input consists of a protein complex composed of two chains, each formed by several hundred residues (a modified structure derived from PDB 8SOS). All results refer to runs performed on the Leonardo Booster partition.

‘Booster’ is an accelerated partition of the Leonardo cluster, equipped with custom NVIDIA A100 (Ampere) GPUs with 64 GiB of memory (VRAM) [4][5].

The following table reports runs with varying Slurm parameters, and therefore allocated resources, for the same input. “Data Walltime” and “Inference Walltime” indicate the walltime of the data and inference (50 diffusion samples) pipelines, respectively. All runs were performed with the --exclusive directive and using 1 GPU per node.

 

| nodes | ntasks-per-node | cpus-per-task | Data Walltime [s] | Inference Walltime (50 diff. samples) [s] |
|-------|-----------------|---------------|-------------------|-------------------------------------------|
| 1     | 1               | 2             | 976.85            | 147.40                                    |
| 1     | 2               | 1             | 1395.27           | 148.19                                    |
| 1     | 2               | 2             | 992.39            | 147.87                                    |
| 1     | 2               | 4             | 983.21            | 148.04                                    |
| 1     | 4               | 2             | 978.27            | 148.09                                    |

 

One can observe that the inference walltime is not affected by changing the number of tasks and CPUs, because inference runs exclusively on the GPU. Another observation is that the HMMER suite benefits from multithreading: increasing the number of cores per task from 1 to 2 yields a speedup of approximately 1.4×. However, using more than two cores per task does not provide further improvements.

Since the AF3 databases are read concurrently by jackhmmer and nhmmer threads, file access may limit parallel I/O performance. Leonardo uses the Lustre filesystem, which can stripe a file or directory so that its data is distributed across multiple storage targets, enabling parallel access to different sections of the data and thus improving I/O performance. For more details, please refer to CINECA’s documentation on Lustre. The user can set the stripe properties of a directory so that every new file created inside it is striped. As a test case, a directory named public_databases can be set to have 8 stripes with a stripe size of 4 MiB using lfs setstripe -c 8 -S 4M public_databases. Downloading the AF3 databases into the public_databases folder with the provided script then stripes them accordingly (note: the databases require ~250 GB for the download and ~630 GB of storage once uncompressed). The following table shows the improvement obtained when accessing the striped databases (inference time omitted):
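As a sketch, the striping setup described above can be created and verified as follows (the stripe count and size follow the example in the text; the directory name is illustrative):

mkdir -p public_databases
lfs setstripe -c 8 -S 4M public_databases   # 8 stripes, 4 MiB stripe size
lfs getstripe -d public_databases           # check the default layout applied to new files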

 

| nodes | ntasks-per-node | cpus-per-task | Data Walltime [s] |
|-------|-----------------|---------------|-------------------|
| 1     | 2               | 2             | 868               |
| 1     | 4               | 2             | 870               |

 

Striping provides an improvement of roughly 100 s over the non-striped databases. Striped AF3 databases are already provided to users on Leonardo through the alphafold/3.0.1 module.

The following table shows the walltime required to compute different numbers of diffusion samples (controlled by the AF3 flag --num_diffusion_samples):

 

| nodes | ntasks-per-node | cpus-per-task | Diffusion Samples | Data Walltime [s] | Inference Walltime [s] |
|-------|-----------------|---------------|-------------------|-------------------|------------------------|
| 1     | 4               | 2             | 1                 | 976.85            | 40.37                  |
| 1     | 4               | 2             | 10                | 975.25            | 62.08                  |
| 1     | 4               | 2             | 50                | 988.05            | 148.00                 |

 

In this case, the data walltime is expected to fluctuate around an average value (979.605 s with a standard deviation of 5 s, also including the run from the previous table) because the MSA and the other inputs are computed only once regardless of the number of diffusion samples. Furthermore, we can see that the inference walltime does not scale linearly with the number of diffusion samples. This is most likely because the model re-uses some parts of the first diffusion sample.
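As an illustration, the number of diffusion samples is controlled by a single additional flag on the usual run command (a fragment only; the other mandatory flags are shown in the full script later in this document):

# request 10 diffusion samples for this prediction
python3 /app/alphafold/run_alphafold.py \
    --json_path=/input_dir/fold_input.json \
    --output_dir=/output_dir \
    --num_diffusion_samples=10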

Leonardo Booster: Inference time as a function of tokens

The figure below shows the AF3 inference walltime as a function of the input sequence size, measured in tokens. The plot compares predictions performed on Leonardo Booster, which is equipped with custom NVIDIA A100 GPUs with 64 GiB of memory [5:1], and the benchmarks reported by DeepMind using an A100 GPU with 80 GiB of memory [6], in both cases using a single GPU per prediction.

Figure: AF3 inference walltime as a function of the input sequence size, measured in tokens, on Leonardo Booster compared with the DeepMind reference benchmarks.

 

The difference between the two configurations does not exceed a few percent, indicating good performance on Booster as the input size increases. Only inputs up to 4000 tokens (approximately 1330 residues) were considered, due to the 64 GiB memory limitation. This constraint can be partially mitigated by enabling a unified memory approach, which allows memory to be shared between the host and the device. The outlier at approximately 7250 tokens is an example of this approach (no comparable cases are reported in the DeepMind benchmarks). In this case, the performance degradation is clearly visible, due to the communication overhead introduced by host-to-device and device-to-host data transfers. This demonstrates that, although unified memory enables larger inputs, it introduces a significant performance penalty.
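For reference, unified memory is enabled through JAX/XLA environment variables. The settings below follow the AF3 performance documentation [8]; treat them as a sketch and check that documentation for the values appropriate to your inputs and GPU:

export XLA_PYTHON_CLIENT_PREALLOCATE=false
export TF_FORCE_UNIFIED_MEMORY=true
export XLA_CLIENT_MEM_FRACTION=3.2   # lets the process address more memory than the physical VRAM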

A natural question is whether a multi-GPU setup could be used for inference in order to exploit fast device-to-device communication. Currently, the AF3 developers [7] do not support such a configuration and do not plan to introduce it. Therefore, a single inference run is limited to a single GPU.

More than one prediction per GPU?

With the default settings of AF3, inferring a structure allocates most of the available GPU memory for the model. This behavior is controlled by the following JAX environment variables [8]:

XLA_PYTHON_CLIENT_PREALLOCATE=true
XLA_CLIENT_MEM_FRACTION=0.95

These parameters instruct XLA to pre-allocate 95% of the available GPU memory. Therefore, to allow multiple predictions to run on a single GPU, the value of XLA_CLIENT_MEM_FRACTION must be scaled accordingly.
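For example, to leave room for two concurrent predictions on the same GPU, each prediction could be started with roughly half of the default fraction (the 0.45 value is an illustrative assumption, not an official recommendation):

export XLA_PYTHON_CLIENT_PREALLOCATE=true
export XLA_CLIENT_MEM_FRACTION=0.45   # about half of the default 0.95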

On Leonardo Booster, requesting 1 GPU with --gres=gpu:1 implies that one quarter of the node is reserved; consequently, 8 CPU cores and 128 GiB of RAM are also allocated. For this reason, in order to efficiently exploit all reserved CPU cores while taking advantage of the HMMER multithreading implementation, it is advisable to spawn multiple tasks per node while assigning 2 cores per task. In practice, a job script may start with:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1

and include two AF3 calls in which resources are distributed using srun (i.e., separate job steps). This configuration remains reasonable because the data pipeline walltime for the --ntasks-per-node=4, --cpus-per-task=2 allocation and the --ntasks-per-node=2, --cpus-per-task=4 allocation are comparable (see the previous table).
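A minimal sketch of such a job body follows. Here run_af3.sh is a hypothetical wrapper around the singularity invocation shown later in this document, the 0.45 memory fraction is illustrative, and the srun step flags (e.g. --exact, per-step GPU binding) may need adjustment for the Slurm configuration in use:

export XLA_CLIENT_MEM_FRACTION=0.45   # share the GPU memory between the two predictions

# two independent job steps, each taking half of the allocated cores
srun --ntasks=1 --cpus-per-task=4 --exact bash run_af3.sh inputs/complex_A.json out_A &
srun --ntasks=1 --cpus-per-task=4 --exact bash run_af3.sh inputs/complex_B.json out_B &
wait   # wait for both predictions to finish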

The main drawback of this configuration is the limitation on input size. On Booster, this would approximately halve the maximum number of tokens acceptable for a prediction, reducing the limit from about 4000 to approximately 2000 tokens. Consequently, this optimization may be useful only for relatively small complexes.

Sbatch script example to run the whole AF3 pipeline on Leonardo Booster for multiple inputs

The rationale is to spawn as many predictions as possible per node while maximizing performance for both the data and the inference pipelines.

The following is an example sbatch script to spawn multiple, concurrent, and independent AF3 predictions (entire pipeline) on Leonardo Booster. In this case, Slurm job arrays are used to control job submission. The flag --array=0-999%50 submits 1000 jobs, indexed from 0 to 999, with at most 50 of them running concurrently at any given time. Although Leonardo Booster users have access to larger array sizes [9], 50 is generally considered a safe value to avoid overloading the scheduler.

In this example, we use an input_list.txt file that lists the input names. The file has the following structure:

[username@login01 AF3_dir]$ head -n 3 input_list.txt
system_9EN2.json
system_6LR7.json
system_7Z1X.json

These are the typical AF3 .json inputs, stored in the directory pointed to by INPUT_DIR (inputs/ in this example):

[username@login01 AF3_dir]$ tree inputs | head -n 4
inputs
├── system_8SOS.json
├── system_4X7F.json
├── system_8TW3.json

Therefore, the directory content is:

[username@login01 AF3_dir]$ ls
inputs/ slurm/ input_list.txt sbatch_script.sh
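The input_list.txt file can be generated directly from the inputs directory, for example with a simple one-liner (file names only, one per line):

(cd inputs && ls *.json) > input_list.txt
wc -l input_list.txt   # should match the size of the Slurm job array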

 

ℹ️ Note

On Leonardo, the AF3 module provides the user with all public databases; however, it does not provide AF3 weights due to their restrictive license [10]. Therefore, the user is responsible for downloading them and making them discoverable by the container.

 

The following sbatch script (sbatch_script.sh) is a minimal working example, containing FIXME comments where the user must modify the indicated values:

#!/bin/bash -e

#SBATCH --job-name=af3_jobarray # FIXME
#SBATCH --account=your_account # FIXME
#SBATCH --partition=boost_usr_prod

# These are the resources PER JOB in the array
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00 # FIXME: depends on the input sizes

#SBATCH --array=0-999%50 # FIXME

#SBATCH --output=slurm/%x-%A_%a.out # job name - job id - array id
#SBATCH --error=slurm/%x-%A_%a.err
#SBATCH --mail-user=your_email_address # FIXME
#SBATCH --mail-type=FAIL,END # refers to the whole array job, not to each array task (hence, you don't get spammed)

set -euo pipefail

# Load required modules to execute AF3 on Leonardo
ml purge
ml profile/bioinf alphafold/3.0.1

# ============================================================================
# SETUP
# ============================================================================
mapfile -t INPUT_LIST < input_list.txt
# On Leonardo, the user is responsible for requesting, downloading,
# and uploading the AF3 weights, due to license limitations
ALPHAFOLD_PRIVATE_MODELS="path/to/my/weights" # FIXME
INPUT_DIR="path/to/inputs" # FIXME

# it is suggested to include a job identifier in the path (e.g. input name, array task id, etc.)
RUN_DIR="path/to/output" # FIXME
mkdir -p "$RUN_DIR"
cp "${INPUT_DIR}/${INPUT_LIST[$SLURM_ARRAY_TASK_ID]}" "${RUN_DIR}/fold_input.json"

# ============================================================================ 
# RUN AF3 WHOLE PIPELINE 
# ============================================================================
# Disable exit-on-error so that the log is fully captured by tee even if AF3 fails
set +e

singularity run --nv \
    -B $ALPHAFOLD_PRIVATE_MODELS:/root/models \
    -B $ALPHAFOLD_PUBLIC_DB:/root/public_databases \
    -B /leonardo/prod/opt \
    -B $TMPDIR:/tmp \
    -B "$RUN_DIR:/output_dir" \
    -B "$RUN_DIR:/input_dir" \
    $ALPHAFOLD_IMAGE \
    python3 /app/alphafold/run_alphafold.py \
    --model_dir=/root/models \
    --db_dir=/root/public_databases \
    --json_path=/input_dir/fold_input.json \
    --output_dir="/output_dir" \
    --jackhmmer_n_cpu=$SLURM_CPUS_PER_TASK \
    --nhmmer_n_cpu=$SLURM_CPUS_PER_TASK \
    2>&1 | tee "$RUN_DIR/af3.log"

All environment variables that are not explicitly defined in the sbatch script are provided by the AF3 module (e.g. ALPHAFOLD_PUBLIC_DB or ALPHAFOLD_IMAGE). All outputs will be stored in the RUN_DIR directory. To avoid data race conditions among jobs, it is recommended to choose a path that is unique for each job, for example RUN_DIR="./out/${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}".

Separate data and inference pipelines

The previous sbatch script runs the entire pipeline on Leonardo Booster. A best practice is to separate the data pipeline and the inference pipeline, executing them on Leonardo DCGP and Leonardo Booster, respectively. This approach avoids leaving the GPU idle during the data pipeline—where the accelerator is not used—and takes advantage of the higher CPU core count available per node on DCGP.

The same script shown above can be reused for the data pipeline by setting the AF3 flags --run_data_pipeline=True and --run_inference=False. When running only the inference stage (i.e., using --run_data_pipeline=False and --run_inference=True), the input must be the *_data.json file produced by the data pipeline and stored in RUN_DIR, rather than the original .json input file [11][12].
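A minimal sketch of the two-stage workflow is given below. The singularity wrapper and bind mounts are the same as in the sbatch script above; <prediction_name> is a placeholder for the name produced by the data pipeline (see the output documentation [11] for the exact location of the *_data.json file):

# Stage 1 - data pipeline only (e.g. on Leonardo DCGP)
python3 /app/alphafold/run_alphafold.py \
    --json_path=/input_dir/fold_input.json \
    --output_dir=/output_dir \
    --run_data_pipeline=True \
    --run_inference=False

# Stage 2 - inference only (on Leonardo Booster), starting from the *_data.json produced by stage 1
python3 /app/alphafold/run_alphafold.py \
    --json_path=/output_dir/<prediction_name>/<prediction_name>_data.json \
    --output_dir=/output_dir \
    --run_data_pipeline=False \
    --run_inference=True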

Takeaways

  1. If the project has an allocation on both the Booster and DCGP partitions on Leonardo, it is advisable to run the data pipeline on DCGP in order to exploit CPU-based parallelization. This can be achieved using the flags --run_data_pipeline=True and --run_inference=False. Subsequently, the inference stage can be executed on Booster to exploit GPU acceleration of the AF3 model (--run_data_pipeline=False and --run_inference=True). Note that the input for the inference pipeline is the *_data.json output produced by the data pipeline [12:1].
  2. The Slurm directive --cpus-per-task and the AF3 flags --jackhmmer_n_cpu / --nhmmer_n_cpu must be consistent, for example: --jackhmmer_n_cpu=$SLURM_CPUS_PER_TASK.
  3. For the data pipeline, the HMMER suite performs best with 2 cores per task; therefore, --cpus-per-task=2 is recommended. --ntasks-per-node can be set to either 2 or 4; if you request a GPU on Booster, 4 is advised.

Final remarks

Version 3.0.1 was released at the beginning of 2025. Several improvements have been introduced in the project since that tagged release, such as improved handling of host memory usage [13] and a better default assignment of CPU cores to the HMMER suite. Therefore, users are encouraged to check the main branch of the project [1:1] to verify whether a particular improvement may be relevant to their workflows. A new tagged release is scheduled for April 2026 [14]. When it becomes available, the corresponding update will be deployed on the Leonardo cluster.

Finally, for additional recommendations on optimizing AF3 prediction pipelines, users are encouraged to refer to the performance documentation available in the AF3 repository [8:1].

References

  1. AF3 GitHub repo for tag 3.0.1

  2. HMMER GitHub repository

  3. “We recommend using Jackhmmer/Nhmmer for best accuracy, especially in cases where the MSA is shallow.”, link to GitHub issue.

  4. A document describing the Leonardo architecture can be found here

  5. CINECA’s documentation on Leonardo

  6. DeepMind benchmarks

  7. Link to the issue

  8. GPU memory allocation is controlled by JAX, hence by the XLA compiler that translates the model for the GPU in use (NVIDIA or AMD). For more details, see the performance documentation.

  9. See scontrol show config | grep MaxArraySize

  10. Weights license

  11. Documentation on the output format

  12. Overview by the AF3 developer on how to separate data from inference

  13. Commit of the improvement

  14. AF3 will receive a new tagged release