How to optimise process/thread/GPU affinity, NUMA placement and topology-aware rank ordering

Why This Matters

Modern HPC nodes have multiple CPU sockets and NUMA domains. Poor placement can make memory access 2–3x slower and reduce CPU–GPU transfer bandwidth. Proper binding keeps processes on the right cores, with local memory and nearby GPUs.

First: Understand Your System

Before optimising, check your hardware topology:

  • numactl --hardware – shows CPU sockets and NUMA nodes
  • nvidia-smi topo -m – shows which GPUs are closest to which CPU sockets
  • lstopo – displays the full system topology

Look for which NUMA node each GPU is connected to – this is critical for GPU jobs.
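
If the topology matrix is hard to read, you can also query the NUMA node of each GPU directly from sysfs. The bus ID below is illustrative – take yours from the nvidia-smi output:

# List each GPU's index and PCI bus ID
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader

# Look up the NUMA node for that device (sysfs uses a lower-case, 4-digit-domain bus ID)
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node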

Essential MPI Binding (OpenMPI)

Basic CPU binding:

mpirun --map-by socket --bind-to core ./my_app

This distributes MPI ranks across sockets and pins them to cores. Use this as your default.
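
To confirm what OpenMPI actually did, add --report-bindings (a standard OpenMPI option) – each rank then prints its binding mask at startup:

mpirun --map-by socket --bind-to core --report-bindings ./my_app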

For multi-GPU nodes (e.g., 4 GPUs on 2 sockets):

mpirun --map-by ppr:2:socket --bind-to socket ./my_app

This places 2 ranks per socket, matching the GPU layout.
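
On CPUs that expose several NUMA domains per socket (common on recent AMD EPYC nodes), the NUMA domain is often a better mapping unit than the socket. A hedged variant of the same idea, with the per-domain count adjusted to your GPU layout:

# 1 rank per NUMA domain, bound to that domain
mpirun --map-by ppr:1:numa --bind-to numa ./my_app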

Intel MPI equivalent:

mpirun -genv I_MPI_PIN_DOMAIN=socket ./my_app
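
Intel MPI can report the pinning it chose: setting I_MPI_DEBUG to 4 or higher makes each rank print its pinned CPUs at startup, which is an easy sanity check:

mpirun -genv I_MPI_PIN_DOMAIN=socket -genv I_MPI_DEBUG=4 ./my_app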

Essential Slurm Settings

In your Slurm script:

#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4

srun --cpu-bind=cores ./my_app

The --cpu-bind=cores on the srun line is crucial – it tells Slurm to bind each task to specific cores (note that --cpu-bind is an srun option, not an sbatch directive).
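
To see what Slurm actually did, the verbose modifier on --cpu-bind makes each task report the CPU mask it was bound to:

srun --cpu-bind=verbose,cores ./my_app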

For GPU Jobs

Key principle: Bind CPU processes to the same NUMA node as their GPU.

Check nvidia-smi topo -m to see GPU-CPU affinity, then use MPI mapping to match. For 4 GPUs (2 per socket):

mpirun -np 4 --map-by ppr:2:socket --bind-to socket ./gpu_app

Assign GPUs in your code based on local MPI rank, or set CUDA_VISIBLE_DEVICES appropriately.
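
A common way to do this is a small wrapper script that picks a GPU from the node-local rank before launching the application. A minimal sketch – the script name is hypothetical, and it assumes OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK or Slurm's SLURM_LOCALID is set and one GPU per rank:

#!/bin/bash
# gpu_wrapper.sh (hypothetical name) – map each rank's local ID to one GPU
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
exec "$@"

Launch it in front of the application, e.g. mpirun -np 4 --map-by ppr:2:socket --bind-to socket ./gpu_wrapper.sh ./gpu_app. If device numbering does not line up with socket order on your node, replace the identity mapping with a small lookup table based on nvidia-smi topo -m.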

For Hybrid MPI+OpenMP

Set these environment variables:

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores

Then launch with:

mpirun --map-by socket:PE=8 --bind-to core ./hybrid_app

The PE=8 gives each MPI rank 8 cores for its OpenMP threads (in OpenMPI, PE is a --map-by qualifier, not a --bind-to option).
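
Putting the resource request, the OpenMP settings, and the launch line together, a hedged example for one 2-socket node running 2 ranks of 8 threads each (adjust the counts to your hardware):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores

mpirun --map-by socket:PE=8 --bind-to core ./hybrid_app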

Using numactl Directly

For single-process jobs or testing:

numactl --cpunodebind=0 --membind=0 ./my_app

Binds to NUMA node 0 (first socket).
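
The same idea works for pinning a single-GPU run next to its GPU – bind to whichever NUMA node the topology tools reported for that GPU (node 1 here is illustrative):

# GPU reported on NUMA node 1 by nvidia-smi topo -m
numactl --cpunodebind=1 --membind=1 ./gpu_app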

For memory-bandwidth intensive work:

numactl --interleave=all ./my_app

Distributes memory across all NUMA nodes.
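
Whether interleaving actually helps is workload-dependent, so it is worth timing both policies on your own code before committing to one:

# Local memory on node 0 versus pages spread across all nodes
time numactl --cpunodebind=0 --membind=0 ./my_app
time numactl --cpunodebind=0 --interleave=all ./my_app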

Quick Verification

Check if your binding worked:

numastat -p <pid>

If binding worked, memory should sit mostly on one NUMA node.

ps -eLo pid,psr,comm | grep my_app

The psr column shows which core each thread is currently running on.
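
Two more low-effort checks, using the same <pid> placeholder as above: taskset reports the allowed-CPU list directly, and the kernel exposes the same mask in /proc:

# CPUs the process is allowed to run on
taskset -cp <pid>

# Same information straight from the kernel
grep Cpus_allowed_list /proc/<pid>/status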

Most Common Mistakes to Avoid

  • Not binding at all – the OS will move your processes around
  • Using all cores on one socket for a multi-GPU job when GPUs span both sockets
  • Forgetting --cpu-bind=cores on the srun line in Slurm scripts
  • Not setting OMP_PROC_BIND for hybrid jobs

Bottom Line

For CPU-only MPI jobs:

--map-by socket --bind-to core

For GPU jobs:

--map-by ppr:N:socket

where N matches the number of GPUs per socket. Always check the topology first with

nvidia-smi topo -m

In Slurm: Always add

--cpu-bind=cores

For hybrid: Set

OMP_PROC_BIND=close

and use

--map-by socket:PE=N --bind-to core

Start with these basics and measure performance. Most gains come from just enabling binding – fine-tuning can come later.