How to optimise process/thread/GPU affinity, NUMA placement and topology-aware rank ordering
Why This Matters
Modern HPC nodes have multiple CPU sockets and NUMA domains. Poor placement can make memory access 2–3x slower and reduce GPU transfer bandwidth. Proper binding keeps each process on the right cores, with local memory and a nearby GPU.
First: Understand Your System
Before optimising, check your hardware topology:
- numactl --hardware (shows CPU sockets and NUMA nodes)
- nvidia-smi topo -m (shows which GPUs are closest to which CPU sockets)
- lstopo (displays the full system topology)
Note which NUMA node each GPU is connected to; this is critical for GPU jobs.
Essential MPI Binding (OpenMPI)
Basic CPU binding:
mpirun --map-by socket --bind-to core ./my_app
This distributes MPI ranks across sockets and pins them to cores. Use this as your default.
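To confirm the mapping actually took effect, Open MPI's --report-bindings flag prints each rank's binding at launch (my_app is a placeholder name):
mpirun --map-by socket --bind-to core --report-bindings ./my_app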
For multi-GPU nodes (e.g., 4 GPUs on 2 sockets):
mpirun --map-by ppr:2:socket --bind-to socket ./my_app
This places 2 ranks per socket, matching the GPU layout.
Intel MPI equivalent:
mpirun -genv I_MPI_PIN_DOMAIN=socket ./my_app
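Intel MPI can report its pinning as well: raising the debug level should print the pin map at startup, serving the same verification purpose as Open MPI's --report-bindings (exact output varies by Intel MPI version):
mpirun -genv I_MPI_DEBUG=4 -genv I_MPI_PIN_DOMAIN=socket ./my_app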
Essential Slurm Settings
In your Slurm script:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4

srun --cpu-bind=cores ./my_app
The --cpu-bind=cores option on srun is crucial: it tells Slurm to pin each task to specific cores.
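Slurm can also report what it did: adding the verbose modifier to --cpu-bind makes each task print its CPU mask before it starts, which is a quick sanity check:
srun --cpu-bind=verbose,cores ./my_app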
For GPU Jobs
Key principle: Bind CPU processes to the same NUMA node as their GPU.
Check nvidia-smi topo -m to see GPU-CPU affinity, then use MPI mapping to match. For 4 GPUs (2 per socket):
mpirun -np 4 --map-by ppr:2:socket --bind-to socket ./gpu_app
Assign GPUs in your code based on local MPI rank, or set CUDA_VISIBLE_DEVICES appropriately.
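A common way to handle the CUDA_VISIBLE_DEVICES route is a small wrapper script that maps each rank's node-local ID to a GPU. This is a minimal sketch; gpu_wrapper.sh and gpu_app are placeholder names, and it assumes GPU indices line up with the socket order shown by nvidia-smi topo -m (adjust the mapping if they do not):
#!/bin/bash
# gpu_wrapper.sh: give each rank one GPU based on its node-local rank.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI, SLURM_LOCALID by srun.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
exec ./gpu_app "$@"
Launch the wrapper in place of the application binary:
mpirun -np 4 --map-by ppr:2:socket --bind-to socket ./gpu_wrapper.sh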
For Hybrid MPI+OpenMP
Set these environment variables:
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores
Then launch with:
mpirun --map-by socket:PE=8 --bind-to core ./hybrid_app
The PE=8 modifier gives each MPI rank 8 cores for its OpenMP threads.
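Putting the Slurm and OpenMP pieces together, a hybrid job script might look like this sketch (2 ranks per node with 8 cores each is an assumption for illustration; on recent Slurm versions you may also need to pass --cpus-per-task=$SLURM_CPUS_PER_TASK to srun explicitly):
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close
export OMP_PLACES=cores

srun --cpu-bind=cores ./hybrid_app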
Using numactl Directly
For single-process jobs or testing:
numactl --cpunodebind=0 --membind=0 ./my_app
Binds to NUMA node 0 (first socket).
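The same idea works for a quick single-GPU test: pin the process to the GPU's local NUMA node. GPU 0 living on NUMA node 0 is an assumption here; confirm it with nvidia-smi topo -m first.
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 ./gpu_app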
For memory-bandwidth intensive work:
numactl --interleave=all ./my_app
Distributes memory across all NUMA nodes.
Quick Verification
Check whether your binding worked:
numastat -p my_app
This should show memory allocated mostly on one NUMA node if the process is bound correctly.
ps -eLo pid,psr,comm | grep my_app
This shows which core (the psr column) each thread last ran on.
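Another quick check is to print each process's allowed CPU list with taskset, which shows the binding mask rather than just the core a thread last ran on (my_app is a placeholder name):
for pid in $(pgrep my_app); do taskset -cp $pid; done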
Most Common Mistakes to Avoid
- Not binding at all: the OS will migrate your processes between cores and NUMA nodes
- Putting all ranks on one socket for a multi-GPU job when the GPUs span both sockets
- Forgetting --cpu-bind=cores on srun in Slurm scripts
- Not setting OMP_PROC_BIND for hybrid jobs
Bottom Line
For CPU-only MPI jobs:
--map-by socket --bind-to core
For GPU jobs:
--map-by ppr:N:socket
where N is the number of GPUs per socket. Always check the topology first with
nvidia-smi topo -m
In Slurm: Always add
--cpu-bind=cores to your srun line
For hybrid: Set
OMP_PROC_BIND=close
and use
--map-by socket:PE=N --bind-to core
Start with these basics and measure performance. Most gains come from just enabling binding; fine-tuning can come later.