How to optimise process/thread/GPU affinity, NUMA placement and topology-aware rank ordering

Why This Matters

Modern HPC nodes have multiple CPU sockets and NUMA domains. Poor placement can make memory access 2–3x slower and reduce CPU–GPU transfer bandwidth. Proper binding keeps processes on the right cores, with local memory and nearby GPUs.

First: Understand Your System

Before optimising, check your hardware topology:

  • numactl --hardware – shows CPU sockets and NUMA nodes
  • nvidia-smi topo -m – shows which GPUs are closest to which CPU sockets
  • lstopo – displays the full system topology

Look for which NUMA node each GPU is connected to – this is critical for GPU jobs.
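
If the topology matrix is hard to read, you can also query the NUMA node of each GPU directly from sysfs. The bus ID below is illustrative – take yours from the nvidia-smi output:

# List each GPU's index and PCI bus ID
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader

# Look up the NUMA node for that device (sysfs uses a lower-case, 4-digit-domain bus ID)
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node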

Essential MPI Binding (OpenMPI)

Basic CPU binding:

mpirun --map-by socket --bind-to core ./my_app

This distributes MPI ranks across sockets and pins them to cores. Use this as your default.
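
To confirm what OpenMPI actually did, add --report-bindings (a standard OpenMPI option) – each rank then prints its binding mask at startup:

mpirun --map-by socket --bind-to core --report-bindings ./my_app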

For multi-GPU nodes (e.g., 4 GPUs on 2 sockets):

mpirun --map-by ppr:2:socket --bind-to socket ./my_app

This places 2 ranks per socket, matching the GPU layout.
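
On CPUs that expose several NUMA domains per socket (common on recent AMD EPYC nodes), the NUMA domain is often a better mapping unit than the socket. A hedged variant of the same idea, with the per-domain count adjusted to your GPU layout:

# 1 rank per NUMA domain, bound to that domain
mpirun --map-by ppr:1:numa --bind-to numa ./my_app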

Intel MPI equivalent:

mpirun -genv I_MPI_PIN_DOMAIN=socket ./my_app
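
Intel MPI can report the pinning it chose: setting I_MPI_DEBUG to 4 or higher makes each rank print its pinned CPUs at startup, which is an easy sanity check:

mpirun -genv I_MPI_PIN_DOMAIN=socket -genv I_MPI_DEBUG=4 ./my_app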

Essential Slurm Settings

In your Slurm script:

#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4

srun --cpu-bind=cores ./my_app

The --cpu-bind=cores on the srun line is crucial – it tells Slurm to bind each task to specific cores (note that --cpu-bind is an srun option, not an sbatch directive).
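
To see what Slurm actually did, the verbose modifier on --cpu-bind makes each task report the CPU mask it was bound to:

srun --cpu-bind=verbose,cores ./my_app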

For GPU Jobs

Key principle: Bind CPU processes to the same NUMA node as their GPU.

Check nvidia-smi topo -m to see GPU-CPU affinity, then use MPI mapping to match. For 4 GPUs (2 per socket):

mpirun -np 4 --map-by ppr:2:socket --bind-to socket ./gpu_app

Assign GPUs in your code based on local MPI rank, or set CUDA_VISIBLE_DEVICES appropriately.
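
A common way to do this is a small wrapper script that picks a GPU from the node-local rank before launching the application. A minimal sketch – the script name is hypothetical, and it assumes OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK or Slurm's SLURM_LOCALID is set and one GPU per rank:

#!/bin/bash
# gpu_wrapper.sh (hypothetical name) – map each rank's local ID to one GPU
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
exec "$@"

Launch it in front of the application, e.g. mpirun -np 4 --map-by ppr:2:socket --bind-to socket ./gpu_wrapper.sh ./gpu_app. If device numbering does not line up with socket order on your node, replace the identity mapping with a small lookup table based on nvidia-smi topo -m.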

For Hybrid MPI+OpenMP

Set these environment variables:

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores

Then launch with:

mpirun --map-by socket:PE=8 --bind-to core ./hybrid_app

The PE=8 gives each MPI rank 8 cores for its OpenMP threads (in OpenMPI, PE is a --map-by qualifier, not a --bind-to option).
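
Putting the resource request, the OpenMP settings, and the launch line together, a hedged example for one 2-socket node running 2 ranks of 8 threads each (adjust the counts to your hardware):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores

mpirun --map-by socket:PE=8 --bind-to core ./hybrid_app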

Using numactl Directly

For single-process jobs or testing:

numactl --cpunodebind=0 --membind=0 ./my_app

Binds to NUMA node 0 (first socket).
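
The same idea works for pinning a single-GPU run next to its GPU – bind to whichever NUMA node the topology tools reported for that GPU (node 1 here is illustrative):

# GPU reported on NUMA node 1 by nvidia-smi topo -m
numactl --cpunodebind=1 --membind=1 ./gpu_app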

For memory-bandwidth intensive work:

numactl --interleave=all ./my_app

Distributes memory across all NUMA nodes.
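
Whether interleaving actually helps is workload-dependent, so it is worth timing both policies on your own code before committing to one:

# Local memory on node 0 versus pages spread across all nodes
time numactl --cpunodebind=0 --membind=0 ./my_app
time numactl --cpunodebind=0 --interleave=all ./my_app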

Quick Verification

Check if your binding worked:

numastat -p <pid>

If binding worked, memory should sit mostly on one NUMA node.

ps -eLo pid,psr,comm | grep my_app

The psr column shows which core each thread is currently running on.
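
Two more low-effort checks, using the same <pid> placeholder as above: taskset reports the allowed-CPU list directly, and the kernel exposes the same mask in /proc:

# CPUs the process is allowed to run on
taskset -cp <pid>

# Same information straight from the kernel
grep Cpus_allowed_list /proc/<pid>/status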

Most Common Mistakes to Avoid

  • Not binding at all – the OS will move your processes around
  • Using all cores on one socket for a multi-GPU job when GPUs span both sockets
  • Forgetting --cpu-bind=cores on the srun line in Slurm scripts
  • Not setting OMP_PROC_BIND for hybrid jobs

Bottom Line

For CPU-only MPI jobs:

--map-by socket --bind-to core

For GPU jobs:

--map-by ppr:N:socket

where N matches the number of GPUs per socket. Always check the topology first with

nvidia-smi topo -m

In Slurm: Always add

--cpu-bind=cores

For hybrid: Set

OMP_PROC_BIND=close

and use

--map-by socket:PE=N --bind-to core

Start with these basics and measure performance. Most gains come from just enabling binding – fine-tuning can come later.