Case study: Placement pingpong
Using a GPU node: correct placement of tasks
Modern HPC codes often rely on accelerators (GPUs) to speed up computations. On HPC systems, these accelerators are grouped in specific nodes. Often, these GPU nodes contain multiple GPUs to enable multi-GPU jobs within a single node.
The CPUs in these GPU nodes have multiple cores and multiple NUMA domains. These NUMA domains split the CPU into multiple parts, each part having its own memory.
Running multiple GPUs on one node reduces the need for inter-node communication (which is often slower) but requires attention to CPU topology and NUMA placement to avoid unnecessary cross-domain traffic and bandwidth bottlenecks.
Here, we show why correct CPU-GPU placement matters and provide examples of how to bind tasks to CPUs and GPUs. The measurements and examples are taken from tests on LUMI-G and on Leonardo Booster. They illustrate the effect of CPU-GPU binding on host-device bandwidth and on total runtime. These results are specific to the hardware used for the tests and should not be blindly compared across different systems. We do not endorse one system or vendor over another.
LUMI-G
On LUMI-G, each node has 4 AMD GPUs connected to 4 NUMA domains. The figure below shows the connection between the GPUs and the NUMA domains on LUMI-G:
Overview of a LUMI-G compute node, © LUMI 2025
As you can see, each GPU is connected to a specific NUMA domain. In the figure, an orange package is one MI250X GPU consisting of two GCDs (Graphics Compute Dies). As each GCD works independently, we will treat the GCDs as separate GPUs.
Test Program
By binding each task to a CPU close to its GPU, you can gain performance in your code. We will investigate this behaviour with a small test program. In the test program below, an array is transferred back and forth between RAM (host memory) and GPU memory.
The code is written in C++ and uses the HIP or CUDA API to perform the GPU transfers. HIP is AMD's equivalent of CUDA and also works on NVIDIA GPUs.
The code can be found in the opencode.it4i.eu repository. Below, a simplified version of the code is shown.
#include <hip/hip_runtime.h>
#include <iostream>

int main(int argc, char** argv) {
    int message_size = 50 * 1024 * 1024;  // number of floats: 50 Mi floats * 4 bytes = 200 MiB
    int repeats = 1000;                   // number of data transfers

    // Allocate host and device arrays
    float* host_array = (float*)malloc(message_size * sizeof(float));
    float* device_array;
    hipMalloc((void**)&device_array, message_size * sizeof(float));

    // Fill the host array with some data
    for (int i = 0; i < message_size; i++) {
        host_array[i] = static_cast<float>(i);
    }

    // Perform the data transfers (timing is done in the full version of the code)
    for (int i = 0; i < repeats; i++) {
        hipMemcpy(device_array, host_array, message_size * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(host_array, device_array, message_size * sizeof(float), hipMemcpyDeviceToHost);
    }

    // Free memory
    hipFree(device_array);
    free(host_array);
    return 0;
}
This code is further extended to work with MPI, where each task times the different data transfers. There is also an option to print additional information about the GPUs accessible to the program. We are mainly interested in the bandwidth of the data transfers, so we repeat the transfers several times and calculate the average transfer time and the achieved bandwidth.
Executing the program
The program is started with the srun command on LUMI-G. The program tests the transfers of data from RAM (CPU) memory to GPU memory and back. We thus have three components that can influence the performance:
- the CPU where the data is located,
- the GPU where the data is transferred to,
- the NUMA domain where the data transfer originates from.
Each of these can be controlled with binding options of srun or similar launch commands. By default, the memory is usually allocated close to the CPU where the task is running, i.e. in the same NUMA domain.
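If you want to make this default explicit, the memory placement can also be requested directly. A minimal sketch (here ./app is a placeholder for any executable):
# keep each task's memory allocations in the NUMA domain of the CPUs it runs on
srun --mem-bind=local ./app
# the same idea outside of Slurm, for a single process
numactl --localalloc ./app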
srun provides several options to bind tasks to specific CPUs: you can either list specific CPUs or provide a mask of the CPUs in the system. Here, we will always bind one specific CPU to each task, as this gives the most control over the placement. In hybrid MPI+OpenMP codes, it is better to bind a set of CPUs to each MPI rank, as sketched below.
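As an illustration of the two forms, a sketch with placeholder CPU numbers (not the values used in our tests):
# bind one specific core per task: task i gets the i-th entry of the list
srun -n 4 --cpu-bind=map_cpu:0,1,2,3 ./app
# the same idea with per-task bitmasks; here each task gets 8 cores,
# which is the usual choice for hybrid MPI+OpenMP runs
srun -n 4 --cpus-per-task=8 --cpu-bind=mask_cpu:0xff,0xff00,0xff0000,0xff000000 ./app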
The specific placement of the CPUs on a machine can be investigated with the hwloc-ls command. It is good practice to check this layout before running the program, for example as shown below.
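A minimal sketch of such an inspection, run on a compute node inside a job allocation (assuming hwloc and numactl are installed on the system):
# show the full hardware topology (NUMA domains, caches, cores, PCI devices)
srun -n 1 hwloc-ls
# a more compact overview of the NUMA domains and their cores
srun -n 1 numactl --hardware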
On LUMI-G, each node has 4 NUMA domains, each with 16 CPU cores. Furthermore, each NUMA domain is divided into two sets of 8 cores that share an L3 cache. Each of these 8-core L3 groups is connected to one GCD (GPU), so one NUMA domain is connected to one MI250X package, which contains two GPUs.
We bind a specific CPU to a task with the --cpu-bind option of srun and only allocate one core per task. The GPUs are chosen using the ROCM_VISIBLE_DEVICES environment variable. The ROCM name comes from the AMD ROCm software stack for GPUs, AMD's counterpart to NVIDIA's CUDA.
This environment variable is set in a small wrapper script that is called before the actual program is started. The script allows us to set a different value of the environment variable for each task:
# set the CPU and GPU maps
cpu="49,57,17,25,1,9,33,41"
gpu="0,1,2,3,4,5,6,7"
# generate the GPU selection-script
cat << EOF > select_gpu
#!/bin/bash
GPU_LOC=(${gpu//,/ }) # convert the comma-separated list to a bash-array
export ROCM_VISIBLE_DEVICES=\${GPU_LOC[\$SLURM_LOCALID]} # bind each task to a specific GPU
exec \$*
EOF
# make the selection-script an executable
chmod +x select_gpu
# bind the right cores with "cpu_bind" and call the GPU-placement script first
srun -n $n_tasks --cpu-bind="map_cpu:$cpu" ./select_gpu ./pingpong
The wrapper script first expands the gpu variable into a bash array. Then, based on the local task ID (SLURM_LOCALID), it sets the ROCM_VISIBLE_DEVICES variable to the correct GPU for that task. Finally, it executes the actual program. For CUDA programs on NVIDIA GPUs, the same can be done with the CUDA_VISIBLE_DEVICES variable.
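For reference, a minimal sketch of the NVIDIA variant of the same wrapper (assuming the same gpu list and launch line as above):
# generate the GPU selection-script for NVIDIA GPUs
cat << EOF > select_gpu
#!/bin/bash
GPU_LOC=(${gpu//,/ }) # convert the comma-separated list to a bash-array
export CUDA_VISIBLE_DEVICES=\${GPU_LOC[\$SLURM_LOCALID]} # bind each task to a specific GPU
exec \$*
EOF
chmod +x select_gpu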
Running the tests
We run the program with different combinations of CPU and GPU bindings.
Single task
The examples below will always run on an empty (exclusive) node on LUMI-G. We start with a single task (1 MPI rank) and 1 GPU to measure the maximum performance of the CPU-GPU transfers.
Placement on an empty node
When we run with a single task, we can bind one specific CPU to one specific GPU. For example, we set
cpu="49"
gpu="0"
Running the program gives the following output:
----------------------------------
- Check Placement
- ++++++++++++++++++++++++++++++ -
- GPU 0
- CPU 49
- SIZE 33554432
----------------------------------
//////////////////////////////////
// PING PONG // EPICURE // HIP ///
//////////////////////////////////
++++++++++++++++++++++++++++++++++
+ SETTINGS
+ SIZE: 33554432 +
+ REPS: 100 +
+ WARM: 10 +
+ POST: 1 +
+ INFO: 1 +
+ DATA +
+ TRANSFERING FLOATS OF SIZE 4 +
+ > 134217728.000 Bytes +
+ > 131072.000 KibiBytes +
+ > 128.000 MebiBytes +
+ > 0.125 GibiBytes +
++++++++++++++++++++++++++++++++++
==================================
[Rank 0] (Visible Devices: 1)
[Device 0]
Name: AMD Instinct MI250X
PCI Address: 0:c1:0
==================================
Elapsed times per rank:
[Rank 0]--------------------------
Full Transfer:
mean: 10.65665 ms
std: 0.00155 ms
min: 10.65522 ms
max: 10.66125 ms
bandwidth: 23.45952 GB/s
Host to Device:
mean: 5.30856 ms
std: 0.00123 ms
min: 5.30735 ms
max: 5.31162 ms
bandwidth: 23.54687 GB/s
Device to Host:
mean: 5.34809 ms
std: 0.00045 ms
min: 5.34747 ms
max: 5.34975 ms
bandwidth: 23.37281 GB/s
//////////////////////////////////
Total program time: 1.63923 sec
//////////////////////////////////
The copy from host to device seems to be slightly faster than the copy from device to host, but the difference is minimal. The maximum bandwidth we achieve is roughly 23.5 GiB/s, which is lower than the theoretical maximum of the Infinity Fabric (36 GB/s). We are likely limited by the read speed of the CPU from RAM.
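As a quick sanity check of the reported numbers: 128 MiB (0.125 GiB) transferred in a mean host-to-device time of about 5.31 ms indeed corresponds to roughly 23.5 GiB/s:
# 0.125 GiB in 5.30856 ms
awk 'BEGIN { printf "%.2f GiB/s\n", 0.125 / 5.30856e-3 }'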
When we select a different GPU, e.g. GPU 4, we get as output:
Full Transfer:
mean: 10.65981 ms
std: 0.00050 ms
min: 10.65885 ms
max: 10.66202 ms
bandwidth: 23.45257 GB/s
which is a bit slower, but within the margin of error.
Indeed, the performance is essentially the same for both GPUs. As you can see in the figure below, for a single task on an otherwise empty node, the measured bandwidth on LUMI-G does not depend noticeably on which GPU is paired with which NUMA domain.
Bandwidth for different CPU-GPU combinations on an exclusive node, © EPICURE 2025.
Of course, in a real workload with multiple tasks, the placement will matter more as multiple tasks will compete for the same resources.
Bandwidth and message size
We can also investigate the influence of the size of the data transfers. Let’s first try a smaller data transfer, and start with 1 float (4 B) which gives as output:
Full Transfer:
mean: 0.01424 ms
std: 0.00102 ms
min: 0.01227 ms
max: 0.02086 ms
bandwidth: 0.00052 GB/s
This is very slow compared to the larger data transfers from above.
We can also look at 64 kiB, which gives as output:
Full Transfer:
mean: 0.02300 ms
std: 0.00036 ms
min: 0.02233 ms
max: 0.02382 ms
bandwidth: 5.30840 GB/s
and at the output for 128 kiB:
Full Transfer:
mean: 0.07108 ms
std: 0.00349 ms
min: 0.06867 ms
max: 0.08942 ms
bandwidth: 3.43477 GB/s
The performance for the 64 kiB transfer is much better than for 1 float, but still much worse than for the large 128 MiB transfers from the previous section. For smaller data transfers, the overhead of starting the transfer dominates the total time.
However, when we look at 128 kiB, the performance goes down again. This is likely due to some internal caching mechanism in the GPU or the CPU.
For larger data transfers, up to the full GPU memory of ~64 GiB, the performance remains roughly the same as for the 128 MiB transfer (output for 32 GiB):
Full Transfer:
mean: 2724.69147 ms
std: 0.02552 ms
min: 2724.64716 ms
max: 2724.85842 ms
bandwidth: 23.48890 GB/s
This maximum performance is probably limited by the read speed of the CPU from RAM.
We can summarize this in a single figure:
Bandwidth for different message sizes, © EPICURE 2025.
With its initial steep slope followed by a plateau, the figure resembles a roofline plot. First, the performance increases with increasing data size until it reaches a maximum. After that, the performance remains roughly constant. At 64 GiB, there is a drop as we fill up the GPU memory and the GPU probably has to manage its memory more actively.
Multiple tasks
When we run with multiple tasks, we can bind specific CPUs to specific GPUs.
Correct placement
We start with the correct placement of the GPUs and the CPUs according to the figure at the start of this document. For this, we will bind task 0 to GPU 0 and CPU 49, task 1 to GPU 1 and CPU 57, … :
----------------------------------
- Correct Placement
- ++++++++++++++++++++++++++++++ -
- GPU 0,1,2,3,4,5,6,7
- CPU 49,57,17,25,1,9,33,41
- SIZE 67108864
----------------------------------
When we run the program with this setup, we get the following output:
Elapsed times per rank:
[Rank 0]--------------------------
Full Transfer:
mean: 28.37514 ms
std: 3.49444 ms
min: 21.79861 ms
max: 41.58415 ms
bandwidth: 17.62106 GB/s [...]
[Rank 1]--------------------------
Full Transfer:
mean: 28.03862 ms
std: 2.64210 ms
min: 23.06436 ms
max: 34.62048 ms
bandwidth: 17.83255 GB/s [...]
[Rank 2]--------------------------
Full Transfer:
mean: 27.26915 ms
std: 2.75464 ms
min: 21.56006 ms
max: 34.03537 ms
bandwidth: 18.33574 GB/s [...]
[Rank 3]--------------------------
Full Transfer:
mean: 28.45257 ms
std: 3.35085 ms
min: 22.39304 ms
max: 38.32263 ms
bandwidth: 17.57311 GB/s [...]
[Rank 4]--------------------------
Full Transfer:
mean: 30.58955 ms
std: 0.90828 ms
min: 27.17812 ms
max: 31.42603 ms
bandwidth: 16.34545 GB/s [...]
[Rank 5]--------------------------
Full Transfer:
mean: 30.63158 ms
std: 1.37599 ms
min: 26.13716 ms
max: 32.53253 ms
bandwidth: 16.32302 GB/s [...]
[Rank 6]--------------------------
Full Transfer:
mean: 30.57258 ms
std: 1.50466 ms
min: 26.30613 ms
max: 32.54328 ms
bandwidth: 16.35452 GB/s [...]
[Rank 7]--------------------------
Full Transfer:
mean: 30.61326 ms
std: 2.45280 ms
min: 22.09821 ms
max: 35.88030 ms
bandwidth: 16.33279 GB/s [...]
//////////////////////////////////
Total program time: 16.24633 sec
//////////////////////////////////
The performance is lower than in the single-task case, but still acceptable. Notice that there is some variation in the execution time of each task; sometimes a transfer is much slower. This is likely due to saturation of the shared data connections by the other tasks on the node.
Incorrect placement
When we run with an incorrect placement, e.g., binding task 0 to GPU 0 and CPU 1, etc.:
----------------------------------
- Incorrect Placement
- ++++++++++++++++++++++++++++++ -
- GPU 0,1,2,3,4,5,6,7
- CPU 1,9,17,25,33,41,49,57
- SIZE 67108864
----------------------------------
we get the following output:
[Rank 0]--------------------------
Full Transfer:
mean: 40.18984 ms
std: 2.92538 ms
min: 34.49587 ms
max: 46.41088 ms
bandwidth: 12.44095 GB/s [...]
[Rank 1]--------------------------
Full Transfer:
mean: 64.74852 ms
std: 3.50235 ms
min: 54.39005 ms
max: 72.53853 ms
bandwidth: 7.72218 GB/s [...]
[Rank 2]--------------------------
Full Transfer:
mean: 34.59955 ms
std: 3.55850 ms
min: 25.76855 ms
max: 41.37507 ms
bandwidth: 14.45105 GB/s [...]
[Rank 3]--------------------------
Full Transfer:
mean: 66.97608 ms
std: 3.50355 ms
min: 56.40968 ms
max: 74.42848 ms
bandwidth: 7.46535 GB/s [...]
[Rank 4]--------------------------
Full Transfer:
mean: 52.67028 ms
std: 2.56067 ms
min: 48.75788 ms
max: 56.77458 ms
bandwidth: 9.49302 GB/s [...]
[Rank 5]--------------------------
Full Transfer:
mean: 33.44494 ms
std: 4.02600 ms
min: 27.42386 ms
max: 37.74904 ms
bandwidth: 14.94994 GB/s [...]
[Rank 6]--------------------------
Full Transfer:
mean: 52.69573 ms
std: 2.72296 ms
min: 48.81179 ms
max: 56.68344 ms
bandwidth: 9.48844 GB/s [...]
[Rank 7]--------------------------
Full Transfer:
mean: 33.44124 ms
std: 4.01787 ms
min: 27.42662 ms
max: 37.71745 ms
bandwidth: 14.95160 GB/s [...]
//////////////////////////////////
Total program time: 26.03712 sec
//////////////////////////////////
The performance is significantly lower than with the correct placement: the total program time is roughly 10 seconds (about 60%) longer. This shows the importance of binding the tasks to the correct CPUs and GPUs.
Half usage
We can also bind the tasks to the correct CPUs when only half of the GPUs are used. (We do not encourage using only half of the GPUs on a node, as this wastes resources.) We use the following placement:
- GPU 0,2,4,6
- CPU 49,17,1,33
which gives the following output:
[Rank 0]- [...] bandwidth: 23.44455 GB/s [...]
[Rank 1]- [...] bandwidth: 23.44240 GB/s [...]
[Rank 2]- [...] bandwidth: 23.44236 GB/s [...]
[Rank 3]- [...] bandwidth: 23.44888 GB/s [...]
Total program time: 11.34055 sec
This is similar to the single task case, as there is no competition for resources.
When we run with an incorrect placement:
- GPU 0,2,4,6
- CPU 1,33,49,17
we get the following output:
[Rank 0]- [...] bandwidth: 23.27594 GB/s [...]
[Rank 1]- [...] bandwidth: 20.46180 GB/s [...]
[Rank 2]- [...] bandwidth: 20.51923 GB/s [...]
[Rank 3]- [...] bandwidth: 23.31181 GB/s [...]
Total program time: 12.42984 sec
where we see a small drop in performance for some of the tasks.
Other tests
We can also perform tests where one GPU is used by multiple tasks, or one NUMA domain is used by multiple tasks. In these cases, the performance drops significantly as the resources are shared between the tasks. The limitations originate primarily from the read speed of the CPU from RAM or from the bandwidth of the Infinity Fabric links between the NUMA domains and the GPUs.
Leonardo Booster
Using the CUDA version, the tests are also executed on the Leonardo Booster system at CINECA. Here, each node has 4 NVIDIA A100 GPUs and a CPU with 2 NUMA domains; all GPUs are attached to the first NUMA domain. However, the distance between the NUMA domains and the GPUs is much smaller than on LUMI-G. The 4 GPUs are connected to the first 16 cores of the CPU (NUMA domain 0), while the other 16 cores belong to NUMA domain 1.
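On NVIDIA systems such as Leonardo Booster, the GPU topology and the CPU/NUMA affinity of each GPU can be checked directly on a compute node, for example with:
# print the GPU interconnect matrix together with the CPU and NUMA affinity of each GPU
srun -n 1 nvidia-smi topo -m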
Placement of 1 task
When we run with a single task on Leonardo Booster, we get the following results. For the correct placement (CPU 1, GPU 0) we obtain the output:
----------------------------------
- Check Placement
- ++++++++++++++++++++++++++++++ -
- GPU 0
- CPU 1
- SIZE 33554432
---------------------------------- [...]
Full Transfer: [...]
bandwidth: 9.99197 GB/s [...]
Total program time: 3.12716 sec
For the incorrect placement (CPU 16, GPU 0) we have as output:
----------------------------------
- Check Placement
- ++++++++++++++++++++++++++++++ -
- GPU 0
- CPU 16
- SIZE 33554432
---------------------------------- [...]
Full Transfer: [...]
bandwidth: 9.68334 GB/s [...]
Total program time: 3.21515 sec
The performance difference is small, as the distance between the NUMA domains and the GPUs is small.
Again, we can also investigate the influence of the size of the data transfers:
Bandwidth for different message sizes on Leonardo Booster, © EPICURE 2025.
The performance behaviour is similar to LUMI-G.
Placement of multiple tasks
When we run with multiple tasks on Leonardo Booster, we get as output for the correct placement:
- GPU 0,1,2,3
- CPU 1,2,3,4 [...]
[Rank 0]- [...] bandwidth: 7.41894 GB/s [...]
[Rank 1]- [...] bandwidth: 7.40249 GB/s [...]
[Rank 2]- [...] bandwidth: 7.33413 GB/s [...]
[Rank 3]- [...] bandwidth: 7.43171 GB/s [...]
Total program time: 34.91509 sec
and the output for the incorrect placement:
- GPU 0,1,2,3
- CPU 16,17,18,19 [...]
[Rank 0]- [...] bandwidth: 7.36100 GB/s [...]
[Rank 1]- [...] bandwidth: 7.51615 GB/s [...]
[Rank 2]- [...] bandwidth: 7.53296 GB/s [...]
[Rank 3]- [...] bandwidth: 7.36176 GB/s [...]
Total program time: 34.65113 sec
Here, the performance difference is negligible, as all GPUs are connected to the same NUMA domain. Close placement thus does not matter much on this system.
We can also spread the tasks over both NUMA domains, which gives as output:
- GPU 0,1,2,3
- CPU 1,2,16,17 [...]
[Rank 0]- [...] bandwidth: 8.99438 GB/s [...]
[Rank 1]- [...] bandwidth: 9.04050 GB/s [...]
[Rank 2]- [...] bandwidth: 9.08353 GB/s [...]
[Rank 3]- [...] bandwidth: 9.22851 GB/s [...]
Total program time: 28.85168 sec
which is better than placing all tasks in the same NUMA domain, most likely because using both NUMA domains provides more aggregate memory bandwidth. This shows that even on systems where all GPUs are attached to one NUMA domain, it can still be beneficial to spread the tasks over multiple NUMA domains. Of course, this depends on the specific program and workload.
Conclusion
When using multiple GPUs on a single node on LUMI-G, it is important to bind the tasks to the correct CPUs. Each program has its own characteristics. To get the most out of your codes, you should first test the performance on the specific hardware you are using.
By binding the tasks to the correct CPUs and GPUs, you can achieve optimal performance for your multi-GPU applications. The gain can be significant, especially for large data transfers and computations. Even when the effect seems small, correct binding is good practice, as it avoids potential performance issues in other scenarios.
Furthermore, when writing HPC applications that use GPUs, it is essential to consider the allocation of the memory. Specific tutorials for AMD-HIP and NVIDIA-CUDA are available to help you understand the different memory spaces and how to use them effectively. Here, we showed only a minimal example of host-device transfers, but real applications often have more complex memory access patterns.
Takeaway message: always benchmark your code on the target hardware to ensure optimal performance!
Author: Bert Jorissen (UAntwerpen)