Multi-Node & Multi-GPU Inference with vLLM
Author: Emmanuel Kieffer (LUXPROVIDE)
Learning Objectives
This How-to guide demonstrates how to run a Large Language Model such as Qwen3.5-397B-A17B-GPTQ-Int4 on multiple GPU nodes using vLLM.
You will learn how to:
- Use tensor parallelism and pipeline parallelism to serve a model that does not fit on a single node with 4 GPUs.
- Set up a SLURM launcher
- Start an inference server
- Query the inference server from a remote machine (e.g., your laptop) using SSH port forwarding and curl.
What is vLLM?
vLLM is a performant and versatile inference engine designed for deploying large language models (LLMs) at scale.
Its architecture is built to maximize request throughput while minimizing response latency, allowing it to handle demanding workloads across diverse machine learning models. By leveraging sophisticated strategies like continuous batching and memory-optimized model execution, vLLM keeps resource consumption low even when running very large models. These properties make it particularly well-suited for production-grade deployments where both responsiveness and computational efficiency are paramount.
Beyond raw performance, vLLM is compatible with a broad spectrum of model architectures and frameworks, making it adaptable to numerous use cases — spanning text generation, language understanding, machine translation, and more.
Hugging Face access token
To download the model weights, we need a Hugging Face (HF) access token.
Steps to generate a token:
- If not already done, create a profile on Hugging Face.
- Once your profile is created, go to the “Settings > Access Tokens” page to generate a token.
- Click on "New token" and select Read as the token type. For more information, see the Hugging Face documentation.
- Copy the token and save it in a safe place. You will need it later; one way to store it is sketched below.
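The launcher script below reads the token from the HF_TOKEN environment variable. If you prefer not to hard-code the token in the script, one option (a minimal sketch, assuming a bash shell; the file path is just an example) is to keep it in a private file and source it before submitting the job. Since sbatch typically propagates your submission environment, the hard-coded export HF_TOKEN line in the launcher can then be removed.

# Store the token once, readable only by you
echo 'export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"' > ~/.hf_token
chmod 600 ~/.hf_token
# Load it into the current shell before running sbatch
source ~/.hf_token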
Qwen3.5-397B-A17B-GPTQ-Int4
For this tutorial, we use Qwen3.5-397B-A17B-GPTQ-Int4, a 4-bit GPTQ-quantized variant of the Qwen3.5 model released by the Qwen team under the Apache 2.0 license.
Qwen3.5 is a multimodal causal language model (text + vision) that uses a sparse Mixture-of-Experts (MoE) architecture combined with Gated Delta Networks:
- 397 billion total parameters, but only 17 billion are activated per forward pass (10 routed experts + 1 shared expert out of 512).
- 60 layers, hidden dimension of 4096, with a native context length of 262,144 tokens (extensible up to ~1M tokens via YaRN).
- Supports thinking mode by default (chain-of-thought reasoning before final answers).
The GPTQ-Int4 quantization compresses the model weights to 4-bit integers, drastically reducing memory requirements while preserving strong performance across reasoning, coding, multilingual, and vision-language benchmarks.
Did you know?
What’s Mixture-of-Experts (MoE)?:
Traditional dense models activate all their parameters for every input token. A Mixture-of-Experts model instead routes each token to only a small subset of specialized sub-networks (experts).
- Qwen3.5-397B has 512 experts per MoE layer, but only 10 routed + 1 shared = 11 experts are active at any time.
- This means only ~17B parameters are used per token, even though the full model stores 397B parameters.
- The Benefit: You get the knowledge capacity of a very large model with the inference cost of a much smaller one.
What’s GPTQ?:
GPTQ is a post-training quantization method: it compresses model weights after training is complete by reducing their numerical precision.
- Original weights are typically stored in FP16 (16 bits = 2 bytes per value).
- GPTQ approximates them in Int4 (4 bits = 0.5 bytes per value), a 4× reduction in memory.
- The Trade-off: Significant memory savings with a small, generally acceptable loss in accuracy.
Why Int4 Quantization?:
Neural networks store vast amounts of numbers (weights). Those numbers are usually represented in floating-point formats:
- FP32 → 32 bits (4 bytes) per number (high precision, large memory)
- FP16 → 16 bits (2 bytes) per number
- Int4 → 4 bits (0.5 bytes) per number
The Benefit: Going from FP16 to Int4 reduces memory by 4×, which is critical for fitting very large models onto fewer GPUs.
MoE + Quantization Combined:
Combining MoE with GPTQ-Int4 provides a double advantage:
- MoE means only a fraction of the total parameters are active per inference step, reducing compute.
- Int4 quantization compresses all stored weights (including inactive experts), reducing the overall memory footprint.
- The Result: A 397B-parameter model that requires far less GPU memory and far less compute per token than a comparably-sized dense model.
Estimating the required number of nodes
This How-to guide has been tested on the EuroHPC supercomputer MeluXina. MeluXina’s GPU partition has 4 NVIDIA A100 40GB GPUs per node, i.e., 160GB per node.
We assume the following:
- GPTQ-Int4 represents 0.5 bytes (4 bits) of memory per parameter.
- Qwen3.5-397B-A17B has 397 billion total parameters.
- A node on MeluXina has 4 × A100 40GB → 160GB of GPU memory.
- vLLM defines a gpu_memory_utilization parameter, which defaults to 0.9.
Weight memory estimate:
397 × 10⁹ × 0.5 bytes ≈ 199 GB
- With a gpu_memory_utilization of 0.9, each node provides 160 GB × 0.9 = 144 GB of usable GPU memory.
- We therefore need at least 199 GB / 144 GB ≈ 1.4 → 2 GPU nodes just for the model weights (a quick command-line version of this estimate follows the notes below).
Please note that:
- Additional memory is required for the KV cache, activations, and framework overhead.
- Memory utilization is not exactly balanced across all GPUs when combining tensor parallelism and pipeline parallelism.
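This back-of-the-envelope estimate can also be reproduced on the command line. The sketch below (assuming the bc calculator is available) simply re-derives the numbers above; the inputs are the assumptions listed in this section, not values queried from the cluster.

# Weight memory: 397e9 parameters × 0.5 bytes/parameter (Int4) ≈ 199 GB
# Usable memory per node: 4 GPUs × 40 GB × 0.9 (gpu_memory_utilization) = 144 GB
echo "scale=3; (397 * 0.5) / (4 * 40 * 0.9)" | bc   # ≈ 1.4 → round up to 2 nodes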
Preparing the SLURM launcher script
To avoid installing the vLLM inference server and all its dependencies in a Python virtual environment, we pull the official container image using Apptainer. Apptainer creates a SIF image in the directory from which the pull command is run.
module load Apptainer
apptainer pull docker://vllm/vllm-openai:latest
ℹ️ Apptainer as a containerization solution for HPC
Unlike Docker, which requires root privileges and a persistent daemon, Apptainer (formerly Singularity) is designed from the ground up for multi-user, shared HPC environments. Key advantages include:
- No root required – containers run with the user’s own permissions, preserving the security model of shared clusters.
- Native integration with HPC schedulers – works seamlessly with SLURM, MPI, and high-speed interconnects (InfiniBand, GPUDirect RDMA).
- Single-file SIF images – immutable, portable, and easy to manage on parallel file systems without layered storage backends.
- Direct GPU support – GPU drivers from the host are automatically exposed inside the container (e.g., via --nvccli or --nv); a quick check is shown after this list.
- Docker compatibility – can pull and convert Docker images (apptainer pull docker://...), giving access to the entire Docker Hub ecosystem.
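Before preparing the launcher, you can optionally verify that the node's GPUs are visible inside the pulled image. A minimal check (assuming the SIF image sits in the current directory and that you are on a GPU node) looks like this:

module load Apptainer
# nvidia-smi from inside the container should list the node's 4 A100 GPUs
apptainer exec --nv vllm-latest.sif nvidia-smi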
We now prepare a multi-node launcher that:
- Defines all the environment variables needed by Apptainer and vLLM.
- Starts a vLLM server on the head node and a vLLM worker on the second node using the distributed backend mp.
- Uses environment variables for the HF token, cache, and model name.
Create the file named launcher_vllm_multinode_mp.sh and paste the following script. Replace hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx with a valid HF token.
#!/bin/bash -l
#SBATCH -A <ACCOUNT>
#SBATCH -q <qos>
#SBATCH -p gpu
#SBATCH -t 48:00:00
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --gpus-per-task=4
#SBATCH --error="vllm-mp-%j.err"
#SBATCH --output="vllm-mp-%j.out"
module load Apptainer
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Make sure the path to the SIF image is correct
# Here, the SIF image is in the same directory as this script
export SIF_IMAGE="vllm-latest.sif"
export APPTAINER_ARGS="--nvccli --env HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}"
# Make sure you have been granted access to the model
export HF_MODEL="Qwen/Qwen3.5-397B-A17B-GPTQ-Int4"
export HEAD_HOSTNAME="$(hostname)"
export HEAD_IPADDRESS="$(hostname --ip-address)"
echo "HEAD NODE: ${HEAD_HOSTNAME}"
echo "IP ADDRESS: ${HEAD_IPADDRESS}"
export TENSOR_PARALLEL_SIZE=4 # Set it to the number of GPUs per node
export PIPELINE_PARALLEL_SIZE=${SLURM_NNODES} # Set it to the number of allocated GPU nodes
export CMD_BASE="vllm serve ${HF_MODEL} --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE} --nnodes ${SLURM_NNODES} --master-addr ${HEAD_IPADDRESS}"
export MODEL_PARAMS="--max-model-len 262144 --reasoning-parser qwen3 --quantization moe_wna16 --enable-chunked-prefill"
# This script will be executed on each node to start vLLM
cat << 'EOF' > run_cluster_mp.sh
#!/bin/bash
echo "Running on node: $(hostname)"
IPADDRESS="$(hostname --ip-address)"
if [[ "${IPADDRESS}" == "${HEAD_IPADDRESS}" ]]
then
    apptainer exec ${APPTAINER_ARGS} --env VLLM_HOST_IP=${IPADDRESS} \
        ${SIF_IMAGE} ${CMD_HEAD} --node-rank ${SLURM_NODEID}
else
    apptainer exec ${APPTAINER_ARGS} --env VLLM_HOST_IP=${IPADDRESS} \
        ${SIF_IMAGE} ${CMD_WORKER} --node-rank ${SLURM_NODEID}
fi
EOF
chmod +x run_cluster_mp.sh
# Command to start the head node
export CMD_HEAD="${CMD_BASE} ${MODEL_PARAMS}"
# Command to start workers
export CMD_WORKER="${CMD_BASE} ${MODEL_PARAMS} --headless"
echo "Starting vLLM inference with multiprocessing distributed backend"
# Using the srun option --label (-l), which prepends the task number to lines of stdout/err.
srun -l -N ${SLURM_NNODES} --ntasks-per-node ${SLURM_NTASKS_PER_NODE} -c ${SLURM_CPUS_PER_TASK} ./run_cluster_mp.sh
Submit the job:
sbatch launcher_vllm_multinode_mp.sh
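You can then follow the job with standard SLURM commands (generic SLURM tooling, not specific to this launcher). Replace <JOB ID> with the job ID printed by sbatch:

squeue -u $USER                                  # Job state and allocated node list
sacct -j <JOB ID> --format=JobID,State,Elapsed   # Accounting view once the job has started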
ℹ️ Using InfiniBand
To verify that your InfiniBand adapter is leveraging GPUDirect RDMA, launch vLLM with verbose NCCL logging enabled: NCCL_DEBUG=TRACE vllm serve ... . Then inspect the logs for the NCCL version and the transport layer in use.
- If the logs contain [send] via NET/IB/GDRDMA, NCCL is communicating over InfiniBand with GPUDirect RDMA — the optimal path for cross-node transfers.
- If instead you see [send] via NET/Socket, NCCL has fallen back to plain TCP sockets, which is significantly slower and not recommended for cross-node tensor parallelism.
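With NCCL_DEBUG enabled, a quick way to see which transport was selected is to grep the job's output files (a small sketch; the file names follow the --output/--error patterns defined in the launcher):

# Replace <JOB ID> with your actual job ID
grep -E "via NET/(IB|Socket)" vllm-mp-<JOB ID>.{out,err} | head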
Checking That the Server Is Running
Loading the model into vLLM may take a while. You can monitor the progress by following the generated output/error files: vllm-mp-<JOB ID>.{out,err}. Make sure to replace <JOB ID> with your actual job ID.
tail -f vllm-mp-<JOB ID>.{out,err} # Follow the latest output
Once the model is loaded, you should see the following output in tail -f vllm-mp-<JOB ID>.out:
0: (APIServer pid=24262) INFO 04-07 11:27:38 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:37] Available routes are:
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /docs, Methods: HEAD, GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /tokenize, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /detokenize, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /load, Methods: GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /version, Methods: GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /health, Methods: GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /metrics, Methods: GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/models, Methods: GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /ping, Methods: GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /ping, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /invocations, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/responses, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/completions, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/messages, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
0: (APIServer pid=24262) INFO 04-07 11:27:38 [launcher.py:46] Route: /v1/completions/render, Methods: POST
These lines confirm that the server is up and waiting to process incoming requests.
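As an optional sanity check before setting up port forwarding, you can poll the /health and /v1/models routes listed above directly from inside the cluster (e.g., from a login node), assuming the compute node's port 8000 is reachable on the internal network. Here <remote_node_ip> is the head-node IP printed by the launcher ("IP ADDRESS: ..."):

# Expect HTTP 200 once the model is fully loaded
curl -s -o /dev/null -w "%{http_code}\n" http://<remote_node_ip>:8000/health
# List the model(s) currently being served
curl -s http://<remote_node_ip>:8000/v1/models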
Querying the Inference Server with curl
As you can see, the server is listening for incoming connections on port 8000. However, to access it from your local machine, you need to create an SSH tunnel. Replace [remote_node_ip] with the head-node IP printed by the launcher.
ssh -p [port] [user]@[cluster] -N -L [local_port]:[remote_node_ip]:[remote_port]
# Let's forward the local port 8000 to port 8000
ssh -p [port] [user]@[cluster] -N -L 8000:[remote_node_ip]:8000
Once the SSH tunnel is active, you can access the model at the following URL on your local machine: http://localhost:8000. To test that the inference server is working, you can use the following curl command:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Qwen/Qwen3.5-397B-A17B-GPTQ-Int4",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
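For a quick text-only test (no image input), a minimal request against the same /v1/chat/completions endpoint can look like the sketch below; the prompt is just an example:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen3.5-397B-A17B-GPTQ-Int4",
    "messages": [
      {"role": "user", "content": "Explain pipeline parallelism in two sentences."}
    ],
    "max_tokens": 256
  }'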