LLM using llama.cpp

Author: Alícia Oliveira (INESC TEC, Deucalion)

How to run an LLM using llama.cpp on the Deucalion ARM partition

Large Language Models (LLMs) are being used in an increasingly diverse range of environments, with a growing emphasis on inference and on making it efficient and scalable. This guide explains how to run Meta Llama 3.1 8B using llama.cpp on the Deucalion ARM partition, highlighting why ARM is an interesting choice for LLM inference.


1) Deucalion and the ARM partition

Deucalion is a supercomputer that features an ARM-based partition, utilizing Fujitsu A64FX ARM (Advanced RISC Machines) processors. This partition is designed to deliver high performance with improved energy efficiency.

The ARM partition is particularly well suited for:

  • Inference-focused workloads
  • Scientific simulations
  • Large-scale data processing

The ARM architecture follows a reduced instruction set computing (RISC) design, which enables simpler and more efficient processor implementations, resulting in lower power consumption and reduced heat generation.


2) The importance of inference

As LLMs are increasingly deployed and used, the quality of the service they provide is largely determined by inference performance, efficiency, and resource utilization.

Optimizing these factors simultaneously is non-trivial. Inference behavior depends on several conditions, including the hardware and software stack and whether the model is served offline (batch processing) or in real time (interactive use). As a result, effective LLM inference requires balancing trade-offs between performance and energy consumption.

This emphasis on inference efficiency makes lightweight inference engines such as llama.cpp, combined with energy-efficient architectures like ARM, a well-suited option for LLM deployment.


3) Running Llama 3.1 8B on a single ARM node

The following steps demonstrate how to run Meta Llama 3.1 8B using llama.cpp on the Deucalion ARM partition.

3.1 Create an interactive ARM session

srun -p dev-arm -A <account_name> -t 4:00:00 --pty bash
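
These options request a four-hour interactive allocation (-t 4:00:00) on the dev-arm partition (-p) under your project account (-A), with --pty bash attaching a shell on the allocated compute node. If you want to inspect the partition's availability before requesting the session, sinfo is a standard Slurm command for that:

sinfo -p dev-arm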

3.2 Create a working directory

mkdir llama-models
cd llama-models

3.3 Load the llama.cpp module

module load llama.cpp/20251110-foss-2023a
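
A quick sanity check confirms that the module is loaded and that the llama.cpp binaries used in the following steps are on the PATH; module list and which are standard commands, so nothing here is specific to this module:

module list
which llama-cli llama-server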

3.4 Download the GGUF model from Hugging Face

huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "*Q4_K_M*.gguf" \
  --local-dir ./models/llama3.1-8b \
  --local-dir-use-symlinks False

Note:
If the download fails, ensure that huggingface-cli is available on the system and that you are logged in (huggingface-cli login).
Access to Meta Llama models may require accepting the corresponding license on Hugging Face.
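
Once the download completes, the quantized GGUF file should appear in the target directory. A quick listing confirms the exact file name and size used in the commands below (a Q4_K_M quantization of an 8B model is roughly 5 GB):

ls -lh ./models/llama3.1-8b/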

3.5 Run the model

You can run Meta Llama 3.1 8B either in an interactive terminal session or by exposing it through an HTTP server for browser-based interaction.

3.5.1 Interactive CLI

llama-cli -m models/llama3.1-8b/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

Once the model is loaded, llama-cli starts an interactive inference session directly in the terminal, allowing you to enter prompts and receive responses in real time.
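
In addition to the fully interactive mode, a prompt can be passed directly on the command line, which is handy for quick tests. The sketch below uses common llama.cpp options (-p for the prompt, -n for the maximum number of tokens to generate, -t for the number of CPU threads, -c for the context size); the thread and context values are illustrative and should be adjusted to the node and workload, and depending on the llama.cpp version an extra flag (such as -no-cnv) may be needed to force one-shot, non-conversational output:

llama-cli -m models/llama3.1-8b/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p "Explain the RISC design philosophy in two sentences." \
  -n 128 -t 48 -c 4096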

3.5.2 Using an HTTP server

Alternatively, llama.cpp provides a built-in HTTP server that exposes the model through a web interface.

Start the server on the compute node:

llama-server -m models/llama3.1-8b/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080

This launches the model as a service listening on port 8080.

From a local machine, create an SSH tunnel to forward the port from the compute node:

ssh <your_username>@login.deucalion.macc.fccn.pt -L 8080:<compute_node>:8080

Replace <your_username> with your Deucalion username and <compute_node> with the allocated ARM compute node (for example, cna0001).
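
If you are unsure which node was allocated, run hostname inside the interactive session, or list your running jobs from a login node; both are standard commands:

hostname
squeue -u $USER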

Once the tunnel is active, open a web browser and navigate to:

http://localhost:8080

You will be able to interact with the model through the browser-based interface.

Figure: Web interface served by llama-server, allowing interactive inference through a browser.
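
Besides the web interface, llama-server also exposes an HTTP API over the same port, so the tunnel can be used to query the model programmatically as well. The example below assumes the build provides the OpenAI-compatible chat completions endpoint, which recent llama.cpp versions do; the prompt and token limit are arbitrary illustration values:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Summarize what RISC means in one sentence."}],
        "max_tokens": 128
      }'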

In addition to llama-cli and llama-server, llama.cpp also provides other tools such as llama-bench and llama-perplexity. A complete overview of all available tools and capabilities can be found in the llama.cpp documentation: https://github.com/ggml-org/llama.cpp/blob/master/README.md
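
For example, llama-bench can measure prompt-processing and token-generation throughput on the ARM node with the same GGUF file; run without further options it falls back to its default benchmark settings, so the only assumption here is the model path from the earlier steps:

llama-bench -m models/llama3.1-8b/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf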


Conclusion

Combining llama.cpp with ARM-based partitions provides an effective and energy-efficient approach for LLM inference.
As inference becomes the dominant AI workload, architectures that balance strong performance with lower energy impact can be a valuable option, particularly for smaller AI models.