Enhancing AI training efficiency and scalability to train EU institutional LLMs for the public sector
- Name: Enhancing AI training efficiency and scalability to train EU institutional LLMs for the public sector
- EuroHPC machine used: Leonardo
- Topic: Information engineering
The European Commission’s DG Translation is developing large language models (LLMs) for generative artificial intelligence (AI) that are tailored to the needs of the European Institutions and the public sector. In particular, this project aimed to create LLMs that are better adapted, more controllable, and more respectful of European values, with better coverage of all EU official languages. To achieve this, the project leveraged high-quality internal data from the European Institutions and applied continued pre-training to existing open-source LLMs (Mixtral-8x7B and Mixtral-8x22B). The goal was to find configurations that improve infrastructure utilization, enable compute- and energy-efficient continued pre-training, and increase multilingual model performance while limiting catastrophic forgetting. Ultimately, the project aimed to provide a basis for more high-quality, linguistically diverse, and safe LLMs trained on institutional data.
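Finding a parallelization configuration that fits a model of this size on the available GPUs is one part of "better infrastructure utilization". As a rough illustration (not the project's actual method or numbers), the model-state memory per GPU can be estimated from the parameter count and the degree of tensor, pipeline, and data-parallel sharding; all constants below are assumptions:

```python
# Rough per-GPU memory estimate for continued pre-training of Mixtral-8x7B
# under 3D parallelism. All numbers are illustrative assumptions, not
# measurements from the project: ~46.7e9 parameters, bf16 weights and
# gradients (2 bytes each) plus fp32 Adam moments (8 bytes), ignoring
# activation memory and framework overhead.

PARAMS = 46.7e9               # approximate parameter count of Mixtral-8x7B
BYTES_PER_PARAM = 2 + 2 + 8   # bf16 weights + bf16 grads + fp32 Adam states

def per_gpu_gib(tensor_parallel: int, pipeline_parallel: int,
                zero_data_parallel: int = 1) -> float:
    """Model-state memory per GPU in GiB, assuming states shard evenly
    across tensor-, pipeline-, and (ZeRO-style) data-parallel ranks."""
    shards = tensor_parallel * pipeline_parallel * zero_data_parallel
    return PARAMS * BYTES_PER_PARAM / shards / 2**30

# Leonardo's booster nodes carry 4 A100 GPUs with 64 GB each, so a
# configuration is only viable if the estimate (plus activations and
# overhead) stays well below 64 GiB per GPU.
for tp, pp in [(1, 1), (4, 4), (4, 8)]:
    print(f"TP={tp} PP={pp}: {per_gpu_gib(tp, pp):.1f} GiB per GPU")
```

Back-of-the-envelope checks like this narrow the search space before expensive empirical testing of parallelization parameters on the machine itself.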
Training these LLMs required significant processing power, which the project obtained from HPC resources. To scale the training across multiple GPU nodes, several parameters had to be tuned:
“However, our team, while experienced in developing and training AI models, lacked expertise in optimizing algorithms for large-scale HPC systems like the Leonardo supercomputer. EPICURE’s expert assistance was crucial in optimizing performance, customizing software settings, and identifying the best setup. Their help addressed knowledge gaps in intra-node communication patterns, NCCL, low-level profiling, and Leonardo’s specific design, and measurably reduced computation time and resource consumption.
“Overall, the support we received from EPICURE brought considerable improvements in training efficiency, achieved through optimized configurations and parallelization parameter testing. Specifically, their guidance helped us reduce network congestion, improve training speed, and resolve debugging issues with PyTorch and the container setup. By providing hands-on assistance and project-specific expertise, EPICURE enabled us to overcome key challenges and make the most of the Leonardo system’s capabilities. These improvements saved time, compute, and energy.”
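The kind of multi-node launcher and NCCL configuration tuned during this support can be sketched as a SLURM batch script. Everything here is illustrative: the partition name, node count, NCCL settings, interface name, and the `train.py` entry point are assumptions, not the project's actual setup.

```shell
#!/bin/bash
#SBATCH --job-name=mixtral-cpt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node
#SBATCH --gres=gpu:4               # Leonardo booster nodes expose 4 A100 GPUs
#SBATCH --partition=boost_usr_prod # hypothetical partition choice

# Illustrative NCCL/debugging settings of the kind tuned with EPICURE;
# the exact values used on Leonardo are not part of this write-up.
export NCCL_DEBUG=WARN             # raise to INFO when profiling communication
export NCCL_SOCKET_IFNAME=ib0      # assumed InfiniBand interface name
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # fail fast on collective errors

# Rank 0's host coordinates the rendezvous for all nodes.
MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=4 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_NODE}:29500" \
  train.py --config continued_pretraining.yaml  # hypothetical entry point
```

Settings like these are exactly where machine-specific expertise pays off: the right interface pinning and rendezvous setup determine whether NCCL traffic actually uses the high-bandwidth interconnect.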