Enhancing EUBERT: Retraining the European Parliament’s LLM with Contemporary Parliamentary Data

  • Name: Enhancing EUBERT: Retraining the European Parliament’s LLM with Contemporary Parliamentary Data
  • EuroHPC machine used: MeluXina
  • Topic: Computer and information sciences

Overview of the project

EUBERT is a pretrained, uncased BERT model trained on a vast corpus of documents registered by the European Publications Office. These documents span the last 30 years, providing a comprehensive dataset that covers a wide range of topics and domains. EUBERT is designed as a versatile language model that can be fine-tuned for various natural language processing tasks, making it a valuable resource across many applications.
Building on EUBERT, this project developed a large-scale multi-label classification model aligned with EuroVoc, a hierarchical thesaurus comprising more than 7,000 concepts used to index and organize EU documents. The resulting models enable improved semantic understanding, classification, and retrieval of European legislative and policy texts.
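Because a document can be indexed under several of the 7,000+ EuroVoc concepts at once, the task is multi-label rather than multi-class: the model emits one logit per concept, and each logit is passed through a sigmoid and thresholded independently, so labels do not compete as they would under a softmax. A minimal sketch of that decision step in plain Python (the threshold value and the toy logits are illustrative assumptions, not the project's actual configuration):

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_concepts(logits, threshold=0.5):
    """Multi-label decision: keep every concept index whose sigmoid
    probability clears the threshold, independently of the others."""
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]

# Toy logits for three concepts; a real EuroVoc head would emit ~7,000.
print(predict_concepts([2.0, -1.0, 0.5]))  # → [0, 2]
```

During training, such a head is typically optimized with a per-label binary cross-entropy loss, which is what makes the independent sigmoid-per-concept decision at inference time consistent with the objective.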


How did EPICURE support the project and what were the benefits of the support?

“We requested EPICURE support to address performance, scalability, and operational challenges encountered when retraining large Transformer-based models on the MeluXina infrastructure. In particular, we asked for guidance on efficient multi-GPU execution, memory management, and job orchestration within the Slurm scheduler. The support team provided input on GPU utilization strategies, distributed training configurations, and profiling approaches, and supported us in setting up and refining batch job submission scripts. This assistance helped us adapt our training workflow to the specific constraints and operational model of the EuroHPC environment and enabled more stable execution of long-running GPU workloads.

The support from EPICURE contributed to a structured and reproducible training workflow and improved utilization of the allocated GPU resources. By clarifying best practices for job scheduling, parallel execution, and resource configuration, the support reduced the amount of trial-and-error required when running large-scale training jobs on MeluXina. While the overall training process remained computationally demanding, the guidance we received helped ensure stable execution, better resource awareness, and a clearer understanding of performance characteristics while retraining the EUBERT and EuroVoc models on EuroHPC systems.” – Andreas Papagiannis
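The job-orchestration guidance mentioned above typically revolves around mapping Slurm's per-task environment variables onto a distributed-training configuration, so that each process knows its rank, the world size, and the rendezvous address. A hedged sketch of that mapping, assuming the standard SLURM_* variables and single-process fallbacks (the project's actual launcher, backend, and script contents are not specified in this article):

```python
import os

def slurm_dist_config(default_port: str = "29500") -> dict:
    """Derive a distributed-training configuration from the Slurm
    environment, falling back to single-process defaults so the same
    script also runs outside the scheduler."""
    return {
        "rank": int(os.environ.get("SLURM_PROCID", "0")),
        "world_size": int(os.environ.get("SLURM_NTASKS", "1")),
        "local_rank": int(os.environ.get("SLURM_LOCALID", "0")),
        # In a real multi-node job the master address is usually the first
        # host in SLURM_JOB_NODELIST; "localhost" is a single-node fallback.
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": os.environ.get("MASTER_PORT", default_port),
    }

cfg = slurm_dist_config()
print(f"rank {cfg['rank']} of {cfg['world_size']} on {cfg['master_addr']}")
```

A launcher such as `torchrun` or Slurm's `srun` would start one such process per GPU; the configuration above is then passed to the framework's process-group initialization.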

Contact the project:

  • Andreas Papagiannis (andreas.papagiannis@europarl.europa.eu)
  • Charalampos Moschopoulos (charalampos.moschopoulos@europarl.europa.eu)