Scaling Deep Neural Networks for Public Safety: Fire, Flood etc. with HPC
- Name: Scaling Deep Neural Networks for Public Safety: Fire, Flood etc. with HPC
- EuroHPC machine used: MareNostrum 5
- Topic: Computer and information sciences
Overview of the project
This project focuses on developing and scaling deep neural network models for public safety applications such as fire, smoke, and flood detection. The work involves training advanced computer vision architectures and segmentation models, which require substantial computational resources due to their depth, parameter count, and large training datasets. Using EuroHPC systems, the project aims to overcome GPU memory limitations and optimize the training pipeline through techniques such as mixed-precision training, gradient accumulation, activation checkpointing, and distributed strategies such as Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). These optimizations enable efficient and scalable training of large models, ultimately improving the reliability and accuracy of hazard detection systems for real-world safety and disaster response scenarios.
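Of the techniques listed above, gradient accumulation is the simplest to illustrate in isolation. The sketch below is not the project's training code. It is a minimal, framework-free illustration on a toy one-dimensional linear model, and the function names (`grad_mse`, `accumulate_gradients`) and the data are hypothetical. It shows why splitting a batch into smaller micro-batches, which lowers peak memory, can still produce the same gradient as the full batch, provided each micro-batch gradient is weighted by its share of the batch.

```python
# Gradient accumulation sketch for a toy model y = w*x + b with MSE loss.
# Illustrative only: names and data are hypothetical, not from the project.


def grad_mse(w, b, xs, ys):
    """Return (dL/dw, dL/db) of the mean squared error over one batch."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def accumulate_gradients(w, b, xs, ys, micro_batch_size):
    """Process the batch in micro-batches and accumulate their gradients.

    Each micro-batch gradient is weighted by (micro-batch size / batch size),
    so the accumulated result matches the full-batch mean gradient.
    """
    total = len(xs)
    acc_w, acc_b = 0.0, 0.0
    for start in range(0, total, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        gw, gb = grad_mse(w, b, mx, my)
        weight = len(mx) / total
        acc_w += gw * weight
        acc_b += gb * weight
    return acc_w, acc_b


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
w, b = 0.5, 0.0

full = grad_mse(w, b, xs, ys)
accumulated = accumulate_gradients(w, b, xs, ys, micro_batch_size=2)
print("full batch gradient:  ", full)
print("accumulated gradient: ", accumulated)
```

In a deep learning framework the same idea is usually expressed by dividing each micro-batch loss by the number of accumulation steps, calling the backward pass for every micro-batch, and stepping the optimizer only once per accumulated batch.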
How did EPICURE support the project and what were the benefits of the support?
“We requested EPICURE support to overcome challenges related to scaling and optimizing large deep learning models on EuroHPC systems. Our main difficulties involved GPU memory limitations, out-of-memory (OOM) errors, multi-node Distributed Data Parallel (DDP) training, and communication issues when running YOLO at scale.
EPICURE provided extensive technical assistance by resolving OOM issues, implementing gradient accumulation, enabling multi-node, multi-GPU training, and supplying customized SLURM and Python scripts. They also assisted with performance profiling, enhanced logging for GPU metrics and synchronization, and conducted a detailed investigation into NCCL. Their support significantly improved training stability, resource utilization, and our understanding of distributed deep learning on HPC systems.
The support from EPICURE brought major improvements in both efficiency and reliability when running large-scale deep learning workloads on EuroHPC systems. Their guidance helped us resolve memory bottlenecks, stabilize multi-node training, and optimize GPU usage, which together reduced troubleshooting time and improved overall training throughput. The customized scripts, profiling tools, and enhanced logging provided by EPICURE enabled us to diagnose issues much faster and avoid inefficient resource usage. Their insights into NCCL behaviour and scaling limitations prevented unnecessary compute expenditure and clarified the practical boundaries of multi-node object detection training. Overall, the collaboration resulted in time savings, more efficient use of HPC resources, and a significantly stronger understanding of distributed AI training.” – Jaime Santos, CEO & Co-founder, AiTecServ
Additional references
Cheriyan, J. and Santos, J. (2025). ‘Scalable Detection of Environmental Events on EuroHPC MeluXina’, Procedia Computer Science, 267, pp. 246–255. doi:10.1016/j.procs.2025.08.251.
Project website:
https://www.aitecserv.pt/solutions/aiaction
Contact the project:
- AiTecServ (services@aitecserv.pt)