May 27, 2025

MultiSynt: Advancing Multilingual AI Through Open Synthetic Training Data

The EuroHPC AI Factory Large Scale call has allocated 3 million GPU hours on the Leonardo Booster (CINECA, Italy) to develop “MultiSynt: an open multilingual synthetic dataset for LLM pre-training”. This initiative, led by Prompsit Language Engineering, brings together the two major European open AI initiatives, EuroLLM and OpenEuroLLM, in a collaborative effort to address a fundamental gap in multilingual LLM development.

Project Overview

The idea of building a multilingual synthetic dataset with a particular focus on official EU languages originated from ellamind, a partner in OpenEuroLLM. MultiSynt builds upon the methodology established by Nemotron-CC for English, extending that approach to a multilingual setting.

The work will run for six months on the Leonardo Booster module, a system specifically designed for computationally demanding tasks that require rapid time-to-solution while maintaining energy efficiency.

“This is an important step in securing large enough computing power to build an essential asset for the OpenEuroLLM project. I am also glad that this has been done in collaboration with the experienced team from the EuroLLM project. The goal of this subproject is to explore multilingual synthetic data creation and evaluate its use in order to reach a higher common goal: building high-quality multilingual LLMs for all European languages and beyond,” notes Jan Hajic, Charles University, coordinator of the OpenEuroLLM project.

Addressing the Multilingual Data Gap

Developing effective multilingual foundation models requires diverse, high-quality pre-training data across all target languages. While English-language resources are plentiful, most European languages face significant shortages in both the quantity and quality of available open pre-training data.

Current data collection efforts alone cannot adequately address this scarcity, which limits proper representation of many languages in multilingual models. Even languages with relatively good resource availability face gaps in content diversity and quality, creating obstacles for developing effective cross-lingual models.

Without addressing these dataset limitations, the risk remains high of producing underperforming models that lack the capabilities needed for effective downstream applications, and of failing to extend the multilingual capabilities of existing models across all European languages, including the official EU languages.

The MultiSynt Approach

MultiSynt directly supports the broader EuroLLM and OpenEuroLLM initiatives by targeting a critical bottleneck in multilingual LLM development: the availability of high-quality pre-training data. The core innovation of MultiSynt lies in creating the first comprehensive multilingual synthetic pre-training dataset. Until now, synthetic data generation for LLMs has been primarily limited to English. The MultiSynt approach leverages generative models to enhance existing content, focusing on improvements in the following areas (a simplified sketch of the rewriting step follows the list):

  • Language representation across all EU languages and beyond
  • Domain coverage and sufficient volume to ensure the breadth of knowledge needed for effective model training
  • Content diversity to build robust generalization capabilities
  • Data quality improvements through targeted generation
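
At its core, this follows a rewriting recipe in the spirit of Nemotron-CC: a generative model rephrases existing web text into cleaner, more diverse text in the same language, which is then used for pre-training. The Python sketch below is a minimal illustration of that idea only, assuming a generic open instruction-tuned model from the Hugging Face hub; the model choice, prompt, and decoding parameters are placeholder assumptions and do not describe the actual MultiSynt pipeline.

    # Illustrative sketch only: a Nemotron-CC-style rephrasing step, where an
    # open instruction-tuned model rewrites noisy web text into cleaner text
    # in the same language. Model name, prompt, and decoding settings are
    # placeholder assumptions, not the MultiSynt pipeline.
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",  # any open instruct model could be used
    )

    def rephrase(document: str, language: str) -> str:
        """Rewrite a document in the same language, keeping the facts while
        improving fluency, structure, and overall quality."""
        prompt = (
            f"Rewrite the following {language} text so that it is clear, well "
            f"structured, and factually unchanged. Answer only in {language}.\n\n"
            f"{document}\n\nRewritten text:"
        )
        result = generator(
            prompt,
            max_new_tokens=512,
            do_sample=False,          # deterministic rewrite; sampling would add diversity
            return_full_text=False,   # return only the completion, not the prompt
        )
        return result[0]["generated_text"].strip()

    # Example: turn a noisy crawled snippet into cleaner training text.
    print(rephrase("klicken sie hier!! berlin ist haupstadt von deutschland seit 1990 ...", "German"))

In a full pipeline, a step like this would be applied at scale to existing corpora in each target language and combined with quality filtering of the generated text.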


Most importantly, all resources will be made openly available to researchers and developers.

This represents a substantial advancement beyond current approaches, as no large-scale attempts exist to create such multilingual synthetic resources. The impact will extend to the entire LLM community by democratizing access to high-quality training data for languages currently underrepresented in AI development.

“MultiSynt enables the development of truly multilingual models that perform consistently across linguistic boundaries. This opens new possibilities for research and applications previously constrained by language limitations,” says Gema Ramírez Sánchez, CEO at Prompsit Language Engineering.

The project directly supports European AI strategy by fostering capabilities in multilingual AI and providing crucial resources for European researchers and SMEs. Through open availability of the resulting dataset, MultiSynt aims to improve access to quality pre-training resources for all European languages, advancing the state of open and transparent AI development within Europe.