Through a collaboration between the HPLT¹ and OpenEuroLLM² initiatives, we announce the release of 38 monolingual reference models with 2.15B parameters each. Full models and intermediate checkpoints saved every 1,000 steps can be downloaded from the HPLT collection “HPLT 2.0 Monolingual reference models” on HuggingFace.³
Trained on the HPLT v2 cleaned dataset⁴, these models cover a wide range of languages, including official European Union languages as well as several additional languages of interest. Each model has 2.15B parameters, follows the LLaMA architecture, uses the Gemma-3-27B tokenizer, and was trained on 100B tokens. Training was carried out on the LUMI supercomputer, with a compute budget of approximately 3,000 GPU hours per model on AMD MI250X GPUs.
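As a quick illustration of how the released artifacts can be used, the sketch below loads one of the models with the HuggingFace transformers library. The repository name and the revision naming for intermediate checkpoints are assumptions based on common HuggingFace conventions; consult the collection page for the exact identifiers.

```python
# Minimal sketch of loading a reference model with transformers.
# The repo id and the "step..." revision name are assumptions;
# see the HPLT collection page for the exact identifiers.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "HPLT/hplt2c_eng_checkpoints"  # hypothetical repository name

# Final model from the default branch.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# An intermediate checkpoint, selected via a revision;
# checkpoints are released every 1,000 training steps.
early = AutoModelForCausalLM.from_pretrained(repo_id, revision="step10000")
```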
The aim of this release is to provide a transparent and easily reproducible set of models that can serve many purposes, such as cross-lingual comparison, inspection of monolingual performance, or analysis of how popular evaluation tasks behave across languages.
To illustrate this, we share several examples of evaluation results for the 2.15B reference models:
The plot below shows monolingual performance on the Belebele benchmark (multiple-choice machine reading comprehension) for 30 of the trained models (final checkpoint). Languages are sorted by performance, with separate entries for different scripts where the HPLT v2 dataset provides them.
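For readers who want to reproduce numbers of this kind, the sketch below computes Belebele-style multiple-choice accuracy by scoring each answer option with the model's log-likelihood and picking the highest-scoring one. The dataset identifier (facebook/belebele), its field names, and the model repository id are assumptions about the public Belebele release on HuggingFace, not a description of our exact evaluation setup.

```python
# Sketch of a Belebele-style multiple-choice evaluation: score each
# answer option by the log-likelihood the model assigns to it given
# the passage and question, then pick the argmax.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "HPLT/hplt2c_fra_checkpoints"  # hypothetical repository name
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

# Dataset config, split, and field names are assumptions about the
# public Belebele release on HuggingFace.
data = load_dataset("facebook/belebele", "fra_Latn", split="test")

def option_logprob(context: str, option: str) -> float:
    """Sum of token log-probabilities of `option` given `context`.

    Assumes the tokenization of `context` is a prefix of the
    tokenization of `context + option` (usually true at word
    boundaries)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logit at position i predicts token i+1, so shift by one.
    logprobs = logits.log_softmax(-1)[0, ctx_ids.shape[1] - 1 : -1]
    option_ids = full_ids[0, ctx_ids.shape[1] :]
    return logprobs[torch.arange(option_ids.shape[0]), option_ids].sum().item()

correct = 0
for ex in data:
    prompt = f"{ex['flores_passage']}\nQuestion: {ex['question']}\nAnswer: "
    scores = [option_logprob(prompt, ex[f"mc_answer{i}"]) for i in range(1, 5)]
    correct += int(scores.index(max(scores)) + 1 == int(ex["correct_answer_num"]))
print(f"Accuracy: {correct / len(data):.3f}")
```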
More detailed results can be computed for each language, e.g. the evolution of performance across the full 100B training tokens. The figures below illustrate such results on the same Belebele benchmark for French and Hindi.
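Such learning curves can be assembled by evaluating a subset of the intermediate checkpoints. Below is a minimal sketch, assuming the hypothetical “step…” revision naming from above, an evaluate_belebele helper (the scoring loop from the previous sketch wrapped into a function), and an illustrative step range; the actual number of training steps depends on the batch size used.

```python
# Sketch: tracking Belebele accuracy across training by evaluating
# intermediate checkpoints. The revision naming and step range are
# assumptions; checkpoints are released every 1,000 steps.
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM

repo_id = "HPLT/hplt2c_fra_checkpoints"  # hypothetical repository name
steps = range(5_000, 50_001, 5_000)      # illustrative sampling of checkpoints

accuracies = []
for step in steps:
    ckpt = AutoModelForCausalLM.from_pretrained(repo_id, revision=f"step{step}").eval()
    # evaluate_belebele: hypothetical helper wrapping the multiple-choice
    # loop from the previous sketch, returning accuracy for one model.
    accuracies.append(evaluate_belebele(ckpt))

plt.plot(list(steps), accuracies, marker="o")
plt.xlabel("Training step")
plt.ylabel("Belebele accuracy (fra_Latn)")
plt.savefig("belebele_fra_curve.png")
```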
Many more evaluations can be performed with these openly released 2.15B models trained on 100B tokens. We hope this release contributes to more transparent LLM development, and we plan to conduct further experiments with additional datasets, benchmarks, and upcoming HPLT dataset versions soon!
Links:
¹ https://hplt-project.org/
² https://openeurollm.eu/
³ https://huggingface.co/collections/HPLT/hplt-20-monolingual-reference-models-683047c3e47e16c2a5bb25af
⁴ https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned