Research output

Publications

Selected papers, preprints, and conference contributions related to the project.

Publication, 2026

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

A Klein, H Rakotoarison, L Thale-Bombien…

arXiv preprint arXiv …,

Most black-box optimization methods require extensive hyperparameter tuning, often limiting their ability to generalize across different optimization domains. Foundation models for …

Publication, 2026

BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning

L Thale-Bombien, J Ewald, R König, A Klein

arXiv preprint arXiv …,

The rapid advancement of high-throughput sequencing has led to large, high-dimensional omics datasets. Deep unsupervised learning architectures, particularly Autoencoders (AEs), …

Publication, 2026

ChemPile: A 250 GB Diverse and Curated Dataset for Chemical Foundation Models

A Mirza, N Alampara, M Ríos-García…

Advances in …,

Foundation models have shown remarkable success across scientific domains, yet their impact in chemistry remains limited due to the absence of diverse, large-scale, high-quality …

Publication, 2026

Deriving hyperparameter scaling laws via modern optimization theory

E Shulgin, D von Rütte, TH Zhang, N Ajroldi…

arXiv preprint arXiv …,

Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, …

Publication, 2026

Detecting generalization deficits in large language and reasoning models by using natural variations in simple problems

M Nezhurina, L Cipolina-Kun, M Cherti…

… on Machine Learning …,

Large language and reasoning models (LLMs, LRMs) are instances of foundation models exhibiting scaling laws that predict generalization improvement when increasing the pre-…

Publication, 2026

Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors

A Al Ali, J Helcl, J Libovický

… of the 19th Conference of the …,

LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing …

Publication, 2026

ELOQUENT Lab at CLEF 2026: Evaluation of Generative Language Model Quality

J Karlgren, M Barrett, O Bojar, MI Engels…

… on Information Retrieval,

The ELOQUENT lab for evaluation of generative language model quality and usefulness addresses high-level quality criteria for generative language models through a set of open-…

Publication, 2026

Knowledge Distillation as Decontamination? Revisiting the “Data Laundering” Concern in Classification Tasks

H Luo, R Vázquez, T Mickus, F Ginter…

The Fourteenth …,

Concerns have been raised that knowledge distillation may transfer test-set knowledge from a contaminated teacher to a clean student, a "data laundering" effect that potentially …

Publication, 2026

Learning in Compact Spaces with Approximately Normalized Transformer

J Franke, U Spiegelhalter…

Advances in …,

The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual …

Publication, 2026

Machine Translation for Low-Resource Languages through Monolingual Data and LLM: A Case Study of English-to-Basque

N Luu, A Soroa, G Rigau, O Bojar

… of the 19th Conference of the …,

Developing a machine translation (MT) system requires a considerable amount of high-quality parallel data, which is often limited for low-resource languages. This paper explores the …

Publication, 2026

On the Limits of Model Merging for Multilinguality in Pre-Training

S Aycock, F Vitiugin, A Umnov, C Monz…

arXiv preprint arXiv …,

Endowing models with consistent multilingual performance can be achieved by mixing pre-training data, or post-training approaches such as language-specific model merging. In this …

Publication, 2026

Open Machine Translation for Esperanto

O de Gibert, L de Gibert

arXiv preprint arXiv:2603.29345,

… 101195233. The contents of this publication are the sole responsibility of its authors and do not necessarily reflect the opinion of the European Union. …