Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Abstract
An automated framework for high-quality multilingual benchmark translation addresses semantic drift and context loss, improving the reliability of LLM evaluation through test-time compute scaling strategies.
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
Community
Existing multilingual benchmarks suffer from systematic translation quality issues that compromise reliable evaluation, including grammatical inconsistencies, context loss, and structural problems.
We present an automated framework that applies test-time compute scaling strategies to benchmark translation by sampling multiple candidates and intelligently combining or ranking them. Our two key methods are USI (Universal Self-Improvement) and T-RANK (Translation Ranking), which achieve higher scores on standard machine translation benchmarks compared to single-shot translation. Models evaluated on our translations show consistent improvements across all benchmarks, with LLM judges preferring our work over existing resources by ratios exceeding four to one.
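The paper does not spell out T-RANK's internals here, but the sample-then-rank idea it describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `judge` stands in for an LLM-as-a-judge pairwise comparator, and all names are hypothetical.

```python
from typing import Callable, List

def sample_candidates(generate: Callable[[], str], n: int) -> List[str]:
    """Sample n candidate translations. A real pipeline would call an
    LLM with temperature > 0; `generate` is any candidate source."""
    return [generate() for _ in range(n)]

def tournament_rank(candidates: List[str],
                    judge: Callable[[str, str], float]) -> str:
    """Multi-round ranking: each round, score candidates by pairwise
    judge preferences, keep the top half, repeat until one remains.
    judge(a, b) >= 0 means a is preferred over b."""
    pool = list(range(len(candidates)))  # rank indices, not strings,
    while len(pool) > 1:                 # so duplicates are handled
        scores = {i: 0 for i in pool}
        for x in range(len(pool)):
            for y in range(x + 1, len(pool)):
                a, b = pool[x], pool[y]
                winner = a if judge(candidates[a], candidates[b]) >= 0 else b
                scores[winner] += 1
        pool.sort(key=lambda i: scores[i], reverse=True)
        pool = pool[: max(1, len(pool) // 2)]
    return candidates[pool[0]]

# Illustrative only: a toy judge that prefers the longer string.
best = tournament_rank(
    ["bad ??", "good translation", "ok"],
    judge=lambda a, b: len(a) - len(b),
)
```

USI, by contrast, would combine the sampled candidates into a single improved output rather than selecting one; the multi-round elimination above is the ranking variant.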
We release the complete framework and translated benchmarks for MMLU, HellaSwag, ARC, and WinoGrande across eight Eastern and Southern European languages to enable more reliable multilingual AI evaluation.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop (2026)
- Translation as a Scalable Proxy for Multilingual Evaluation (2026)
- Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework (2026)
- Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark (2026)
- GRRM: Group Relative Reward Modeling for Machine Translation (2026)
- Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation (2026)
- BYOL: Bring Your Own Language Into LLMs (2026)