Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Abstract
An automated framework for high-quality multilingual benchmark translation addresses semantic drift and context loss, improving the reliability of LLM evaluation through test-time compute scaling strategies.
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
Community
Existing multilingual benchmarks suffer from systematic translation quality issues that compromise reliable evaluation, including grammatical inconsistencies, context loss, and structural problems.
We present an automated framework that applies test-time compute scaling strategies to benchmark translation by sampling multiple candidates and intelligently combining or ranking them. Our two key methods are USI (Universal Self-Improvement) and T-RANK (Translation Ranking), which achieve higher scores on standard machine translation benchmarks compared to single-shot translation. Models evaluated on our translations show consistent improvements across all benchmarks, with LLM judges preferring our work over existing resources by ratios exceeding four to one.
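The paper does not spell out T-RANK's internals here, but the sample-then-rank idea it describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `judge` stands in for an LLM-as-a-judge pairwise comparator, and all names are hypothetical.

```python
from typing import Callable, List

def sample_candidates(generate: Callable[[], str], n: int) -> List[str]:
    """Sample n candidate translations. A real pipeline would call an
    LLM with temperature > 0; `generate` is any candidate source."""
    return [generate() for _ in range(n)]

def tournament_rank(candidates: List[str],
                    judge: Callable[[str, str], float]) -> str:
    """Multi-round ranking: each round, score candidates by pairwise
    judge preferences, keep the top half, repeat until one remains.
    judge(a, b) >= 0 means a is preferred over b."""
    pool = list(range(len(candidates)))  # rank indices, not strings,
    while len(pool) > 1:                 # so duplicates are handled
        scores = {i: 0 for i in pool}
        for x in range(len(pool)):
            for y in range(x + 1, len(pool)):
                a, b = pool[x], pool[y]
                winner = a if judge(candidates[a], candidates[b]) >= 0 else b
                scores[winner] += 1
        pool.sort(key=lambda i: scores[i], reverse=True)
        pool = pool[: max(1, len(pool) // 2)]
    return candidates[pool[0]]

# Illustrative only: a toy judge that prefers the longer string.
best = tournament_rank(
    ["bad ??", "good translation", "ok"],
    judge=lambda a, b: len(a) - len(b),
)
```

USI, by contrast, would combine the sampled candidates into a single improved output rather than selecting one; the multi-round elimination above is the ranking variant.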
We release the complete framework and translated benchmarks for MMLU, HellaSwag, ARC, and WinoGrande across eight Eastern and Southern European languages to enable more reliable multilingual AI evaluation.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop (2026)
- Translation as a Scalable Proxy for Multilingual Evaluation (2026)
- Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework (2026)
- Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark (2026)
- GRRM: Group Relative Reward Modeling for Machine Translation (2026)
- Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation (2026)
- BYOL: Bring Your Own Language Into LLMs (2026)