arxiv:2606.31551

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

Published on Jun 30

Abstract

AutoTrainess enables autonomous language model training by providing structured agent-computer interfaces that guide planning, data preparation, training, evaluation, and logging operations more effectively than traditional command-line approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that autonomous post-training is not just a coding problem: it requires the agent to repeatedly plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state across many hours of interaction. We present AutoTrainess, an LM agent that exposes these operations as a repository of agent-computer interfaces for planning, data preparation, training, evaluation, and logging. Rather than leaving the agent to operate in a raw CLI environment with an underspecified action space, AutoTrainess externalizes prior human experience as explicit workflows, rules, and execution constraints that guide the agent toward effective and reliable training behavior. On PostTrainBench, AutoTrainess consistently outperforms CLI-only baselines, achieving an average score of 26.94 with GPT-5.4 (Codex) versus 23.21 for CLI-only. It also generalizes across models and harnesses, improving DeepSeek-V4-Flash (OpenCode) from 12.13 to 19.58.
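
To make the contrast with a raw CLI more concrete, here is a minimal sketch of what a structured interface layer with execution constraints and logging might look like. It is an illustration only, not the AutoTrainess implementation: the names `Interface`, `Harness`, and `build_demo_harness`, the `requires` field, and the stub operations are all hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Interface:
    """One structured operation the agent may call (hypothetical shape)."""
    name: str
    requires: List[str]            # operations that must already have run
    run: Callable[[dict], dict]    # takes arguments, returns a result record


@dataclass
class Harness:
    """Registry that checks preconditions and logs every successful call."""
    interfaces: Dict[str, Interface] = field(default_factory=dict)
    log: List[dict] = field(default_factory=list)

    def register(self, iface: Interface) -> None:
        self.interfaces[iface.name] = iface

    def call(self, name: str, args: dict) -> dict:
        if name not in self.interfaces:
            raise KeyError(f"unknown operation: {name}")
        iface = self.interfaces[name]
        done = {entry["op"] for entry in self.log}
        missing = [r for r in iface.requires if r not in done]
        if missing:
            # Execution constraint: refuse out-of-order steps up front
            # rather than letting a long job fail later.
            raise RuntimeError(f"{name} requires {missing} first")
        result = iface.run(args)
        # The log doubles as persistent experiment state across iterations.
        self.log.append({"op": name, "args": args, "result": result})
        return result


def build_demo_harness() -> Harness:
    """Wire up stub operations; real ones would launch actual jobs."""
    h = Harness()
    h.register(Interface("plan", [], lambda a: {"goal": a["goal"]}))
    h.register(Interface("prepare_data", ["plan"], lambda a: {"rows": a["rows"]}))
    h.register(Interface("train", ["prepare_data"], lambda a: {"checkpoint": "ckpt-0"}))
    h.register(Interface("evaluate", ["train"], lambda a: {"score": a["score"]}))
    return h
```

In this sketch, calling `train` before `prepare_data` raises an error immediately, which is one simple way a rule could narrow the action space compared with an unconstrained shell.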

Community

zjy2001

Paper submitter 1 day ago

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

TimeLordRaps

about 22 hours ago

How big do you think a model needs to be to be able to escape? Size helps it be smart enough to escape, but also makes transferring and hiding its weights significantly harder, though a big enough model can figure out decentralized serving and self-scaling. It's so end game lol. I built something similar, with more modularity and standardization of its interface. Direct it at itself as a training environment, i.e., can current agents train a model to train models? fable 5 training sonnet 5 to train haiku 5s.