Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
Abstract
The paper provides a theoretical foundation for optimizing sequence-level rewards in reinforcement learning using token-level objectives, highlighting the importance of techniques like importance sampling correction, clipping, and Routing Replay for stabilizing training, especially with large language models.
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
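For intuition, the objective described above can be sketched as follows; the notation (sequence-level reward R, rollout/inference policy π_roll, per-token advantage Â_t) is assumed for illustration and may differ from the paper's.

```latex
% Sketch only: notation assumed for illustration, not taken from the paper.
% True sequence-level objective and its REINFORCE gradient, followed by the
% token-level surrogate evaluated on samples from the rollout/inference policy.
\begin{aligned}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta}\!\left[ R(y) \right], \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\!\Big[ R(y) \sum_{t} \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \Big], \\
\hat{J}(\theta) &= \mathbb{E}_{y \sim \pi_{\text{roll}}}\!\Big[ \sum_{t} \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\text{roll}}(y_t \mid y_{<t})}\, \hat{A}_t \Big]
\quad \text{(token-level surrogate with per-token IS correction).}
\end{aligned}
```

The gradient of the surrogate agrees with the true sequence-level gradient only to first order in the gap between π_roll and π_θ, which is why the approximation stays valid only when the train-inference discrepancy and policy staleness are both kept small.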
Community
From the simple and intuitive perspective of "first-order approximation", we formulate and explain the rationale behind optimizing sequence-level rewards using token-level objectives, and highlight that the validity of this approximation requires minimizing both the "train-inference gap" and "policy staleness".
Our formulation provides a principled explanation: stabilization techniques such as importance sampling (IS) correction, clipping, and Routing Replay all fundamentally serve to maintain the validity of this first-order approximation.
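As a concrete illustration of how IS correction and clipping enter the token-level objective, here is a minimal PyTorch-style sketch; the function signature, tensor names, and clipping range `eps` are our assumptions rather than the paper's code.

```python
import torch

def token_level_pg_loss(logp_train, logp_rollout, advantages, mask, eps=0.2):
    """PPO-style token-level surrogate with per-token importance-sampling
    correction and clipping (a sketch; not the paper's implementation).

    logp_train:   log-probs of sampled tokens under the current training policy
    logp_rollout: log-probs of the same tokens recorded at rollout/inference time
    advantages:   per-token advantage (e.g., a sequence-level reward broadcast to tokens)
    mask:         1 for response tokens, 0 for prompt/padding
    """
    # Per-token importance ratio corrects for the train-inference gap and policy staleness.
    ratio = torch.exp(logp_train - logp_rollout)
    # Clipping bounds the update when the ratio drifts too far from 1,
    # i.e., when the first-order approximation is no longer trustworthy.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_token = -torch.minimum(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum().clamp_min(1.0)
```

Seen this way, clipping simply refuses updates for tokens whose ratio has drifted so far from 1 that the first-order approximation no longer holds.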
We conduct extensive experiments using a 30B MoE model (hundreds of thousands of GPU hours, with FP8 inference and BF16 training), which strongly validate the above predictions and help us identify effective recipes for stable RL training. In particular, we demonstrate that as long as training remains stable over the long term, different cold-start initializations consistently converge to similar performance levels. We firmly believe that stability is the key to scaling reinforcement learning!
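Routing Replay can likewise be illustrated conceptually: record the MoE router's expert choices at rollout time and replay them in the training forward pass so that both passes activate the same experts. The sketch below uses a hypothetical router interface and is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class ReplayableRouter(torch.nn.Module):
    """Conceptual MoE top-k router supporting Routing Replay (a sketch;
    class and argument names are hypothetical, not the paper's code)."""

    def __init__(self, hidden_dim, num_experts, top_k=2):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, replay_indices=None):
        logits = self.gate(x)                      # [num_tokens, num_experts]
        if replay_indices is None:
            # Normal routing: pick top-k experts from the current gate.
            weights, indices = torch.topk(logits, self.top_k, dim=-1)
        else:
            # Routing Replay: reuse the expert indices recorded during rollout,
            # while still differentiating through the current gate's logits.
            indices = replay_indices
            weights = torch.gather(logits, -1, indices)
        weights = F.softmax(weights, dim=-1)
        return weights, indices                    # expert dispatch omitted
```

Replaying the rollout-time expert assignments removes a source of train-inference discrepancy that is specific to MoE models, which is consistent with the role the paper attributes to Routing Replay.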
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers (2025)
- Soft Adaptive Policy Optimization (2025)
- Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts (2025)
- ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training (2025)
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping (2025)
- ASPO: Asymmetric Importance Sampling Policy Optimization (2025)
- Defeating the Training-Inference Mismatch via FP16 (2025)