Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
Abstract
The paper provides a theoretical foundation for optimizing sequence-level rewards in reinforcement learning using token-level objectives, highlighting the importance of techniques like importance sampling correction, clipping, and Routing Replay for stabilizing training, especially with large language models.
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
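For intuition, the objective described above can be sketched as follows; the notation (sequence-level reward R, rollout/inference policy π_roll, per-token advantage Â_t) is assumed for illustration and may differ from the paper's.

```latex
% Sketch only: notation assumed for illustration, not taken from the paper.
% True sequence-level objective and its REINFORCE gradient, followed by the
% token-level surrogate evaluated on samples from the rollout/inference policy.
\begin{aligned}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta}\!\left[ R(y) \right], \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\!\Big[ R(y) \sum_{t} \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \Big], \\
\hat{J}(\theta) &= \mathbb{E}_{y \sim \pi_{\text{roll}}}\!\Big[ \sum_{t} \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\text{roll}}(y_t \mid y_{<t})}\, \hat{A}_t \Big]
\quad \text{(token-level surrogate with per-token IS correction).}
\end{aligned}
```

The gradient of the surrogate agrees with the true sequence-level gradient only to first order in the gap between π_roll and π_θ, which is why the approximation stays valid only when the train-inference discrepancy and policy staleness are both kept small.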
Community
From the simple and intuitive perspective of "first-order approximation", we formulate and explain the rationale behind optimizing sequence-level rewards using token-level objectives, and highlight that the validity of this approximation requires minimizing both the "train-inference gap" and "policy staleness".
Our formulation provides a principled explanation: stabilization techniques such as importance sampling (IS) correction, clipping, and Routing Replay all fundamentally serve to maintain the validity of this first-order approximation.
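As a concrete illustration of how IS correction and clipping enter the token-level objective, here is a minimal PyTorch-style sketch; the function signature, tensor names, and clipping range `eps` are our assumptions rather than the paper's code.

```python
import torch

def token_level_pg_loss(logp_train, logp_rollout, advantages, mask, eps=0.2):
    """PPO-style token-level surrogate with per-token importance-sampling
    correction and clipping (a sketch; not the paper's implementation).

    logp_train:   log-probs of sampled tokens under the current training policy
    logp_rollout: log-probs of the same tokens recorded at rollout/inference time
    advantages:   per-token advantage (e.g., a sequence-level reward broadcast to tokens)
    mask:         1 for response tokens, 0 for prompt/padding
    """
    # Per-token importance ratio corrects for the train-inference gap and policy staleness.
    ratio = torch.exp(logp_train - logp_rollout)
    # Clipping bounds the update when the ratio drifts too far from 1,
    # i.e., when the first-order approximation is no longer trustworthy.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_token = -torch.minimum(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum().clamp_min(1.0)
```

Seen this way, clipping simply refuses updates for tokens whose ratio has drifted so far from 1 that the first-order approximation no longer holds.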
We conduct extensive experiments using a 30B MoE model (hundreds of thousands of GPU hours, with FP8 inference and BF16 training), which strongly validate the above predictions and help us identify effective recipes for stable RL training. In particular, we demonstrate that as long as training remains stable over the long term, different cold-start initializations consistently converge to similar performance levels. We firmly believe that stability is the key to scaling reinforcement learning!
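Routing Replay can likewise be illustrated conceptually: record the MoE router's expert choices at rollout time and replay them in the training forward pass so that both passes activate the same experts. The sketch below uses a hypothetical router interface and is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class ReplayableRouter(torch.nn.Module):
    """Conceptual MoE top-k router supporting Routing Replay (a sketch;
    class and argument names are hypothetical, not the paper's code)."""

    def __init__(self, hidden_dim, num_experts, top_k=2):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, replay_indices=None):
        logits = self.gate(x)                      # [num_tokens, num_experts]
        if replay_indices is None:
            # Normal routing: pick top-k experts from the current gate.
            weights, indices = torch.topk(logits, self.top_k, dim=-1)
        else:
            # Routing Replay: reuse the expert indices recorded during rollout,
            # while still differentiating through the current gate's logits.
            indices = replay_indices
            weights = torch.gather(logits, -1, indices)
        weights = F.softmax(weights, dim=-1)
        return weights, indices                    # expert dispatch omitted
```

Replaying the rollout-time expert assignments removes a source of train-inference discrepancy that is specific to MoE models, which is consistent with the role the paper attributes to Routing Replay.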
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers (2025)
- Soft Adaptive Policy Optimization (2025)
- Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts (2025)
- ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training (2025)
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping (2025)
- ASPO: Asymmetric Importance Sampling Policy Optimization (2025)
- Defeating the Training-Inference Mismatch via FP16 (2025)