Dominick Wirzba
Chronuid
Recent Activity
reacted to qgallouedec's post with 🔥 about 19 hours ago
TRL v1.2 introduces the SSDTrainer 🚀
Simple Self-Distillation (SSD) from Apple's paper "Embarrassingly Simple Self-Distillation Improves Code Generation" is now available as an experimental trainer in TRL.
The recipe is as minimal as the name suggests: sample completions from the model itself at a training-time temperature, then fine-tune on those raw, unverified samples with plain cross-entropy. No reward model. No verifier. No teacher model. No reinforcement learning. Just prompts and the model.
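```python
from trl.experimental.ssd import SSDConfig, SSDTrainer

# `dataset` is not defined in this snippet; it is assumed to be a prompt
# dataset loaded beforehand.
trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    # Sampling settings used to generate the self-distillation targets.
    args=SSDConfig(temperature=0.6, top_k=20, top_p=0.95),
    train_dataset=dataset,
)
trainer.train()
```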
v1.2 also ships expanded tool-calling support (LLaMA 3.1 / 3.2, DeepSeek-V3), another round of KTO ↔ DPO alignment getting us closer to promoting KTO to stable, a big GRPO simplification for overlong tool results, deprecation of `use_transformers_paged`, and key fixes for VLM response parsing.
Full release notes: https://github.com/huggingface/trl/releases/tag/v1.2.0
reacted to sergiopaniego's post with 🔥 2 days ago
Earlier this month, Apple introduced Simple Self-Distillation: a fine-tuning method that improves models on coding tasks just by sampling from the model and training on its own outputs with plain cross-entropy.
And… it's already supported in TRL, built by Kashif Rasul. You can really feel the pace of development in the team 🐎
Paper by Ruixiang ZHANG, He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang at Apple 🍎
How it works: the model generates completions at a training-time temperature (T_train) with top_k/top_p truncation, then fine-tunes on them with plain cross-entropy. No labels or verifier needed.
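To make the sampling step concrete, here is a toy single-token sketch in plain Python. It is not TRL's implementation: the function names (`truncated_distribution`, `sample_token`, `cross_entropy`) are hypothetical, the logits are made up, and the truncation order (temperature, then top_k, then top_p on the renormalized top_k set) is an assumption based on the common convention in sampling code.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def truncated_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token_id: prob} after temperature, top-k, then top-p truncation."""
    probs = softmax(logits, temperature)
    # top-k: keep the k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)
    # top-p: keep the shortest prefix whose cumulative (renormalized)
    # probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_token(dist, rng):
    """Draw one token id from the truncated distribution."""
    token_ids = list(dist)
    return rng.choices(token_ids, weights=[dist[t] for t in token_ids], k=1)[0]


def cross_entropy(logits, target_id):
    """Negative log-likelihood of the sampled token under the untempered model."""
    return -math.log(softmax(logits)[target_id])


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up next-token logits
dist = truncated_distribution(toy_logits, temperature=0.6, top_k=3, top_p=0.9)
token = sample_token(dist, random.Random(0))
loss = cross_entropy(toy_logits, token)
print(sorted(dist), token, round(loss, 4))
```

The sampled token becomes the training target, and the loss is computed against the model's ordinary (T = 1, untruncated) distribution, which is what "plain cross-entropy" refers to in this sketch.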
You can try it right away with this ready-to-run example (Qwen3-4B on rStar-Coder):
https://github.com/huggingface/trl/blob/main/trl/experimental/ssd/ssd.py
or benchmark a checkpoint with the eval script:
https://github.com/huggingface/trl/blob/main/trl/experimental/ssd/ssd_eval.py
One neat insight from the paper: T_train and T_eval compose into an effective T_eff = T_train × T_eval, so a broad band of configs works well. Even very noisy samples still help.
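A quick numerical check of why the two temperatures multiply, under an idealized assumption that is mine rather than the paper's: suppose fine-tuning makes the model's T = 1 distribution exactly equal to the base model's distribution at T_train, and ignore top_k/top_p truncation. The logits and temperature values below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


base_logits = [2.0, 1.0, 0.5, -1.0]  # made-up base-model logits
t_train, t_eval = 0.6, 1.5

# Idealized fine-tuned model: its log-probabilities at T = 1 match the base
# model sampled at T_train (assumes a perfect fit and no truncation).
tuned_logits = [math.log(p) for p in softmax_with_temperature(base_logits, t_train)]

# Decoding the tuned model at T_eval ...
two_step = softmax_with_temperature(tuned_logits, t_eval)
# ... matches decoding the base model at T_train * T_eval.
one_step = softmax_with_temperature(base_logits, t_train * t_eval)

max_gap = max(abs(a - b) for a, b in zip(two_step, one_step))
print(max_gap < 1e-9)
```

Dividing the logits by T_train and then by T_eval is the same as dividing once by their product, and the normalization constant cancels inside the softmax. With truncation or an imperfect fit the match would only be approximate.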
Want to dig deeper?
Paper: https://huggingface.co/papers/2604.01193
Trainer docs: https://huggingface.co/docs/trl/main/en/ssd_trainer