You can get the Unity environment from GitHub.
Model Card: PPO Agent on 24x24-GrassWorld Stochastic (ToucanHush Environment)
Model Details
- Model Type: Proximal Policy Optimization (PPO)
- Framework: Stable-Baselines3
- Environment: Custom Unity ML-Agents environment (ToucanHush-24x24-GrassWorld Stochastic)
- Author: Ahmed El Mahdi BENDOU
- License: MIT
- Status: Prototype (intermediate-stage training, stochastic baseline policy)
This model is a continuation of the curriculum learning effort. Unlike the earlier 12x12 deterministic setup, this version introduces stochastic transitions and a larger grid, requiring the agent to generalize navigation and throwing strategies under uncertainty.
Intended Use
- Baseline reference for stochastic and larger-grid environments.
- Educational demonstration of Unity ML-Agents + PPO training under stochasticity.
Not intended for production or safety-critical applications.
Environment Specification
Name: 24x24-GrassWorld Stochastic
Grid size: 24 × 24
Agent Actions:
- Move Forward
- Move Backward
- Turn Left / Turn Right
- Throw Banana 🍌
- Do nothing
Rewards:
- +1 for reaching/scoring a stationed toucan
- -1 for bumping into walls
- -0.01 penalty per step (encourages efficiency)
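As an illustration of how these terms combine, the following minimal Python sketch totals the reward for one episode. It assumes the three terms simply add and that the step penalty applies on every step, including steps that also trigger another reward. The function and argument names are hypothetical and not taken from the environment's source.

```python
# Reward values as listed in this card.
TOUCAN_REWARD = 1.0
WALL_PENALTY = -1.0
STEP_PENALTY = -0.01


def episode_return(steps: int, wall_bumps: int, toucans_reached: int) -> float:
    """Undiscounted episode return, assuming the reward terms are additive."""
    return (
        toucans_reached * TOUCAN_REWARD
        + wall_bumps * WALL_PENALTY
        + steps * STEP_PENALTY
    )


if __name__ == "__main__":
    # Example: 40 steps, no wall bumps, one toucan reached.
    print(round(episode_return(steps=40, wall_bumps=0, toucans_reached=1), 2))
```

Under these assumptions, a single +1 reward is cancelled out by 100 steps of penalty, which is why the per-step term pushes the agent toward shorter paths.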
Special Mechanics:
- Agent can throw at ~27° to hit a distant toucan.
- Stochastic transitions: throw outcomes vary probabilistically because toucans spawn in random locations each episode.
Training Details
Trainer: PPO
Max steps: 20,000,000
Checkpoint frequency: every 250,000 steps
Hyperparameters
Batch size: 1024
Buffer size: 102,400
Learning rate: 0.0001 (linear decay)
β (entropy regularization): 0.001
ε (PPO clip range): 0.2
λ (GAE): 0.99
Epochs per update: 3
Time horizon: 1000
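To make the "linear decay" schedule and the batch/buffer relationship concrete, here is a minimal Python sketch. It assumes the learning rate decays linearly to zero at the maximum step count, and that each policy update makes the listed number of passes over a full buffer split into minibatches. The actual trainer may use a small non-zero floor or a different update layout; the function name is hypothetical.

```python
INITIAL_LR = 1e-4
MAX_STEPS = 20_000_000
BATCH_SIZE = 1024
BUFFER_SIZE = 102_400
NUM_EPOCHS = 3


def linear_lr(step: int, initial_lr: float = INITIAL_LR, max_steps: int = MAX_STEPS) -> float:
    """Learning rate after `step` environment steps, assuming linear decay to 0."""
    # Clamp progress to [0, 1] so out-of-range steps do not produce odd values.
    progress = min(max(step / max_steps, 0.0), 1.0)
    return initial_lr * (1.0 - progress)


# Minibatches per pass over a full buffer, and gradient steps per policy update
# (assumption: every epoch visits the whole buffer once).
minibatches_per_epoch = BUFFER_SIZE // BATCH_SIZE
gradient_steps_per_update = minibatches_per_epoch * NUM_EPOCHS

if __name__ == "__main__":
    for step in (0, 5_000_000, 10_000_000, 20_000_000):
        print(step, linear_lr(step))
    print(minibatches_per_epoch, gradient_steps_per_update)
```

With the values above, the buffer divides evenly into 100 minibatches, so an update under this assumption performs 300 gradient steps.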
Network Settings
Hidden units: 256
Layers: 2 fully connected
Normalization: Enabled
Reward Signals
Extrinsic:
γ = 0.99
Strength = 1.0
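For readers who want to reproduce a similar run, the settings listed above can be collected into a single structure. The sketch below is an assumption, not the author's configuration file: the key names mirror the Unity ML-Agents trainer configuration layout, which the listed settings resemble, and the behavior name is hypothetical.

```python
# Hypothetical behavior name; the card does not state the real one.
BEHAVIOR_NAME = "ToucanHushAgent"

# Values copied from this card; key names assumed to follow the
# Unity ML-Agents trainer configuration layout.
trainer_config = {
    "behaviors": {
        BEHAVIOR_NAME: {
            "trainer_type": "ppo",
            "max_steps": 20_000_000,
            "time_horizon": 1000,
            "checkpoint_interval": 250_000,
            "hyperparameters": {
                "batch_size": 1024,
                "buffer_size": 102_400,
                "learning_rate": 1.0e-4,
                "learning_rate_schedule": "linear",
                "beta": 1.0e-3,
                "epsilon": 0.2,
                "lambd": 0.99,
                "num_epoch": 3,
            },
            "network_settings": {
                "normalize": True,
                "hidden_units": 256,
                "num_layers": 2,
            },
            "reward_signals": {
                "extrinsic": {"gamma": 0.99, "strength": 1.0},
            },
        }
    }
}

behavior = trainer_config["behaviors"][BEHAVIOR_NAME]
# Number of checkpoint intervals in a full-length run.
num_checkpoints = behavior["max_steps"] // behavior["checkpoint_interval"]

if __name__ == "__main__":
    print(num_checkpoints)
```

A run that uses the full step budget spans 80 checkpoint intervals. How many checkpoint files are actually retained depends on trainer settings not listed in this card.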
The policy has achieved partial competency: navigation is improved compared to the 12x12 baseline, but efficiency drops due to stochasticity. Throw usage is more adaptive but inconsistent.
Evaluation
Observed Behavior:
The agent learns to adapt to uncertain movement and occasionally uses throw effectively. However, exploration in the larger state space remains inefficient.
Limitations:
- Sensitivity to stochastic randomness in transitions.
- Difficulty in scaling exploration to 24x24 grid.
- Suboptimal throw frequency in high-uncertainty states.
Future Work
This model represents the second step in the curriculum learning experiment:
- Deterministic small grid (12x12)
- Stochastic larger grid (24x24)
- Planned: adversarial settings, dynamic rewards, multi-agent scenarios.
- Better logging and reproducibility pipelines on GitHub.
Citation
If you use this model, please cite:
@misc{bendou2025grassworldppo,
author = {Ahmed El Mahdi BENDOU},
title = {PPO Agent trained on ToucanHush 24x24-GrassWorld Stochastic (Unity ML-Agents)},
year = {2025},
howpublished = {\url{https://huggingface.co/partzel/ToucanHush-24x24GrassWorldStochastic}},
}
Assets Pack
All assets have been custom made for this environment and you can get them for free from here
