The Unity environment is available on GitHub.

Model Card: PPO Agent on 24x24-GrassWorld Stochastic (ToucanHush Environment)

Model Details

Model Type: Proximal Policy Optimization (PPO)
Framework: Stable-Baselines3
Environment: Custom Unity ML-Agents environment (ToucanHush-24x24-GrassWorld Stochastic)
Author: Ahmed El Mahdi BENDOU
License: MIT
Status: Prototype (intermediate-stage training, stochastic baseline policy)

This model is a continuation of the curriculum learning effort. Unlike the earlier 12x12 deterministic setup, this version introduces stochastic transitions and a larger grid, requiring the agent to generalize navigation and throwing strategies under uncertainty.

Intended Use

Baseline reference for stochastic and larger-grid environments.
Educational demonstration of Unity ML-Agents + PPO training under stochasticity.

Not intended for production or safety-critical applications.

Environment Specification

Name: 24x24-GrassWorld Stochastic
Grid size: 24 × 24

Agent Actions:

Move Forward
Move Backward
Turn Left / Turn Right
Throw Banana 🍌
Do nothing

Rewards:

+1 for reaching/scoring a stationed toucan
-1 for bumping into walls
-0.01 penalty per step (encourages efficiency)
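The reward terms above can be sketched as a small function. This is an illustrative sketch only: the actual reward logic lives in the Unity environment, and the names `step_reward` and `episode_return` are hypothetical. It assumes the three terms are additive and that the step penalty applies on every step, including the one that scores.

```python
# Reward values taken from the list above.
REWARD_TOUCAN = 1.0   # reaching/scoring a stationed toucan
PENALTY_WALL = -1.0   # bumping into a wall
PENALTY_STEP = -0.01  # applied on every step


def step_reward(reached_toucan: bool, bumped_wall: bool) -> float:
    """Hypothetical per-step reward; assumes the terms simply add up."""
    reward = PENALTY_STEP
    if reached_toucan:
        reward += REWARD_TOUCAN
    if bumped_wall:
        reward += PENALTY_WALL
    return reward


def episode_return(steps_taken: int, toucans_scored: int, wall_bumps: int) -> float:
    """Undiscounted return of one episode under the same additive assumption."""
    return (
        toucans_scored * REWARD_TOUCAN
        + wall_bumps * PENALTY_WALL
        + steps_taken * PENALTY_STEP
    )
```

Under these assumptions, an episode that scores one toucan in 40 steps without touching a wall returns about 0.6, while a single wall bump cancels the reward of a full toucan hit.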

Special Mechanics:

The agent can throw a banana at ~27° to hit a distant toucan.
Stochastic transitions:
- Throw outcomes vary probabilistically, and toucans spawn at random locations each episode.

Training Details

Trainer: PPO
Max steps: 20,000,000
Checkpoint frequency: every 250,000 steps

Hyperparameters
Batch size: 1024
Buffer size: 102,400
Learning rate: 0.0001 (linear decay)
β (entropy regularization): 0.001
ε (PPO clip range): 0.2
λ (GAE): 0.99
Epochs per update: 3
Time horizon: 1000
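To make the numbers above concrete, the following sketch works through the arithmetic they imply. It is not the trainer's code. It assumes the learning rate decays linearly to zero at the maximum step count (the trainer may use a small non-zero floor) and that each policy update splits the full buffer into minibatches of the batch size. The function name `linear_lr` and the `min_lr` parameter are hypothetical.

```python
# Values taken from the training details and hyperparameters above.
MAX_STEPS = 20_000_000
CHECKPOINT_EVERY = 250_000
INITIAL_LR = 1e-4
BATCH_SIZE = 1024
BUFFER_SIZE = 102_400
NUM_EPOCHS = 3


def linear_lr(step: int, min_lr: float = 0.0) -> float:
    """Linearly decayed learning rate; min_lr is an assumed floor."""
    progress = min(max(step / MAX_STEPS, 0.0), 1.0)
    return max(INITIAL_LR * (1.0 - progress), min_lr)


# Minibatches needed to pass over the buffer once.
minibatches_per_epoch = BUFFER_SIZE // BATCH_SIZE                # 100
# Gradient steps per policy update (all epochs over the buffer).
gradient_steps_per_update = minibatches_per_epoch * NUM_EPOCHS   # 300
# Number of times the buffer fills completely during training.
policy_updates = MAX_STEPS // BUFFER_SIZE                        # 195
# Checkpoints written over the whole run.
checkpoints = MAX_STEPS // CHECKPOINT_EVERY                      # 80
```

Halfway through training, at step 10,000,000, this schedule gives a learning rate of about 5e-5.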

Network Settings
Hidden units: 256
Layers: 2 fully connected
Normalization: Enabled

Reward Signals
Extrinsic:
γ = 0.99
Strength = 1.0
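With γ = 0.99 from the reward signal and λ = 0.99 from the hyperparameters, advantages are weighted over a long window. The sketch below shows the standard Generalized Advantage Estimation recursion in plain Python to illustrate how the two values interact. It is not the trainer's implementation, and the function name `gae_advantages` is hypothetical. It assumes a single trajectory segment with no episode boundary inside it, and a `bootstrap_value` of 0 when the segment ends in a terminal state.

```python
from typing import List

GAMMA = 0.99   # discount factor of the extrinsic reward signal
LAMBDA = 0.99  # GAE lambda

# Rough effective horizon of the discount: about 100 steps.
effective_horizon = 1.0 / (1.0 - GAMMA)


def gae_advantages(
    rewards: List[float],
    values: List[float],
    bootstrap_value: float,
    gamma: float = GAMMA,
    lam: float = LAMBDA,
) -> List[float]:
    """Standard GAE over one segment, computed backwards in time."""
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

For example, with rewards `[-0.01, -0.01, 1.0]`, zero value estimates, and a terminal bootstrap of 0, the advantage at the first step is still about 0.94, so credit for the final hit reaches earlier steps almost undiminished.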

The policy has achieved partial competency: navigation is improved compared to the 12x12 baseline, but efficiency drops due to stochasticity. Throw usage is more adaptive but inconsistent.

Evaluation

Observed Behavior:
The agent learns to adapt to uncertain movement and occasionally uses the throw action effectively. However, exploration in the larger state space remains inefficient.
Limitations:
- Sensitivity to randomness in transitions.
- Difficulty in scaling exploration to the 24x24 grid.
- Suboptimal throw frequency in high-uncertainty states.

Future Work

This model represents the second step in the curriculum learning experiment:

Deterministic small grid (12x12)
Stochastic larger grid (24x24)
Planned: adversarial settings, dynamic rewards, multi-agent scenarios.
Better logging and reproducibility pipelines on GitHub.

Citation

If you use this model, please cite:

@misc{bendou2025grassworldppo,
  author       = {Ahmed El Mahdi BENDOU},
  title        = {PPO Agent trained on ToucanHush 24x24-GrassWorld Stochastic (Unity ML-Agents)},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/partzel/ToucanHush-24x24GrassWorldStochastic}},
}

Assets Pack

All assets were custom-made for this environment and are available for free.
