The Unity environment is available on GitHub.


Model Card: PPO Agent on 24x24-GrassWorld Stochastic (ToucanHush Environment)

Model Details

  • Model Type: Proximal Policy Optimization (PPO)
  • Framework: Stable-Baselines3
  • Environment: Custom Unity ML-Agents environment (ToucanHush-24x24-GrassWorld Stochastic)
  • Author: Ahmed El Mahdi BENDOU
  • License: MIT
  • Status: Prototype (intermediate-stage training, stochastic baseline policy)

This model is a continuation of the curriculum learning effort. Unlike the earlier 12x12 deterministic setup, this version introduces stochastic transitions and a larger grid, requiring the agent to generalize navigation and throwing strategies under uncertainty.


Intended Use

  • Baseline reference for stochastic and larger-grid environments.
  • Educational demonstration of Unity ML-Agents + PPO training under stochasticity.

Not intended for production or safety-critical applications.


Environment Specification

Name: 24x24-GrassWorld Stochastic
Grid size: 24 × 24

Agent Actions:

  • Move Forward
  • Move Backward
  • Turn Left / Turn Right
  • Throw Banana 🍌
  • Do nothing
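
For readers wiring this environment to their own code, the action list above could be represented as a small enumeration. This is only a sketch: the index order and the use of a single discrete branch are assumptions, since the card lists the action names but not how they are encoded.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical encoding of the six listed actions.

    The index order is an assumption; the card only names the actions.
    """
    MOVE_FORWARD = 0
    MOVE_BACKWARD = 1
    TURN_LEFT = 2
    TURN_RIGHT = 3
    THROW_BANANA = 4
    DO_NOTHING = 5


# Size of the (assumed) single discrete action branch.
NUM_ACTIONS = len(Action)
```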

Rewards:

  • +1 for reaching/scoring a stationed toucan
  • -1 for bumping into walls
  • -0.01 penalty per step (encourages efficiency)
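
The reward terms above can be sketched as a per-step function. This assumes the terms are additive within a single step, which the card does not state; the function and constant names are illustrative and not taken from the environment's source.

```python
# Reward values as listed in the card.
TOUCAN_REWARD = 1.0
WALL_PENALTY = -1.0
STEP_PENALTY = -0.01


def step_reward(hit_wall: bool, scored_toucan: bool) -> float:
    """Return the reward for one step.

    Assumption: the step penalty always applies and the other
    terms are added on top of it when their event occurs.
    """
    reward = STEP_PENALTY
    if hit_wall:
        reward += WALL_PENALTY
    if scored_toucan:
        reward += TOUCAN_REWARD
    return reward
```

Under this assumption, an uneventful step costs 0.01, so an agent that wanders for 100 steps gives up as much reward as it gains from one toucan.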

Special Mechanics:

  • The agent can throw at an angle of ~27° to hit a distant toucan.
  • Stochastic transitions:
    • Toucans spawn at random locations each episode, so throw outcomes vary from one episode to the next.


Training Details


Trainer: PPO
Max steps: 20,000,000
Checkpoint frequency: every 250,000 steps

Hyperparameters
Batch size: 1024
Buffer size: 102,400
Learning rate: 0.0001 (linear decay)
β (entropy regularization): 0.001
ε (PPO clip range): 0.2
λ (GAE): 0.99
Epochs per update: 3
Time horizon: 1000
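
A few quantities follow directly from the numbers above and may help when reading training logs. The sketch below derives them and illustrates one common form of linear learning-rate decay. The assumption that the rate decays toward zero at the maximum step count is mine; the card only says "linear decay", and the trainer may apply a small floor.

```python
# Values copied from the card.
MAX_STEPS = 20_000_000
CHECKPOINT_EVERY = 250_000
BATCH_SIZE = 1024
BUFFER_SIZE = 102_400
NUM_EPOCHS = 3
INITIAL_LR = 1e-4

# Derived quantities.
num_checkpoints = MAX_STEPS // CHECKPOINT_EVERY           # 80
minibatches_per_epoch = BUFFER_SIZE // BATCH_SIZE         # 100
gradient_steps_per_update = minibatches_per_epoch * NUM_EPOCHS  # 300


def linear_lr(step: int, min_lr: float = 0.0) -> float:
    """Linearly decayed learning rate at a given environment step.

    Assumption: decay runs from INITIAL_LR at step 0 to min_lr at
    MAX_STEPS; min_lr is a hypothetical floor, not a documented value.
    """
    progress = min(max(step / MAX_STEPS, 0.0), 1.0)
    return max(INITIAL_LR * (1.0 - progress), min_lr)
```

For example, halfway through training (10,000,000 steps) this schedule gives a learning rate of 0.00005.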

Network Settings
Hidden units: 256
Layers: 2 fully connected
Normalization: Enabled

Reward Signals
Extrinsic:
  γ = 0.99
  Strength = 1.0
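
To give a feel for what γ = 0.99 means in this setting, here is a small sketch of the discounted return and the rule-of-thumb effective horizon 1 / (1 − γ). The helper names are illustrative, and the example episode is hypothetical rather than measured from this model.

```python
from typing import Sequence

GAMMA = 0.99  # discount factor from the card


def discounted_return(rewards: Sequence[float], gamma: float = GAMMA) -> float:
    """Sum of gamma**t * r_t over one episode, computed back to front."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def effective_horizon(gamma: float = GAMMA) -> float:
    """Rule-of-thumb number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


# Hypothetical episode: 49 plain steps, then a step that scores a toucan.
example_rewards = [-0.01] * 49 + [0.99]
example_return = discounted_return(example_rewards)
```

With γ = 0.99 the effective horizon is roughly 100 steps, so a toucan that takes several hundred steps to reach contributes little to the value estimate at the start of an episode on a 24 × 24 grid.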

The policy has achieved partial competency: navigation is improved compared to the 12x12 baseline, but efficiency drops due to stochasticity. Throw usage is more adaptive but inconsistent.


Evaluation

  • Observed Behavior:
    The agent learns to adapt to uncertain movement and occasionally uses throw effectively. However, exploration in the larger state space remains inefficient.

  • Limitations:

    • Sensitivity to stochastic randomness in transitions.
    • Difficulty in scaling exploration to 24x24 grid.
    • Suboptimal throw frequency in high-uncertainty states.

Future Work

This model represents the second step in the curriculum learning experiment:

  1. Deterministic small grid (12x12)
  2. Stochastic larger grid (24x24)
  3. Planned: adversarial settings, dynamic rewards, multi-agent scenarios.
  4. Planned: better logging and reproducibility pipelines on GitHub.

Citation

If you use this model, please cite:

@misc{bendou2025grassworldppo,
  author       = {Ahmed El Mahdi BENDOU},
  title        = {PPO Agent trained on ToucanHush 24x24-GrassWorld Stochastic (Unity ML-Agents)},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/partzel/ToucanHush-24x24GrassWorldStochastic}},
}

Assets Pack

All assets were custom-made for this environment and are available for free.
