Shepherd Herding — Training & Inference
This directory holds the Gymnasium environment, PPO training script, and
evaluation harness for the RL shepherd-dog policy. The Webots controller
in controllers/shepherd_dog/ loads the resulting policy at inference
time when launched with HERDING_MODE=rl.
Layout
training/
├── herding_env.py # gymnasium.Env — the dog is the agent
├── train_ppo.py # SB3 PPO entry point (vec envs, eval, curriculum)
├── eval.py # rollout success-rate / time-to-pen across flock sizes
├── parity_test.py # smoke test: shapes, determinism, baseline rollout
├── configs/ppo_default.yaml
├── runs/ # tensorboard + checkpoints (gitignored)
└── requirements.txt
Setup
python -m venv .venv && source .venv/bin/activate
pip install -r training/requirements.txt
CPU is the default and also the recommended device — SB3's PPO with an
MLP policy of this size runs faster on CPU than on GPU because the
bottleneck is rollout collection, not gradient compute. The 16 SubprocVecEnv
workers saturate ~16 CPU cores. To force CUDA anyway, pass --device cuda.
Train
# Full curriculum (1 → 10 sheep), ~5M steps, ~2–3h on a single GPU.
python -m training.train_ppo \
--config training/configs/ppo_default.yaml \
--out-dir training/runs/baseline
Outputs:
- training/runs/baseline/best/best_model.zip: best eval checkpoint
- training/runs/baseline/best/vecnormalize.pkl: observation stats
- training/runs/baseline/checkpoints/ppo_*.zip: periodic checkpoints
- training/runs/baseline/tb/: TensorBoard logs (tensorboard --logdir)
To resume:
python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
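To resume from the most recent checkpoint without typing the step count by hand, a small helper can sort the files by the step number embedded in the name. This is a minimal sketch that assumes the ppo_<steps>_steps.zip naming shown above; latest_checkpoint is a hypothetical helper, not something train_ppo.py is known to provide.

```python
import re
from pathlib import Path
from typing import Optional

# Matches names like ppo_500000_steps.zip (the pattern shown above).
_CKPT_RE = re.compile(r"^ppo_(\d+)_steps\.zip$")


def latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step count, or None."""
    best_steps, best_path = -1, None
    for path in Path(ckpt_dir).glob("ppo_*.zip"):
        match = _CKPT_RE.match(path.name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        steps = int(match.group(1))
        if steps > best_steps:
            best_steps, best_path = steps, path
    return best_path


if __name__ == "__main__":
    ckpt = latest_checkpoint("training/runs/baseline/checkpoints")
    print(ckpt if ckpt is not None else "no checkpoint found")
```

The step count is compared as an integer because a plain string sort would rank ppo_500000_steps.zip above ppo_1000000_steps.zip.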
Evaluate
# RL policy
python -m training.eval --policy training/runs/baseline/best
# Strömbom baseline
python -m training.eval --policy strombom
Prints success rate, mean steps, and mean penned-count per flock size.
Use the same --n-seeds for both to get a fair RL-vs-Strömbom A/B.
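The per-flock-size summary can be pictured as a simple group-and-average over episode records. The sketch below is illustrative only: the Episode record and its field names are assumptions, and eval.py may organize its results differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass
class Episode:
    # Hypothetical record; field names are illustrative, not from eval.py.
    n_sheep: int
    success: bool
    steps: int
    penned: int


def summarize(episodes: Iterable[Episode]) -> Dict[int, Dict[str, float]]:
    """Group episodes by flock size and average the three reported metrics."""
    by_size = defaultdict(list)
    for ep in episodes:
        by_size[ep.n_sheep].append(ep)
    return {
        n: {
            "success_rate": mean(1.0 if ep.success else 0.0 for ep in eps),
            "mean_steps": mean(ep.steps for ep in eps),
            "mean_penned": mean(ep.penned for ep in eps),
        }
        for n, eps in sorted(by_size.items())
    }
```

Running both policies over the same seeds means each flock size is averaged over the same set of initial conditions, so differences in the table reflect the policy rather than the draw of starting layouts.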
Parity / smoke test
python -m training.parity_test
Checks observation/action shapes, deterministic seeding, the curriculum sampler, and a 400-step Strömbom rollout. Run this before every long training job — catches the boring class of bugs in seconds.
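As an illustration of what a deterministic-seeding check looks like, the sketch below rolls out the same seed twice and compares the observation traces. ToyEnv is a stand-in with a Gymnasium-style reset/step signature, not HerdingEnv, and the helper names are hypothetical; parity_test.py may implement its check differently.

```python
import random
from typing import Callable, List, Tuple


class ToyEnv:
    """Stand-in with a Gymnasium-style reset/step API (not HerdingEnv)."""

    def reset(self, seed=None):
        self._rng = random.Random(seed)
        self._x = self._rng.uniform(-1.0, 1.0)
        return (self._x,), {}

    def step(self, action):
        # All randomness comes from the seeded generator created in reset().
        self._x += action + self._rng.gauss(0.0, 0.01)
        terminated = abs(self._x) > 5.0
        return (self._x,), -abs(self._x), terminated, False, {}


def rollout(make_env: Callable[[], object], seed: int, n_steps: int) -> List[Tuple]:
    """Collect the observation trace for a fixed action sequence."""
    env = make_env()
    obs, _ = env.reset(seed=seed)
    trace = [obs]
    for _ in range(n_steps):
        obs, _reward, terminated, truncated, _ = env.step(0.1)  # fixed action
        trace.append(obs)
        if terminated or truncated:
            break
    return trace


def is_deterministic(make_env, seed: int = 0, n_steps: int = 50) -> bool:
    """True if two fresh environments with the same seed produce equal traces."""
    return rollout(make_env, seed, n_steps) == rollout(make_env, seed, n_steps)
```

A check like this fails when an environment draws from a global random source instead of the generator seeded in reset().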
Run the policy in Webots
- Train (above). This produces training/runs/<name>/best/.
- In Webots, set the dog controller's environment variables:
  export HERDING_MODE=rl
  export HERDING_POLICY_DIR=$(pwd)/training/runs/baseline/best
  webots worlds/field.wbt
  Or set them via Webots' controller args / a .wbproj if you prefer.
- To force the Strömbom baseline (same world, same controller):
  export HERDING_MODE=strombom
  webots worlds/field.wbt
If HERDING_MODE=rl but the policy can't be loaded (SB3 not installed,
zip missing, etc.), the controller logs the error and falls back to
Strömbom automatically.
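The fallback behaviour described above can be sketched as follows. This is not the controller's actual code: select_mode and the injected load_policy callable are hypothetical, and the default mode used when HERDING_MODE is unset is an assumption.

```python
import logging
import os
from typing import Callable, Mapping, Optional

log = logging.getLogger("shepherd_dog")


def select_mode(
    load_policy: Callable[[str], object],
    env: Optional[Mapping[str, str]] = None,
):
    """Return (mode, policy); policy is None when Strömbom is used."""
    env = os.environ if env is None else env
    mode = env.get("HERDING_MODE", "strombom")  # default is an assumption
    if mode != "rl":
        return "strombom", None
    try:
        policy = load_policy(env.get("HERDING_POLICY_DIR", ""))
    except Exception as exc:  # missing SB3, missing zip, unreadable stats, ...
        log.error("RL policy load failed, falling back to Strömbom: %s", exc)
        return "strombom", None
    return "rl", policy
```

Passing the loader in as an argument keeps the SB3 import out of the decision logic, so the fallback path can be exercised on a machine where SB3 is not installed.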
Curriculum knobs
The default schedule in configs/ppo_default.yaml widens
max_n_sheep over training. Each reset samples n_sheep ~ U[1, max_n_sheep],
so by the end of training the policy has seen every flock size from 1 to
10. Because the range widens over time, smaller flocks are sampled more
often overall than larger ones. To pin a specific size, instantiate the
env with HerdingEnv(n_sheep=N) (see eval.py).
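A widening schedule combined with uniform sampling can be sketched in a few lines. The step thresholds in SCHEDULE are invented for illustration and are not the values in ppo_default.yaml; the function names are likewise hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical schedule: (training step at which the stage starts, max_n_sheep).
SCHEDULE: List[Tuple[int, int]] = [
    (0, 1),
    (500_000, 3),
    (1_500_000, 6),
    (3_000_000, 10),
]


def max_n_sheep_at(step: int, schedule: List[Tuple[int, int]]) -> int:
    """Return max_n_sheep for the latest stage whose start step has been reached."""
    current = schedule[0][1]
    for start, max_n in schedule:  # schedule is sorted by start step
        if step >= start:
            current = max_n
    return current


def sample_n_sheep(rng: random.Random, max_n_sheep: int) -> int:
    """Draw n_sheep uniformly from 1..max_n_sheep, both ends inclusive."""
    return rng.randint(1, max_n_sheep)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 600_000, 4_000_000):
        cap = max_n_sheep_at(step, SCHEDULE)
        print(step, cap, sample_n_sheep(rng, cap))
```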
Reward shaping
Weights live in class attributes on HerdingEnv. Tune on the 1-sheep
stage of the curriculum first: if the dog can't herd a single sheep
cleanly, raising W_PROGRESS or lowering W_TIME is usually the fix. For
multi-sheep failure modes (for example, the dog spinning between sheep),
increase W_COMPACT so that tightening the flock is rewarded.
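To make the role of each weight concrete, here is one plausible shape for a weighted per-step reward. It is a sketch under stated assumptions: the formula, the progress and compaction terms, and the numeric values are all illustrative, and only the names W_PROGRESS, W_TIME, and W_COMPACT come from this README.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Placeholder values, not the ones defined on HerdingEnv.
    W_PROGRESS: float = 1.0
    W_TIME: float = 0.01
    W_COMPACT: float = 0.1


def shaped_reward(
    w: RewardWeights,
    prev_dist: float,
    dist: float,
    prev_spread: float,
    spread: float,
) -> float:
    """Assumed per-step reward: progress plus compaction minus a time penalty."""
    progress = prev_dist - dist        # > 0 when the flock moves toward the pen
    compaction = prev_spread - spread  # > 0 when the flock tightens
    return w.W_PROGRESS * progress + w.W_COMPACT * compaction - w.W_TIME
```

Under this form, raising W_COMPACT makes a step that tightens the flock worth more even when it yields no progress toward the pen, which is the lever the multi-sheep advice relies on.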