Shepherd Herding — Training & Inference

This directory holds the Gymnasium environment, PPO training script, and evaluation harness for the RL shepherd-dog policy. The Webots controller in controllers/shepherd_dog/ loads the resulting policy at inference time when launched with HERDING_MODE=rl.

Layout

training/
├── herding_env.py        # gymnasium.Env — the dog is the agent
├── train_ppo.py          # SB3 PPO entry point (vec envs, eval, curriculum)
├── eval.py               # rollout success-rate / time-to-pen across flock sizes
├── parity_test.py        # smoke test: shapes, determinism, baseline rollout
├── configs/ppo_default.yaml
├── runs/                 # tensorboard + checkpoints (gitignored)
└── requirements.txt

Setup

python -m venv .venv && source .venv/bin/activate
pip install -r training/requirements.txt

CPU is the default and also the recommended device — SB3's PPO with an MLP policy of this size runs faster on CPU than on GPU because the bottleneck is rollout collection, not gradient compute. The 16 SubprocVecEnv workers saturate ~16 CPU cores. To force CUDA anyway, pass --device cuda.

Train

# Full curriculum (1 → 10 sheep), ~5M steps, roughly 2–3h wall-clock.
python -m training.train_ppo \
    --config training/configs/ppo_default.yaml \
    --out-dir training/runs/baseline

Outputs:

training/runs/baseline/best/best_model.zip — best eval checkpoint
training/runs/baseline/best/vecnormalize.pkl — observation stats
training/runs/baseline/checkpoints/ppo_*.zip — periodic checkpoints
training/runs/baseline/tb/ — TensorBoard logs (tensorboard --logdir)
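
The observation stats in vecnormalize.pkl matter at inference time: a policy trained on normalized observations expects the same transform when it is deployed. Here is a minimal sketch of that transform, assuming the stats are a per-dimension running mean and variance (which is how SB3's VecNormalize keeps them). The numbers are illustrative and not taken from a real run.

```python
import math

def normalize_obs(obs, mean, var, clip=10.0, eps=1e-8):
    """Standardize each observation dimension, then clip.

    Follows the shape of SB3's VecNormalize transform; the clip and eps
    values are assumed defaults used here for illustration.
    """
    out = []
    for x, m, v in zip(obs, mean, var):
        z = (x - m) / math.sqrt(v + eps)
        out.append(max(-clip, min(clip, z)))
    return out

# Illustrative stats, not read from an actual vecnormalize.pkl.
mean = [0.0, 5.0]
var = [1.0, 4.0]
normalized = normalize_obs([1.0, 9.0], mean, var)
```

If the stats file is missing or stale, the policy sees inputs on a different scale from training, which usually shows up as erratic behaviour rather than a load error.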

To resume:

python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
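
To pick the most recent checkpoint without typing the step count by hand, a small helper along these lines may be useful. It assumes the ppo_<steps>_steps.zip naming shown in the resume command; `latest_checkpoint` is a hypothetical name, not something train_ppo.py provides.

```python
import re
from pathlib import Path

# Filename pattern inferred from the resume example: ppo_<steps>_steps.zip
_CKPT_RE = re.compile(r"^ppo_(\d+)_steps\.zip$")

def latest_checkpoint(ckpt_dir):
    """Return the checkpoint Path with the highest step count, or None."""
    best_steps, best_path = -1, None
    for path in Path(ckpt_dir).glob("ppo_*_steps.zip"):
        match = _CKPT_RE.match(path.name)
        if match and int(match.group(1)) > best_steps:
            best_steps, best_path = int(match.group(1)), path
    return best_path
```

The step count is compared numerically because a plain string sort would place ppo_1000000_steps.zip before ppo_500000_steps.zip.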

Evaluate

# RL policy
python -m training.eval --policy training/runs/baseline/best

# Strömbom baseline
python -m training.eval --policy strombom

Prints success rate, mean steps, and mean penned-count per flock size. Use the same --n-seeds for both to get a fair RL-vs-Strömbom A/B.
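
As a rough illustration of how these three numbers can be aggregated, the sketch below groups per-episode results by flock size. The record keys (n_sheep, success, steps, penned) and the `summarize` function are hypothetical; eval.py may structure its results differently.

```python
from collections import defaultdict
from statistics import mean

def summarize(episodes):
    """Aggregate per-episode results by flock size.

    Each episode is a dict with hypothetical keys:
    n_sheep (int), success (bool), steps (int), penned (int).
    """
    by_size = defaultdict(list)
    for ep in episodes:
        by_size[ep["n_sheep"]].append(ep)
    summary = {}
    for n_sheep, eps in sorted(by_size.items()):
        summary[n_sheep] = {
            "success_rate": mean([1.0 if ep["success"] else 0.0 for ep in eps]),
            "mean_steps": mean([ep["steps"] for ep in eps]),
            "mean_penned": mean([ep["penned"] for ep in eps]),
            "episodes": len(eps),
        }
    return summary

# Made-up episodes, only to show the input and output shapes.
episodes = [
    {"n_sheep": 1, "success": True, "steps": 100, "penned": 1},
    {"n_sheep": 1, "success": False, "steps": 400, "penned": 0},
    {"n_sheep": 3, "success": True, "steps": 250, "penned": 3},
]
summary = summarize(episodes)
```

Keeping the episode count in the summary makes it easy to confirm that both policies were evaluated on the same number of seeds before comparing them.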

Parity / smoke test

python -m training.parity_test

Checks observation/action shapes, deterministic seeding, the curriculum sampler, and a 400-step Strömbom rollout. Run this before every long training job — catches the boring class of bugs in seconds.

Run the policy in Webots

Train a policy first (see Train above); this produces training/runs/<name>/best/.
Before launching Webots, set the dog controller's environment variables in the same shell:
```
export HERDING_MODE=rl
export HERDING_POLICY_DIR=$(pwd)/training/runs/baseline/best
webots worlds/field.wbt
```
Or set them via Webots' controller args / a .wbproj if you prefer.
To force the Strömbom baseline (same world, same controller):
```
export HERDING_MODE=strombom
webots worlds/field.wbt
```

If HERDING_MODE=rl but the policy can't be loaded (SB3 not installed, zip missing, etc.), the controller logs the error and falls back to Strömbom automatically.
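
The fallback described above can be sketched as follows. This is not the controller's actual code: `select_policy`, the default mode when HERDING_MODE is unset, and the log message are assumptions. Only the environment variable names and best_model.zip come from this README.

```python
import os
from pathlib import Path

def select_policy(env=None):
    """Return (mode, model); fall back to strombom if the RL policy fails to load."""
    env = os.environ if env is None else env
    # Assumption: an unset HERDING_MODE means the Strombom baseline.
    if env.get("HERDING_MODE", "strombom") != "rl":
        return "strombom", None
    try:
        zip_path = Path(env["HERDING_POLICY_DIR"]) / "best_model.zip"
        if not zip_path.is_file():
            raise FileNotFoundError(zip_path)
        # Imported lazily so a missing SB3 install is caught below.
        from stable_baselines3 import PPO
        return "rl", PPO.load(str(zip_path), device="cpu")
    except Exception as exc:
        print(f"[shepherd_dog] RL policy unavailable ({exc!r}); using strombom")
        return "strombom", None
```

Catching a broad Exception is deliberate in this sketch: a missing variable, a missing file, an import failure, and a corrupt zip should all end in the same safe state.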

Curriculum knobs

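The default schedule in configs/ppo_default.yaml widens max_n_sheep over training. Each reset samples an integer n_sheep uniformly from 1 to the current max_n_sheep, so by the end of training the policy has seen every flock size from 1 to 10. Smaller flocks are eligible from the first stage, so they are sampled more often overall than larger ones. To pin a specific size, instantiate the env with HerdingEnv(n_sheep=N) (see eval.py).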
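
A minimal sketch of such a sampler is shown below. The schedule breakpoints are made up for illustration (the real ones live in configs/ppo_default.yaml), and `max_n_sheep_at` and `sample_n_sheep` are hypothetical names.

```python
import random

# Hypothetical schedule: (timestep threshold, max_n_sheep).
# The actual breakpoints in ppo_default.yaml may differ.
SCHEDULE = [(0, 1), (1_000_000, 3), (2_500_000, 6), (4_000_000, 10)]

def max_n_sheep_at(step, schedule=SCHEDULE):
    """Largest flock size unlocked at the given training step."""
    current = schedule[0][1]
    for threshold, max_n in schedule:
        if step >= threshold:
            current = max_n
    return current

def sample_n_sheep(step, rng):
    """Draw an integer flock size uniformly from 1..max_n_sheep."""
    return rng.randint(1, max_n_sheep_at(step))

rng = random.Random(0)
sizes = [sample_n_sheep(3_000_000, rng) for _ in range(5)]
```

Under this hypothetical schedule, every draw at step 3,000,000 falls between 1 and 6, and a flock of 10 only becomes possible after step 4,000,000.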

Reward shaping

Weights live in class attributes on HerdingEnv. Start tuning on the 1-sheep curriculum stage: if the dog can't herd a single sheep cleanly, raising W_PROGRESS or lowering W_TIME is usually the fix. For multi-sheep collapse modes (for example, the dog spinning between sheep), increase W_COMPACT so that tightening the flock pays.
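
To make the trade-off concrete, here is a sketch of a weighted-sum shaped reward. Only the weight names come from HerdingEnv; the values, the two shaping terms, and the exact form of the sum are assumptions for illustration, so check herding_env.py for the real definition.

```python
class RewardWeights:
    # Illustrative values only; the real ones are class attributes on HerdingEnv.
    W_PROGRESS = 1.0
    W_TIME = 0.01
    W_COMPACT = 0.1

def shaped_reward(progress, compaction, w=RewardWeights):
    """Weighted sum of per-step shaping terms minus a constant time penalty.

    progress:   decrease in flock-to-pen distance this step (hypothetical term)
    compaction: decrease in flock spread this step (hypothetical term)
    """
    return w.W_PROGRESS * progress + w.W_COMPACT * compaction - w.W_TIME

class PatientWeights(RewardWeights):
    # Lower time penalty: slow but steady herding is punished less.
    W_TIME = 0.001
```

In this form, a step with no progress and no compaction earns exactly -W_TIME, which is why an overly large W_TIME can swamp the progress signal early in training.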
