Checkpoint 2

2026-05-07 22:00:10 +01:00
parent 90aa3bbcb4
commit 1bb9415414
37 changed files with 3068 additions and 2912 deletions
@@ -0,0 +1,115 @@
+# Shepherd Herding — Training & Inference
+
+This directory holds the Gymnasium environment, PPO training script, and
+evaluation harness for the RL shepherd-dog policy. The Webots controller
+in `controllers/shepherd_dog/` loads the resulting policy at inference
+time when launched with `HERDING_MODE=rl`.
+
+## Layout
+
+```
+training/
+├── herding_env.py        # gymnasium.Env — the dog is the agent
+├── train_ppo.py          # SB3 PPO entry point (vec envs, eval, curriculum)
+├── eval.py               # rollout success-rate / time-to-pen across flock sizes
+├── parity_test.py        # smoke test: shapes, determinism, baseline rollout
+├── configs/ppo_default.yaml
+├── runs/                 # tensorboard + checkpoints (gitignored)
+└── requirements.txt
+```
+
+## Setup
+
+```bash
+python -m venv .venv && source .venv/bin/activate
+pip install -r training/requirements.txt
+```
+
+CPU is the default and also the recommended device — SB3's PPO with an
+MLP policy of this size runs faster on CPU than on GPU because the
+bottleneck is rollout collection, not gradient compute. The 16 SubprocVecEnv
+workers saturate ~16 CPU cores. To force CUDA anyway, pass `--device cuda`.
+
+## Train
+
+```bash
+# Full curriculum (1 → 10 sheep), ~5M steps, ~2–3h on the default CPU device.
+python -m training.train_ppo \
+    --config training/configs/ppo_default.yaml \
+    --out-dir training/runs/baseline
+```
+
+Outputs:
+- `training/runs/baseline/best/best_model.zip` — best eval checkpoint
+- `training/runs/baseline/best/vecnormalize.pkl` — observation stats
+- `training/runs/baseline/checkpoints/ppo_*.zip` — periodic checkpoints
+- `training/runs/baseline/tb/` — TensorBoard logs (`tensorboard --logdir training/runs/baseline/tb`)
+
+To resume:
+
+```bash
+python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
+```
+
+## Evaluate
+
+```bash
+# RL policy
+python -m training.eval --policy training/runs/baseline/best
+
+# Strömbom baseline
+python -m training.eval --policy strombom
+```
+
+Prints success rate, mean steps, and mean penned-count per flock size.
+Use the same `--n-seeds` for both to get a fair RL-vs-Strömbom A/B.
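
The per-flock-size summary described above could be computed along these lines. This is a minimal sketch, not the actual `eval.py` internals: the `Episode` record, the `summarize` helper, and the rule that an episode succeeds only when every sheep is penned are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """Hypothetical per-rollout record; eval.py may store results differently."""
    n_sheep: int
    steps: int
    penned: int


def summarize(episodes):
    """Group rollouts by flock size and compute the three reported metrics."""
    by_size = defaultdict(list)
    for ep in episodes:
        by_size[ep.n_sheep].append(ep)
    summary = {}
    for n, eps in sorted(by_size.items()):
        summary[n] = {
            # Assumption: an episode succeeds when every sheep is penned.
            "success_rate": mean(1.0 if ep.penned == ep.n_sheep else 0.0 for ep in eps),
            "mean_steps": mean(ep.steps for ep in eps),
            "mean_penned": mean(ep.penned for ep in eps),
        }
    return summary
```

Running the RL policy and the Strömbom baseline over the same seed list and feeding both result sets through one such function keeps the comparison like-for-like.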
+
+## Parity / smoke test
+
+```bash
+python -m training.parity_test
+```
+
+Checks observation/action shapes, deterministic seeding, the curriculum
+sampler, and a 400-step Strömbom rollout. Run this before every long
+training job: it catches the boring class of bugs in seconds.
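
A determinism check of the kind mentioned above might look like the following sketch. `ToyEnv` is a stand-in with a Gymnasium-style `reset(seed=...)` / `step(action)` interface, used here so the example runs without the project installed; it is not `HerdingEnv`, and `rollout` / `is_deterministic` are hypothetical helper names rather than functions from `parity_test.py`.

```python
import random


class ToyEnv:
    """Stand-in with a Gymnasium-style reset/step interface (not HerdingEnv)."""

    def reset(self, seed=None):
        # All randomness flows from one seeded generator owned by the env.
        self._rng = random.Random(seed)
        self._pos = self._rng.uniform(-1.0, 1.0)
        return self._pos, {}

    def step(self, action):
        self._pos += action + self._rng.gauss(0.0, 0.1)
        terminated = abs(self._pos) > 5.0
        return self._pos, 0.0, terminated, False, {}


def rollout(env, seed, n_steps=400):
    """Collect the observation trace of a fixed-action rollout."""
    obs, _ = env.reset(seed=seed)
    trace = [obs]
    for _ in range(n_steps):
        obs, _, terminated, truncated, _ = env.step(0.01)
        trace.append(obs)
        if terminated or truncated:
            break
    return trace


def is_deterministic(make_env, seed=0):
    """Two fresh envs reset with the same seed must produce identical traces."""
    return rollout(make_env(), seed) == rollout(make_env(), seed)
```

Comparing full traces rather than only the first observation also catches hidden global RNG use inside `step`.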
+
+## Run the policy in Webots
+
+1. Train (above) — produces `training/runs/<name>/best/`.
+2. In the shell that launches Webots, set the environment variables the dog controller reads:
+
+   ```bash
+   export HERDING_MODE=rl
+   export HERDING_POLICY_DIR=$(pwd)/training/runs/baseline/best
+   webots worlds/field.wbt
+   ```
+
+   Or set them via Webots' controller args / a `.wbproj` if you prefer.
+
+3. To force the Strömbom baseline (same world, same controller):
+
+   ```bash
+   export HERDING_MODE=strombom
+   webots worlds/field.wbt
+   ```
+
+If `HERDING_MODE=rl` but the policy can't be loaded (SB3 not installed,
+zip missing, etc.), the controller logs the error and falls back to
+Strömbom automatically.
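
The fallback behaviour described above can be sketched roughly as follows. The function name `select_policy`, the two callables it takes, and the choice of Strömbom as the default when `HERDING_MODE` is unset are assumptions for illustration; the real controller in `controllers/shepherd_dog/` may be structured differently.

```python
import logging
import os

log = logging.getLogger("shepherd_dog")


def select_policy(load_rl_policy, strombom_policy, env=os.environ):
    """Return (mode, policy), falling back to Strömbom if RL loading fails.

    `load_rl_policy` and `strombom_policy` are hypothetical callables supplied
    by the caller; they stand in for whatever the controller actually uses.
    """
    # Assumption: an unset HERDING_MODE means the Strömbom baseline.
    mode = env.get("HERDING_MODE", "strombom")
    if mode != "rl":
        return "strombom", strombom_policy
    try:
        policy = load_rl_policy(env.get("HERDING_POLICY_DIR", ""))
    except Exception as exc:  # e.g. ImportError (no SB3), FileNotFoundError (no zip)
        log.error("RL policy unavailable, falling back to Strömbom: %s", exc)
        return "strombom", strombom_policy
    return "rl", policy
```

Returning the effective mode alongside the policy lets the caller log which behaviour is actually running, which matters when a fallback happens silently in a long simulation.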
+
+## Curriculum knobs
+
+The default schedule in `configs/ppo_default.yaml` widens
+`max_n_sheep` over training. Each reset samples `n_sheep ~ U[1,
+max_n_sheep]`, so the final policy has seen every flock size from 1 to
+10, though not equally often: small flocks are available from the first
+step, while large ones only appear once the cap has widened. To pin a
+specific size, instantiate the env with `HerdingEnv(n_sheep=N)` (see
+`eval.py`).
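
To make the sampling rule concrete, here is a small sketch of a widening cap combined with a uniform draw per reset. The `SCHEDULE` breakpoints are invented for illustration (the real ones live in `configs/ppo_default.yaml`), and `max_n_sheep_at` / `sample_n_sheep` are hypothetical helper names.

```python
import random
from collections import Counter

# Hypothetical schedule: (fraction of training elapsed, max_n_sheep).
# The real values live in configs/ppo_default.yaml and may differ.
SCHEDULE = [(0.0, 1), (0.25, 3), (0.5, 6), (0.75, 10)]


def max_n_sheep_at(progress):
    """Largest flock size unlocked at a given training progress in [0, 1]."""
    current = SCHEDULE[0][1]
    for start, cap in SCHEDULE:
        if progress >= start:
            current = cap
    return current


def sample_n_sheep(progress, rng):
    """Draw n_sheep ~ U[1, max_n_sheep] for one reset (both ends inclusive)."""
    return rng.randint(1, max_n_sheep_at(progress))


# Simulate 10,000 resets spread evenly over training and count flock sizes.
rng = random.Random(0)
counts = Counter(sample_n_sheep(i / 10_000, rng) for i in range(10_000))
```

Under a schedule shaped like this, `counts` is heavily skewed toward small flocks, which is worth keeping in mind when reading per-size evaluation results.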
+
+## Reward shaping
+
+Weights live in class attributes on `HerdingEnv`. Tune on the 1-sheep
+stage of the curriculum first: if the dog can't herd a single sheep
+cleanly, raising `W_PROGRESS` or lowering `W_TIME` is usually the fix.
+For multi-sheep failure modes (e.g. the dog spins between sheep),
+increase `W_COMPACT` so that tightening the flock pays off.
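
As an illustration of how three such weights might interact, the sketch below combines a progress term, a compaction term, and a per-step time penalty. Only the names `W_PROGRESS`, `W_TIME` and `W_COMPACT` come from the README; the numeric values, the form of each term, and the `shaped_reward` helper are assumptions, and the actual reward in `HerdingEnv` may be defined differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical values; the real weights are class attributes on HerdingEnv."""
    W_PROGRESS: float = 1.0
    W_TIME: float = 0.01
    W_COMPACT: float = 0.1


def shaped_reward(prev_dist, dist, prev_spread, spread, w=RewardWeights()):
    """Per-step reward built from three terms.

    dist   -- mean flock distance to the pen (lower is better)
    spread -- mean sheep distance to the flock centroid (lower is better)
    """
    progress = prev_dist - dist        # positive when the flock moves toward the pen
    compaction = prev_spread - spread  # positive when the flock tightens
    return w.W_PROGRESS * progress + w.W_COMPACT * compaction - w.W_TIME
```

In a form like this, raising `W_COMPACT` rewards gathering the flock even on steps where it makes no progress toward the pen, which is the lever suggested above for the spinning failure mode.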