Checkpoint 3

2026-05-10 12:46:14 +01:00
parent 1bb9415414
commit 2a6db038df
16 changed files with 305 additions and 662 deletions
@@ -1,115 +1,96 @@
-# Shepherd Herding — Training & Inference
+# Training pipeline

-This directory holds the Gymnasium environment, PPO training script, and
-evaluation harness for the RL shepherd-dog policy. The Webots controller
-in `controllers/shepherd_dog/` loads the resulting policy at inference
-time when launched with `HERDING_MODE=rl`.
+Behavior cloning (BC) of analytic herding teachers into a neural-network
+policy that runs in Webots. PPO from scratch and PPO fine-tuning of the BC
+policy were tried earlier and are kept in `train_ppo.py` as experimental
+options, but only the BC route is shipped.

-## Layout
+## Files

 ```
-training/
-├── herding_env.py        # gymnasium.Env — the dog is the agent
-├── train_ppo.py          # SB3 PPO entry point (vec envs, eval, curriculum)
-├── eval.py               # rollout success-rate / time-to-pen across flock sizes
-├── parity_test.py        # smoke test: shapes, determinism, baseline rollout
-├── configs/ppo_default.yaml
-├── runs/                 # tensorboard + checkpoints (gitignored)
-└── requirements.txt
+herding_env.py     — Gymnasium env (used for demo collection + eval)
+bc_pretrain.py     — supervised MSE+cosine training of an SB3 MlpPolicy
+                     against (obs, action) demos
+eval.py            — analytic teachers + BC policies, full n=1..10 grid
+parity_test.py     — shape/determinism/baseline smoke test
+train_ppo.py       — PPO trainer (experimental — see Appendix below)
+configs/           — PPO hyperparameter YAML
+runs/              — checkpoints (.gitignored)
 ```

 ## Setup

-```bash
-python -m venv .venv && source .venv/bin/activate
-pip install -r training/requirements.txt
+```
+pip install -r requirements.txt
 ```

-CPU is the default and also the recommended device — SB3's PPO with an
-MLP policy of this size runs faster on CPU than on GPU because the
-bottleneck is rollout collection, not gradient compute. The 16 SubprocVecEnv
-workers saturate ~16 CPU cores. To force CUDA anyway, pass `--device cuda`.
+CPU is the default and recommended device — SB3 PPO with an MLP policy
+of this size runs faster on CPU than GPU because the bottleneck is
+rollout collection, not gradient compute.

-## Train
+## The BC pipeline

-```bash
-# Full curriculum (1 → 10 sheep), ~5M steps, ~2–3h on a single GPU.
-python -m training.train_ppo \
-    --config training/configs/ppo_default.yaml \
-    --out-dir training/runs/baseline
+```
+# 1. Generate demos from an analytic teacher.
+#    --teacher: strombom (default), sequential, drive_only, hybrid, strombom_smooth
+python -m tools.collect_demos --teacher strombom \
+    --out demos.npz --seeds-per-n 30 --subsample 3
+
+# 2. Behavior-clone the demos into an MLP policy.
+python -m training.bc_pretrain --demos demos.npz \
+    --out runs/bc_flock --epochs 100 --net-arch 512,512
+
+# 3. Evaluate the resulting policy.
+python -m training.eval --policy runs/bc_flock \
+    --max-flock 10 --max-steps 30000 --n-seeds 5
 ```

-Outputs:
- `training/runs/baseline/best/best_model.zip` — best eval checkpoint
- `training/runs/baseline/best/vecnormalize.pkl` — observation stats
- `training/runs/baseline/checkpoints/ppo_*.zip` — periodic checkpoints
- `training/runs/baseline/tb/` — TensorBoard logs (`tensorboard --logdir`)
+Wall time: ~10 min demos + ~5 min BC training + ~5 min eval.

-To resume:
+`bc_pretrain.py` saves the snapshot with the best `val_cos`, not the final
+epoch: multi-modal teachers (Strömbom's collect/drive switch) make
+training noisy, and the last epoch is often worse than an earlier one.

-```bash
-python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
+## Available analytic teachers
+
+| Name | What it does | Best for |
+|---|---|---|
+| `strombom` | Canonical Strömbom: collect when the flock is scattered, otherwise drive its center of mass (CoM) | Tight-cohesion regime, n=1-10 |
+| `sequential` | Pick the sheep closest to the pen and drive only it | Loose-cohesion regime, n=1-10 |
+| `drive_only` | Strömbom drive without collect mode (continuous action) | Easier-to-BC alternative; less reliable than full Strömbom |
+| `hybrid` | Drive rearmost sheep when far, switch to closest near gate | Failed experiment, kept for write-up |
+| `strombom_smooth` | Sigmoid-blended Strömbom collect↔drive | Failed experiment |
+
+## Evaluating the analytic teachers directly
+
+```
+python -m training.eval --policy strombom    --max-flock 10 --max-steps 30000 --n-seeds 5
+python -m training.eval --policy sequential  --max-flock 10 --max-steps 30000 --n-seeds 5
 ```

-## Evaluate
+## Webots inference

-```bash
-# RL policy
-python -m training.eval --policy training/runs/baseline/best
+The Webots dog controller (`controllers/shepherd_dog/shepherd_dog.py`)
+loads a saved BC zip when launched in `rl` mode:

-# Strömbom baseline
-python -m training.eval --policy strombom
+```
+HERDING_POLICY_DIR=$PWD/runs/bc_flock tools/run_webots.sh 10 rl
 ```

-Prints success rate, mean steps, and mean penned-count per flock size.
-Use the same `--n-seeds` for both to get a fair RL-vs-Strömbom A/B.
+It auto-discovers a checkpoint named `policy.zip`, `best_model.zip`, or
+`final.zip` in the directory.

-## Parity / smoke test
+## Appendix — experimental PPO scripts

-```bash
-python -m training.parity_test
-```
+`train_ppo.py` contains the PPO/RL pipeline tried before BC:
+* PPO from scratch with curriculum learning over flock size + spawn area.
+* PPO fine-tune of a BC checkpoint.

-Checks observation/action shapes, deterministic seeding, the curriculum
-sampler, and a 400-step Strömbom rollout. Run this before every long
-training job — catches the boring class of bugs in seconds.
+Both ran into stability issues. PPO from scratch sees pen events too
+rarely during random exploration to credit-assign the +500 done bonus.
+In the fine-tune, PPO's exploration noise destroys the BC weights
+faster than the reward signal can rebuild them.

-## Run the policy in Webots
-
-1. Train (above) — produces `training/runs/<name>/best/`.
-2. In Webots, set the dog controller's environment variables:
-
-   ```bash
-   export HERDING_MODE=rl
-   export HERDING_POLICY_DIR=$(pwd)/training/runs/baseline/best
-   webots worlds/field.wbt
-   ```
-
-   Or set them via Webots' controller args / a `.wbproj` if you prefer.
-
-3. To force the Strömbom baseline (same world, same controller):
-
-   ```bash
-   export HERDING_MODE=strombom
-   webots worlds/field.wbt
-   ```
-
-If `HERDING_MODE=rl` but the policy can't be loaded (SB3 not installed,
-zip missing, etc.), the controller logs the error and falls back to
-Strömbom automatically.
-
-## Curriculum knobs
-
-The default schedule in `configs/ppo_default.yaml` widens
-`max_n_sheep` over training. Each reset samples `n_sheep ~ U[1,
-max_n_sheep]`, so the final policy has seen every flock size from 1 to
-10 in proportion. To pin a specific size, instantiate the env with
-`HerdingEnv(n_sheep=N)` (see `eval.py`).
-
-## Reward shaping
-
-Weights live in class attributes on `HerdingEnv`. Tune from the 1-sheep
-curriculum first — if the dog can't herd a single sheep cleanly, raising
-`W_PROGRESS` or lowering `W_TIME` is usually the fix. For multi-sheep
-collapse modes (dog spins between sheep), increase `W_COMPACT` so
-tightening the flock pays.
+The script is left in place because the abstractions are sound and the
+code is reusable for follow-up work (e.g. KL-regularised fine-tuning
+with a frozen reference policy). It is not part of the deliverable pipeline.
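
The appendix in the README diff above names KL-regularised fine-tuning against a frozen reference policy as possible follow-up work. The sketch below is not part of this commit. It only illustrates the penalty term such a scheme would typically add, assuming both policies output diagonal Gaussian action distributions. The function names, the `beta` coefficient, and the example numbers are hypothetical.

```python
import math
from typing import Sequence


def diag_gaussian_kl(
    mean_p: Sequence[float],
    std_p: Sequence[float],
    mean_q: Sequence[float],
    std_q: Sequence[float],
) -> float:
    """KL(p || q) for two diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        # Closed form per dimension: log(sq/sp) + (sp^2 + (mp-mq)^2) / (2 sq^2) - 1/2
        kl += math.log(sq / sp) + (sp**2 + (mp - mq) ** 2) / (2.0 * sq**2) - 0.5
    return kl


def kl_regularised_loss(ppo_loss: float, kl_to_reference: float, beta: float = 0.1) -> float:
    """Hypothetical objective: PPO loss plus a penalty for drifting from the frozen BC policy."""
    return ppo_loss + beta * kl_to_reference


# Example (made-up numbers): the fine-tuned policy p has drifted by 1.0 in the
# first action dimension relative to the frozen reference q; stds are equal.
kl = diag_gaussian_kl([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
print(kl)
print(kl_regularised_loss(ppo_loss=2.0, kl_to_reference=kl, beta=0.1))
```

In a scheme like this, a larger `beta` would keep the fine-tuned policy closer to the BC reference at the cost of slower reward-driven improvement, so the value would need tuning.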