Checkpoint 3

2026-05-10 12:46:14 +01:00
parent 1bb9415414
commit 2a6db038df
16 changed files with 305 additions and 662 deletions
@@ -1,115 +1,96 @@
-# Shepherd Herding — Training & Inference
+# Training pipeline

-This directory holds the Gymnasium environment, PPO training script, and
-evaluation harness for the RL shepherd-dog policy. The Webots controller
-in `controllers/shepherd_dog/` loads the resulting policy at inference
-time when launched with `HERDING_MODE=rl`.
+This directory behavior-clones analytic herding teachers into a
+neural-network policy that runs in Webots. PPO from scratch and PPO
+fine-tuning of the BC policy were tried earlier and remain in
+`train_ppo.py` as experimental options, but only the BC route ships.

-## Layout
+## Files

 ```
-training/
-├── herding_env.py        # gymnasium.Env — the dog is the agent
-├── train_ppo.py          # SB3 PPO entry point (vec envs, eval, curriculum)
-├── eval.py               # rollout success-rate / time-to-pen across flock sizes
-├── parity_test.py        # smoke test: shapes, determinism, baseline rollout
-├── configs/ppo_default.yaml
-├── runs/                 # tensorboard + checkpoints (gitignored)
-└── requirements.txt
+herding_env.py     — Gymnasium env (used for demo collection + eval)
+bc_pretrain.py     — supervised MSE+cosine training of an SB3 MlpPolicy
+                     against (obs, action) demos
+eval.py            — analytic teachers + BC policies, full n=1..10 grid
+parity_test.py     — shape/determinism/baseline smoke test
+train_ppo.py       — PPO trainer (experimental — see Appendix below)
+configs/           — PPO hyperparameter YAML
+runs/              — checkpoints (.gitignored)
 ```

 ## Setup

-```bash
-python -m venv .venv && source .venv/bin/activate
-pip install -r training/requirements.txt
+```
+pip install -r requirements.txt
 ```

-CPU is the default and also the recommended device — SB3's PPO with an
-MLP policy of this size runs faster on CPU than on GPU because the
-bottleneck is rollout collection, not gradient compute. The 16 SubprocVecEnv
-workers saturate ~16 CPU cores. To force CUDA anyway, pass `--device cuda`.
+CPU is the default and recommended device — SB3 PPO with an MLP policy
+of this size runs faster on CPU than GPU because the bottleneck is
+rollout collection, not gradient compute.

-## Train
+## The BC pipeline

-```bash
-# Full curriculum (1 → 10 sheep), ~5M steps, ~2–3h on a single GPU.
-python -m training.train_ppo \
-    --config training/configs/ppo_default.yaml \
-    --out-dir training/runs/baseline
+```
+# 1. Generate demos from an analytic teacher.
+#    --teacher: strombom (default), sequential, drive_only, hybrid, strombom_smooth
+python -m tools.collect_demos --teacher strombom \
+    --out demos.npz --seeds-per-n 30 --subsample 3
+
+# 2. Behavior-clone the demos into an MLP policy.
+python -m training.bc_pretrain --demos demos.npz \
+    --out runs/bc_flock --epochs 100 --net-arch 512,512
+
+# 3. Evaluate the resulting policy.
+python -m training.eval --policy runs/bc_flock \
+    --max-flock 10 --max-steps 30000 --n-seeds 5
 ```

-Outputs:
- `training/runs/baseline/best/best_model.zip` — best eval checkpoint
- `training/runs/baseline/best/vecnormalize.pkl` — observation stats
- `training/runs/baseline/checkpoints/ppo_*.zip` — periodic checkpoints
- `training/runs/baseline/tb/` — TensorBoard logs (`tensorboard --logdir`)
+Wall time: ~10 min demos + ~5 min BC training + ~5 min eval.

-To resume:
+`bc_pretrain.py` saves the **best-val_cos** snapshot, not the final
+epoch — multi-modal teachers (Strömbom's collect/drive switch) make
+training noisy and the last epoch is often worse than an earlier one.

-```bash
-python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
+## Available analytic teachers
+
+| Name | What it does | Best for |
+|---|---|---|
+| `strombom` | Canonical Strömbom: collect when the flock is scattered, drive the CoM otherwise | Tight-cohesion regime, n=1..10 |
+| `sequential` | Pick the sheep closest to the pen and drive only that sheep | Loose-cohesion regime, n=1..10 |
+| `drive_only` | Strömbom drive without collect mode (continuous action) | Easier-to-BC alternative; less reliable than full Strömbom |
+| `hybrid` | Drive rearmost sheep when far, switch to closest near gate | Failed experiment, kept for write-up |
+| `strombom_smooth` | Sigmoid-blended Strömbom collect↔drive | Failed experiment |
+
+## Evaluating the analytic teachers directly
+
+```
+python -m training.eval --policy strombom    --max-flock 10 --max-steps 30000 --n-seeds 5
+python -m training.eval --policy sequential  --max-flock 10 --max-steps 30000 --n-seeds 5
 ```

-## Evaluate
+## Webots inference

-```bash
-# RL policy
-python -m training.eval --policy training/runs/baseline/best
+The Webots dog controller (`controllers/shepherd_dog/shepherd_dog.py`)
+loads a saved BC zip when launched in `rl` mode:

-# Strömbom baseline
-python -m training.eval --policy strombom
+```
+HERDING_POLICY_DIR=$PWD/runs/bc_flock tools/run_webots.sh 10 rl
 ```

-Prints success rate, mean steps, and mean penned-count per flock size.
-Use the same `--n-seeds` for both to get a fair RL-vs-Strömbom A/B.
+It auto-discovers a checkpoint named `policy.zip`, `best_model.zip`, or
+`final.zip` in the directory.

-## Parity / smoke test
+## Appendix — experimental PPO scripts

-```bash
-python -m training.parity_test
-```
+`train_ppo.py` contains the PPO/RL pipeline tried before BC:
+* PPO from scratch with curriculum learning over flock size + spawn area.
+* PPO fine-tune of a BC checkpoint.

-Checks observation/action shapes, deterministic seeding, the curriculum
-sampler, and a 400-step Strömbom rollout. Run this before every long
-training job — catches the boring class of bugs in seconds.
+Both ran into stability issues. PPO from scratch sees pen events too
+rarely during random exploration to credit-assign the +500 done bonus.
+In the fine-tune, PPO's exploration noise destroys the BC weights
+faster than the reward signal can rebuild them.

-## Run the policy in Webots
-
-1. Train (above) — produces `training/runs/<name>/best/`.
-2. In Webots, set the dog controller's environment variables:
-
-   ```bash
-   export HERDING_MODE=rl
-   export HERDING_POLICY_DIR=$(pwd)/training/runs/baseline/best
-   webots worlds/field.wbt
-   ```
-
-   Or set them via Webots' controller args / a `.wbproj` if you prefer.
-
-3. To force the Strömbom baseline (same world, same controller):
-
-   ```bash
-   export HERDING_MODE=strombom
-   webots worlds/field.wbt
-   ```
-
-If `HERDING_MODE=rl` but the policy can't be loaded (SB3 not installed,
-zip missing, etc.), the controller logs the error and falls back to
-Strömbom automatically.
-
-## Curriculum knobs
-
-The default schedule in `configs/ppo_default.yaml` widens
-`max_n_sheep` over training. Each reset samples `n_sheep ~ U[1,
-max_n_sheep]`, so the final policy has seen every flock size from 1 to
-10 in proportion. To pin a specific size, instantiate the env with
-`HerdingEnv(n_sheep=N)` (see `eval.py`).
-
-## Reward shaping
-
-Weights live in class attributes on `HerdingEnv`. Tune from the 1-sheep
-curriculum first — if the dog can't herd a single sheep cleanly, raising
-`W_PROGRESS` or lowering `W_TIME` is usually the fix. For multi-sheep
-collapse modes (dog spins between sheep), increase `W_COMPACT` so
-tightening the flock pays.
+The script is left in place because the abstractions are sound and the
+code is reusable for follow-up work (e.g. KL-regularised fine-tune
+with a frozen reference policy). It is not part of the deliverable pipeline.
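The README's Webots section says the controller auto-discovers a checkpoint named `policy.zip`, `best_model.zip`, or `final.zip`. A minimal sketch of such a lookup follows. It is not the controller's actual code: `find_checkpoint` is a hypothetical helper, and trying the names in the order the README lists them is an assumption, since the README does not state a priority.

```python
from pathlib import Path
from typing import Optional

# Candidate names come from the README; the lookup order is an assumption.
CHECKPOINT_NAMES = ("policy.zip", "best_model.zip", "final.zip")


def find_checkpoint(policy_dir: str) -> Optional[Path]:
    """Return the first candidate checkpoint present in policy_dir, else None."""
    root = Path(policy_dir)
    for name in CHECKPOINT_NAMES:
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None
```

Returning `None` rather than raising leaves the fallback decision (for example, switching to an analytic teacher) to the caller.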
@@ -1,20 +1,21 @@
-"""Behavior cloning of the sequential teacher into an SB3-compatible policy.
+"""Behavior cloning of an analytic teacher into an SB3-compatible policy.

-Trains the policy network (mean-action head) of an SB3 ``MlpPolicy`` to
-mimic the demonstrations collected by ``tools.collect_demos``. The
-saved zip is loadable via ``PPO.load(...)`` and can be passed to
-``train_ppo.py --resume`` for fine-tuning.
+Trains the policy network (mean-action head) of an SB3 ``MlpPolicy``
+to mimic the (obs, action) demonstrations produced by
+``tools.collect_demos``. The saved zip is loadable via ``PPO.load(...)``
+and is what the Webots dog controller uses in ``HERDING_MODE=rl``.

-Why this works: the teacher (sequential single-target driving) solves
-n=10 at 80%+ in our env. BC gives the RL a competent starting policy,
-so PPO doesn't have to discover behavior from scratch — it only has to
-*refine* the teacher's strategy via the sparse pen reward.
+Loss: MSE + (1 - cosine similarity). The cosine term is what stops
+the policy mean from collapsing toward zero against unit-vector
+targets. Best-by-val_cos checkpoint is restored at the end of training
+so noisy multi-modal teachers (e.g. Strömbom) don't lose progress when
+the last epoch lands on a bad gradient step.

 Usage::

    python -m training.bc_pretrain \\
        --demos training/demos.npz \\
-        --out training/runs/bc_pretrained
+        --out training/runs/bc_flock
 """

 from __future__ import annotations
@@ -80,7 +81,7 @@ def policy_forward_mean(policy, obs_batch):
 def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--demos", default="training/demos.npz")
-    parser.add_argument("--out", default="training/runs/bc_pretrained")
+    parser.add_argument("--out", default="training/runs/bc_solo")
    parser.add_argument("--epochs", type=int, default=60)
    parser.add_argument("--batch-size", type=int, default=256)
    parser.add_argument("--lr", type=float, default=1e-3)
@@ -147,6 +148,11 @@ def main():
          f"lr={args.lr}  device={args.device}")
    t_start = time.time()
    best_val = float("inf")
+    best_cos = -1.0
+    # Snapshot the best-by-val_cos policy weights and restore at the end —
+    # training is noisy on multi-modal teachers (e.g. Strömbom collect/drive),
+    # so the last epoch is often worse than an earlier one.
+    best_state = None

    def combined_loss(pred, target):
        mse = nn.functional.mse_loss(pred, target)
@@ -201,6 +207,14 @@ def main():
              f"val_mse={val_mse:.4f}  val_cos={cos_sim:+.3f}")
        if val_mse < best_val:
            best_val = val_mse
+        if cos_sim > best_cos:
+            best_cos = cos_sim
+            best_state = {k: v.detach().cpu().clone()
+                          for k, v in policy.state_dict().items()}
+
+    if best_state is not None:
+        policy.load_state_dict(best_state)
+        print(f"[bc] restored best-val_cos snapshot (cos={best_cos:.3f})")

    elapsed = time.time() - t_start
    print(f"[bc] done in {elapsed:.0f}s  best_val_mse={best_val:.4f}")
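The loss described in the module docstring, MSE + (1 - cosine similarity), can be illustrated with a stdlib-only sketch. This is a hypothetical scalar re-implementation for a single action vector, not the torch code in `bc_pretrain.py`; the equal weighting of the two terms and the `eps` guard are assumptions.

```python
import math
from typing import Sequence


def combined_loss(pred: Sequence[float], target: Sequence[float],
                  eps: float = 1e-8) -> float:
    """MSE + (1 - cosine similarity) for one action vector (illustrative)."""
    # Mean squared error over the action dimensions.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    # Cosine similarity depends only on direction, not on magnitude.
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.hypot(*pred) * math.hypot(*target)
    cos = dot / max(norm, eps)
    return mse + (1.0 - cos)


target = (1.0, 0.0)                             # unit-vector teacher action
right_dir_short = combined_loss((0.5, 0.0), target)  # only the MSE term is nonzero
wrong_dir = combined_loss((0.0, 1.0), target)        # both terms penalise
```

A prediction that points the right way but is too short pays only the MSE term, while a prediction that points the wrong way pays both, so the cosine term adds a direction signal that does not depend on the predicted magnitude.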
@@ -26,8 +26,8 @@ if _PROJECT_ROOT not in sys.path:
 import numpy as np

 from herding.geometry import MAX_SHEEP, PEN_ENTRY
-from herding.strombom import compute_action as strombom_action
 from herding.sequential import compute_action as sequential_action
+from herding.strombom import compute_action as strombom_action
 from training.herding_env import HerdingEnv
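The README describes the `sequential` teacher as picking the sheep closest to the pen and driving only that sheep. The sketch below shows one way that rule could look. It is not the code in `herding.sequential`: `pick_target`, `drive_point`, and the `standoff` distance are hypothetical, and placing the dog behind the sheep on the pen-to-sheep line is an assumed driving rule.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def pick_target(sheep: List[Point], pen: Point) -> int:
    """Index of the sheep closest to the pen."""
    return min(range(len(sheep)), key=lambda i: math.dist(sheep[i], pen))


def drive_point(sheep_pos: Point, pen: Point, standoff: float = 1.0) -> Point:
    """Point `standoff` behind the sheep on the pen-to-sheep line (assumed rule)."""
    dx, dy = sheep_pos[0] - pen[0], sheep_pos[1] - pen[1]
    d = math.hypot(dx, dy) or 1.0  # avoid dividing by zero if the sheep is at the pen
    return (sheep_pos[0] + standoff * dx / d, sheep_pos[1] + standoff * dy / d)


flock = [(5.0, 0.0), (2.0, 0.0), (0.0, 9.0)]
pen = (0.0, 0.0)
idx = pick_target(flock, pen)
goal = drive_point(flock[idx], pen)
```

Standing on the far side of the target pushes it toward the pen under the usual repulsion-from-dog sheep model; a real teacher would also need to steer around the rest of the flock.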


@@ -1,18 +1,31 @@
-"""Train a PPO shepherd-dog policy on ``HerdingEnv`` with curriculum.
+"""PPO trainer for the shepherd-dog policy — EXPERIMENTAL.

-Defaults to 16 parallel ``SubprocVecEnv`` workers feeding a GPU policy.
-Saves checkpoints, the best-eval model, and the VecNormalize stats —
-all three are needed at inference time by the Webots controller.
+The deliverable pipeline is ``bc_pretrain.py`` (see ``training/README.md``).
+This script is kept in the tree because it implements:

-Usage::
+* PPO from scratch with curriculum over flock size + spawn area, and
+* PPO fine-tune of a behavior-cloned policy.
+
+Both ran into stability issues in our setting (long-horizon credit
+assignment for sparse pen reward, BC-degradation under PPO exploration
+noise). The abstractions are reusable for follow-up work — e.g.
+KL-regularised fine-tune with a frozen reference policy — so we leave
+the code in place.
+
+Usage (PPO from scratch)::

    python -m training.train_ppo \
        --config training/configs/ppo_default.yaml \
-        --out-dir training/runs/baseline
+        --out-dir training/runs/ppo_scratch

-To resume from a checkpoint::
+Usage (PPO fine-tune of BC)::

-    python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
+    python -m training.train_ppo \
+        --resume training/runs/bc_flock/policy.zip \
+        --out-dir training/runs/bc_ppo \
+        --no-vecnorm --no-curriculum --imitate-weight 0 \
+        --difficulty 1.0 --log-std -1.5 --learning-rate 5e-5 \
+        --total-timesteps 3000000
 """

 from __future__ import annotations