# Training pipeline

Two stages, strictly sequential:

```
sim demos (Strömbom on tracker output, K=4 frame stack)
        │
        ▼
  bc_pretrain.py ──► runs/bc   (Strömbom-imitated MLP)
        │
        ▼
  KL-regularised PPO fine-tune
        │
     runs/rl   (deployed `rl` mode — beats BC and Strömbom)
```

## Files

```
herding_env.py   — Gymnasium env (LiDAR raycast + tracker by default)
bc_pretrain.py   — MSE + cosine BC of (obs, action) demos into MlpPolicy
train_ppo.py     — KL-regularised PPO fine-tune of a BC checkpoint
eval.py          — multi-seed analytic / learned policy comparison
runs/            — checkpoints (whitelisted entries in top-level .gitignore)
```

## Setup

```
pip install -r requirements.txt
```

CPU is the default and recommended device: SB3 PPO with an MLP policy of this size runs faster on CPU than GPU because the bottleneck is rollout collection, not gradient compute.

## End-to-end pipeline

```bash
# 1. Sim demos with the active-scan + Strömbom teacher under LiDAR
#    perception. K=4 frame stack so the MLP has temporal context.
python -m tools.collect_demos --teacher strombom \
    --out training/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4

# 2. Behaviour-clone.
python -m training.bc_pretrain --demos training/demos.npz \
    --out training/runs/bc --epochs 60 --net-arch 512,512

# 3. KL-regularised PPO fine-tune of bc.
python -m training.train_ppo \
    --bc training/runs/bc --out training/runs/rl \
    --total-timesteps 1000000

# 4. Multi-seed eval (env-side, fast).
python -m training.eval --policy training/runs/rl \
    --max-flock 10 --max-steps 15000 --n-seeds 10
```

`bc_pretrain.py` saves the **best-val_cos** snapshot, not the final epoch: multi-modal teachers make training noisy, and the last epoch is often worse than an earlier one.

`train_ppo.py` loads BC weights into both a trainable policy and a frozen reference, fixes `log_std` small, and adds `β · KL(π‖π_ref)` to the loss so the policy can only move within a trust region around BC. See the file header for hyperparameter rationale.
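To make the `β · KL(π‖π_ref)` term concrete, here is a minimal sketch in plain Python. It assumes a diagonal Gaussian action distribution (SB3's default for continuous action spaces). The function names `diag_gaussian_kl` and `kl_regularised_loss` are hypothetical and not taken from `train_ppo.py`, which presumably computes the same quantity on batched tensors.

```python
import math
from typing import Sequence


def diag_gaussian_kl(
    mu_p: Sequence[float],
    log_std_p: Sequence[float],
    mu_q: Sequence[float],
    log_std_q: Sequence[float],
) -> float:
    """KL(p || q) between two diagonal Gaussians, summed over action dims."""
    kl = 0.0
    for mp, lp, mq, lq in zip(mu_p, log_std_p, mu_q, log_std_q):
        var_p = math.exp(2.0 * lp)
        var_q = math.exp(2.0 * lq)
        # Closed form per dimension:
        # log(sigma_q / sigma_p) + (var_p + (mu_p - mu_q)^2) / (2 var_q) - 1/2
        kl += (lq - lp) + (var_p + (mp - mq) ** 2) / (2.0 * var_q) - 0.5
    return kl


def kl_regularised_loss(
    ppo_loss: float,
    beta: float,
    mu: Sequence[float],
    log_std: Sequence[float],
    mu_ref: Sequence[float],
    log_std_ref: Sequence[float],
) -> float:
    """Hypothetical helper: PPO loss plus beta times KL to the frozen BC reference."""
    return ppo_loss + beta * diag_gaussian_kl(mu, log_std, mu_ref, log_std_ref)
```

If the trainable policy and the reference share the same fixed `log_std`, the penalty reduces to the squared difference of the action means divided by `2σ²`. A small fixed `log_std` therefore makes the penalty steep, so even small deviations from the BC action mean are costly.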
## Available analytic teachers

| Name | What it does | Notes |
|---|---|---|
| `strombom` | Strömbom 2014: collect when flock is scattered, drive CoM otherwise | Default; works for n=1–10 under tight cohesion |
| `sequential` | Pick the sheep closest to the pen and drive only it | Alternative; needs loose-cohesion regime |

Both are wrapped at demo-collection time in `herding/control/active_scan.py:ActiveScanTeacher`, which adds an opening in-place rotation, walk-to-centre when the LiDAR sees nothing, and near-sheep speed modulation (the same modulation that `herding/control/modulation.py` applies to every dog mode at inference).

## Evaluating analytic teachers directly

```
python -m training.eval --policy strombom   --max-flock 10 --max-steps 15000 --n-seeds 10
python -m training.eval --policy sequential --max-flock 10 --max-steps 15000 --n-seeds 10
```

## Webots inference

```
tools/run_webots.sh 10 bc      # or rl, strombom, sequential
```

The dog controller loads `runs/bc` for `bc` mode and `runs/rl` for `rl` mode. Override with `HERDING_POLICY_DIR=…` for a specific checkpoint.
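The checkpoint lookup described above could be sketched as follows. This is an illustration only: `resolve_policy_dir` and `DEFAULT_DIRS` are hypothetical names, and the precedence shown (the environment variable wins over the per-mode default, and analytic modes need no checkpoint) is an assumption rather than the controller's confirmed behaviour.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Per-mode defaults as stated in this README; analytic modes have no entry.
DEFAULT_DIRS = {"bc": Path("runs/bc"), "rl": Path("runs/rl")}


def resolve_policy_dir(
    mode: str, env: Optional[Mapping[str, str]] = None
) -> Optional[Path]:
    """Return the checkpoint directory for a learned mode, or None otherwise."""
    if mode not in DEFAULT_DIRS:
        # Assumed: strombom / sequential are analytic and load no checkpoint.
        return None
    env = os.environ if env is None else env
    override = env.get("HERDING_POLICY_DIR")
    # Assumed: a non-empty override takes precedence over the default.
    return Path(override) if override else DEFAULT_DIRS[mode]
```

Passing `env` explicitly keeps the function easy to test without mutating the real process environment.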