# Training and Evaluation Details

This file is the command-level companion to the root README. It focuses on data collection, BC, PPO fine-tuning, evaluation flags, and generated artifacts; use the root README for the high-level architecture and Webots demo quick start.

Two stages, strictly sequential:

```
sim demos (Strömbom on tracker output, K=4 frame stack)
      │
      ▼
bc/pretrain.py ──► runs/bc      (Strömbom-imitated MLP)
      │
      ▼  KL-regularised PPO fine-tune
      │
   runs/rl                      (deployed `rl` mode — beats BC and Strömbom)
```

## Files

```
herding_env.py   — Gymnasium env (LiDAR raycast + tracker by default)
bc/pretrain.py   — MSE + cosine BC of (obs, action) demos into MlpPolicy
rl/train.py      — KL-regularised PPO fine-tune of a BC checkpoint
eval.py          — multi-seed analytic / learned policy comparison
runs/            — checkpoints (whitelisted entries in top-level .gitignore)

(Unit + integration tests live in the top-level ``tests/`` directory;
 run with ``python -m pytest tests/``.)
```

## End-to-end pipeline

The simplest way to run everything is the Makefile at the project root: ``make`` does the full chain, ``make rl`` rebuilds whatever's needed up to that point, etc. The individual stages below are kept explicit for cases where you want to tune a single step.

```bash
# 1. Sim demos with the active-scan + Strömbom teacher under LiDAR
#    perception. K=4 frame stack so the MLP has temporal context.
python -m training.bc.collect --teacher strombom \
    --out training/bc/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4

# 2. Behaviour-clone.
python -m training.bc.pretrain --demos training/bc/demos.npz \
    --out training/runs/bc --epochs 60 --net-arch 512,512

# 3. KL-regularised PPO fine-tune of bc.
python -m training.rl.train \
    --bc training/runs/bc --out training/runs/rl \
    --total-timesteps 1000000

# 4. Multi-seed eval (env-side, fast).
python -m training.eval --policy training/runs/rl \
    --max-flock 10 --max-steps 15000 --n-seeds 10
```

`bc/pretrain.py` saves the **best-val_cos** snapshot, not the final epoch — multi-modal teachers make training noisy and the last epoch is often worse than an earlier one.

`rl/train.py` loads BC weights into both a trainable policy and a frozen reference, fixes `log_std` small, and adds `β · KL(π‖π_ref)` to the loss so the policy can only move within a trust region around BC. See the file header for hyperparameter rationale.

## Available analytic teachers

| Name | What it does | Notes |
|---|---|---|
| `strombom` | Strömbom 2014 — collect when flock is scattered, drive CoM otherwise | Default; works for n=1–10 under tight cohesion |
| `sequential` | Pick the sheep closest to the pen and drive only it | Alternative; needs loose-cohesion regime |

Both are wrapped at demo-collection time in `herding/control/active_scan.py:ActiveScanTeacher`, which adds an opening in-place rotation, walk-to-centre when the LiDAR sees nothing, and near-sheep speed modulation (same modulation `herding/control/modulation.py` applies to every dog mode at inference).

## Evaluating analytic teachers directly

```
python -m training.eval --policy strombom --max-flock 10 --max-steps 15000 --n-seeds 10
python -m training.eval --policy sequential --max-flock 10 --max-steps 15000 --n-seeds 10
```
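
For readers who want intuition for the `β · KL(π‖π_ref)` term that `rl/train.py` adds during fine-tuning, here is a minimal sketch. It assumes both the trainable policy and the frozen BC reference are diagonal Gaussians sharing the same fixed `log_std`, in which case the KL reduces to a scaled squared distance between action means. The function names and the way the penalty is added to a scalar PPO loss are illustrative assumptions, not the code in `rl/train.py`.

```python
import math


def gaussian_kl(mu, mu_ref, log_std):
    """KL(pi || pi_ref) for two diagonal Gaussians with one shared, fixed log_std.

    With equal variances the log-determinant and trace terms cancel,
    leaving sum((mu - mu_ref)^2) / (2 * sigma^2).
    """
    var = math.exp(2.0 * log_std)
    return sum((m - r) ** 2 for m, r in zip(mu, mu_ref)) / (2.0 * var)


def regularised_loss(ppo_loss, mu, mu_ref, log_std, beta):
    """Hypothetical combination: PPO surrogate loss plus beta-weighted KL penalty."""
    return ppo_loss + beta * gaussian_kl(mu, mu_ref, log_std)


if __name__ == "__main__":
    # Identical means: the penalty vanishes and the PPO loss is unchanged.
    print(regularised_loss(1.0, [0.5, -0.2], [0.5, -0.2], log_std=-2.0, beta=0.1))
    # The same mean shift costs more as log_std shrinks.
    print(gaussian_kl([0.1], [0.0], log_std=0.0))
    print(gaussian_kl([0.1], [0.0], log_std=-2.0))
```

Under this assumption, fixing `log_std` small makes the penalty steep: a given drift of the action mean away from BC is divided by a small variance, which is one way to read the "trust region around BC" behaviour described above.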
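
The "MSE + cosine" objective and the best-`val_cos` snapshot rule mentioned for `bc/pretrain.py` can be illustrated with a small sketch. The exact weighting and formulation used in the repository are not stated here, so the `cos_weight` parameter, the `1 - cosine` form of the direction term, and the sample validation numbers are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two action vectors (eps guards zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def bc_loss(pred, target, cos_weight=1.0):
    """Hypothetical BC loss: mean squared error plus a direction penalty."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return mse + cos_weight * (1.0 - cosine_similarity(pred, target))


def best_epoch(val_cos_per_epoch):
    """Index of the epoch with the highest validation cosine (first one on ties)."""
    return max(range(len(val_cos_per_epoch)), key=val_cos_per_epoch.__getitem__)


if __name__ == "__main__":
    print(bc_loss([1.0, 0.0], [1.0, 0.0]))   # perfect match
    print(bc_loss([0.0, 1.0], [1.0, 0.0]))   # orthogonal prediction
    # Made-up validation curve: the last epoch is not the best one.
    print(best_epoch([0.70, 0.91, 0.88]))
```

Selecting by validation cosine rather than taking the last epoch matches the rationale given above: with a multi-modal teacher the curve is noisy, so the final checkpoint can be worse than an earlier one.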