# Training pipeline

Behavior cloning of analytic herding teachers into a neural network policy that runs in Webots. PPO from scratch and PPO fine-tune of BC were tried earlier and are kept under `train_ppo.py` as experimental options, but the BC route alone is what we ship.

## Files

```
herding_env.py   - Gymnasium env (used for demo collection + eval)
bc_pretrain.py   - supervised MSE+cosine training of an SB3 MlpPolicy against (obs, action) demos
eval.py          - analytic teachers + BC policies, full n=1..10 grid
parity_test.py   - shape/determinism/baseline smoke test
train_ppo.py     - PPO trainer (experimental, see Appendix below)
configs/         - PPO hyperparameter YAML
runs/            - checkpoints (.gitignored)
```

## Setup

```
pip install -r requirements.txt
```

CPU is the default and recommended device: SB3 PPO with an MLP policy of this size runs faster on CPU than on GPU because the bottleneck is rollout collection, not gradient compute.

## The BC pipeline

```
# 1. Generate demos from an analytic teacher.
#    --teacher: strombom (default), sequential, drive_only, hybrid, strombom_smooth
python -m tools.collect_demos --teacher strombom \
    --out demos.npz --seeds-per-n 30 --subsample 3

# 2. Behavior-clone the demos into an MLP policy.
python -m training.bc_pretrain --demos demos.npz \
    --out runs/bc_flock --epochs 100 --net-arch 512,512

# 3. Evaluate the resulting policy.
python -m training.eval --policy runs/bc_flock \
    --max-flock 10 --max-steps 30000 --n-seeds 5
```

Wall time: ~10 min demos + ~5 min BC training + ~5 min eval.

`bc_pretrain.py` saves the **best-val_cos** snapshot, not the final epoch: multi-modal teachers (Strömbom's collect/drive switch) make training noisy, and the last epoch is often worse than an earlier one.
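The best-val_cos rule above can be illustrated with a small NumPy sketch. It is not the code in `bc_pretrain.py`: the function names, the `cos_weight` knob, and the way the MSE and cosine terms are combined are all assumptions made for illustration, and the validation scores in the usage loop are made up.

```python
import numpy as np


def mean_cosine(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Mean cosine similarity between predicted and teacher action vectors."""
    dot = np.sum(pred * target, axis=1)
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1)
    return float(np.mean(dot / (norms + eps)))


def bc_loss(pred: np.ndarray, target: np.ndarray, cos_weight: float = 1.0) -> float:
    """MSE plus a cosine-distance term. cos_weight is a hypothetical knob."""
    mse = float(np.mean((pred - target) ** 2))
    return mse + cos_weight * (1.0 - mean_cosine(pred, target))


class BestSnapshot:
    """Keep a copy of the parameters from the epoch with the highest val cosine."""

    def __init__(self):
        self.best_cos = -np.inf
        self.best_epoch = None
        self.params = None

    def update(self, epoch: int, val_cos: float, params: dict) -> bool:
        # Only overwrite the stored snapshot on a strict improvement.
        if val_cos > self.best_cos:
            self.best_cos = val_cos
            self.best_epoch = epoch
            self.params = {k: v.copy() for k, v in params.items()}
            return True
        return False


# Usage with made-up validation scores: training peaks at epoch 2, then degrades.
tracker = BestSnapshot()
for epoch, val_cos in enumerate([0.71, 0.88, 0.93, 0.90, 0.86]):
    params = {"w": np.full(2, float(epoch))}  # stand-in for network weights
    tracker.update(epoch, val_cos, params)

print(tracker.best_epoch, tracker.best_cos)  # 2 0.93
```

Selecting on validation cosine rather than on the last epoch means a late noisy epoch cannot overwrite a better earlier snapshot.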
## Available analytic teachers

| Name | What it does | Best for |
|---|---|---|
| `strombom` | Canonical Strömbom: collect when flock is scattered, drive CoM otherwise | Tight-cohesion regime, n=1-10 |
| `sequential` | Pick the sheep closest to the pen and drive only it | Loose-cohesion regime, n=1-10 |
| `drive_only` | Strömbom drive without collect mode (continuous action) | Easier-to-BC alternative; less reliable than full Strömbom |
| `hybrid` | Drive rearmost sheep when far, switch to closest near gate | Failed experiment, kept for write-up |
| `strombom_smooth` | Sigmoid-blended Strömbom collect↔drive | Failed experiment |

## Evaluating the analytic teachers directly

```
python -m training.eval --policy strombom --max-flock 10 --max-steps 30000 --n-seeds 5
python -m training.eval --policy sequential --max-flock 10 --max-steps 30000 --n-seeds 5
```

## Webots inference

The Webots dog controller (`controllers/shepherd_dog/shepherd_dog.py`) loads a saved BC zip when launched in `rl` mode:

```
HERDING_POLICY_DIR=$PWD/runs/bc_flock tools/run_webots.sh 10 rl
```

It auto-discovers a checkpoint named `policy.zip`, `best_model.zip`, or `final.zip` in the directory.

## Appendix: experimental PPO scripts

`train_ppo.py` contains the PPO/RL pipeline tried before BC:

* PPO from scratch with curriculum learning over flock size + spawn area.
* PPO fine-tune of a BC checkpoint.

Both ran into stability issues (PPO's exploration noise destroys BC weights faster than the reward signal can rebuild them; PPO from scratch never sees pen events often enough during random exploration to credit-assign the +500 done bonus). The script is left in place because the abstractions are sound and the code is reusable for follow-up work (e.g. KL-regularised fine-tune with a frozen reference policy). Not part of the deliverable pipeline.
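As a rough illustration of the KL-regularised follow-up idea mentioned above, the penalty term could be sketched as below. Nothing here exists in `train_ppo.py`: the function names, the `beta` coefficient, and the choice of KL(policy || reference) are assumptions, and the sketch assumes both policies output diagonal Gaussians over continuous actions.

```python
import numpy as np


def diag_gaussian_kl(mu_p, log_std_p, mu_q, log_std_q):
    """KL(p || q) for diagonal Gaussians, summed over action dims, one value per sample."""
    var_p = np.exp(2.0 * log_std_p)
    var_q = np.exp(2.0 * log_std_q)
    per_dim = (
        (log_std_q - log_std_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )
    return per_dim.sum(axis=-1)


def kl_regularised_loss(ppo_loss, mu, log_std, ref_mu, ref_log_std, beta=0.1):
    """Add beta * mean KL(policy || frozen reference) to a PPO loss value.

    beta is a hypothetical coefficient; the reference outputs would come from
    the frozen BC checkpoint evaluated on the same observations.
    """
    kl = diag_gaussian_kl(mu, log_std, ref_mu, ref_log_std).mean()
    return ppo_loss + beta * float(kl)


# Usage: a policy identical to the reference pays no penalty,
# while a shifted mean is penalised in proportion to the squared shift.
zeros = np.zeros((1, 2))
shifted = np.array([[1.0, 0.0]])
print(kl_regularised_loss(2.0, zeros, zeros, zeros, zeros))    # 2.0
print(kl_regularised_loss(2.0, shifted, zeros, zeros, zeros))  # 2.05
```

The penalty pulls the fine-tuned policy back toward the BC weights, which is the failure mode described above for plain PPO fine-tuning.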