Checkpoint 3

2026-05-10 12:46:14 +01:00
parent 1bb9415414
commit 2a6db038df
16 changed files with 305 additions and 662 deletions
@@ -1,18 +1,31 @@
-"""Train a PPO shepherd-dog policy on ``HerdingEnv`` with curriculum.
+"""PPO trainer for the shepherd-dog policy — EXPERIMENTAL.

-Defaults to 16 parallel ``SubprocVecEnv`` workers feeding a GPU policy.
-Saves checkpoints, the best-eval model, and the VecNormalize stats —
-all three are needed at inference time by the Webots controller.
+The deliverable pipeline is ``bc_pretrain.py`` (see ``training/README.md``).
+This script is kept in the tree because it implements:

-Usage::
+* PPO from scratch with curriculum over flock size + spawn area, and
+* PPO fine-tuning of a behavior-cloned (BC) policy.
+
+Both ran into stability issues in our setting: long-horizon credit
+assignment under the sparse pen reward, and degradation of the
+behavior-cloned policy under PPO exploration noise. The abstractions
+are reusable for follow-up work, e.g. KL-regularised fine-tuning
+against a frozen reference policy, so we leave the code in place.
+
+Usage (PPO from scratch)::

    python -m training.train_ppo \
        --config training/configs/ppo_default.yaml \
-        --out-dir training/runs/baseline
+        --out-dir training/runs/ppo_scratch

-To resume from a checkpoint::
+Usage (PPO fine-tuning of the BC policy)::

-    python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
+    python -m training.train_ppo \
+        --resume training/runs/bc_flock/policy.zip \
+        --out-dir training/runs/bc_ppo \
+        --no-vecnorm --no-curriculum --imitate-weight 0 \
+        --difficulty 1.0 --log-std -1.5 --learning-rate 5e-5 \
+        --total-timesteps 3000000
 """

 from __future__ import annotations