Checkpoint 3

Johnny Fernandes
2026-05-10 12:46:14 +01:00
parent 1bb9415414
commit 2a6db038df
16 changed files with 305 additions and 662 deletions
+21 -8
@@ -1,18 +1,31 @@
-"""Train a PPO shepherd-dog policy on ``HerdingEnv`` with curriculum.
+"""PPO trainer for the shepherd-dog policy — EXPERIMENTAL.
-Defaults to 16 parallel ``SubprocVecEnv`` workers feeding a GPU policy.
-Saves checkpoints, the best-eval model, and the VecNormalize stats —
-all three are needed at inference time by the Webots controller.
+The deliverable pipeline is `bc_pretrain.py` (see ``training/README.md``).
+This script is kept in the tree because it implements:
-Usage::
+  * PPO from scratch with curriculum over flock size + spawn area, and
+  * PPO fine-tune of a behavior-cloned policy.
+Both ran into stability issues in our setting (long-horizon credit
+assignment for sparse pen reward, BC-degradation under PPO exploration
+noise). The abstractions are reusable for follow-up work — e.g.
+KL-regularised fine-tune with a frozen reference policy — so we leave
+the code in place.
+Usage (PPO from scratch)::
     python -m training.train_ppo \
         --config training/configs/ppo_default.yaml \
-        --out-dir training/runs/baseline
+        --out-dir training/runs/ppo_scratch
-To resume from a checkpoint::
+Usage (PPO fine-tune of BC)::
-    python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
+    python -m training.train_ppo \
+        --resume training/runs/bc_flock/policy.zip \
+        --out-dir training/runs/bc_ppo \
+        --no-vecnorm --no-curriculum --imitate-weight 0 \
+        --difficulty 1.0 --log-std -1.5 --learning-rate 5e-5 \
+        --total-timesteps 3000000
 """
 from __future__ import annotations
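Note on the follow-up named in the new docstring: KL-regularised fine-tuning with a frozen reference policy is mentioned but not implemented in this hunk. A minimal sketch of the penalty term is given below, assuming diagonal-Gaussian action distributions (suggested by the `--log-std` flag, but not confirmed by the source). The names `diag_gaussian_kl`, `kl_regularised_loss`, and the coefficient `beta` are hypothetical and are not part of `train_ppo.py`. The direction KL(new || reference) is one possible choice.

```python
import math
from typing import Sequence


def diag_gaussian_kl(
    mean_p: Sequence[float],
    log_std_p: Sequence[float],
    mean_q: Sequence[float],
    log_std_q: Sequence[float],
) -> float:
    """Closed-form KL(p || q) for diagonal Gaussians, summed over action dims."""
    total = 0.0
    for mp, lp, mq, lq in zip(mean_p, log_std_p, mean_q, log_std_q):
        var_p = math.exp(2.0 * lp)
        var_q = math.exp(2.0 * lq)
        # log(sigma_q / sigma_p) + (sigma_p^2 + (mu_p - mu_q)^2) / (2 sigma_q^2) - 1/2
        total += (lq - lp) + (var_p + (mp - mq) ** 2) / (2.0 * var_q) - 0.5
    return total


def kl_regularised_loss(
    ppo_loss: float, kl_per_sample: Sequence[float], beta: float = 0.1
) -> float:
    """Hypothetical objective: PPO loss plus beta times the batch-mean KL
    between the fine-tuned policy and the frozen (e.g. BC) reference."""
    return ppo_loss + beta * sum(kl_per_sample) / len(kl_per_sample)


# Example: the fine-tuned policy's mean has drifted by 1.0 in the first action dim.
kl = diag_gaussian_kl([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
loss = kl_regularised_loss(1.0, [kl], beta=0.1)
```

In a real trainer the reference means and log-stds would come from a copy of the behavior-cloned policy with gradients disabled, and `beta` would need tuning. A larger value keeps the policy closer to the BC behaviour at the cost of slower improvement.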