Checkpoint 3
+21
-8
@@ -1,18 +1,31 @@
-"""Train a PPO shepherd-dog policy on ``HerdingEnv`` with curriculum.
+"""PPO trainer for the shepherd-dog policy — EXPERIMENTAL.
 
-Defaults to 16 parallel ``SubprocVecEnv`` workers feeding a GPU policy.
-Saves checkpoints, the best-eval model, and the VecNormalize stats —
-all three are needed at inference time by the Webots controller.
+The deliverable pipeline is `bc_pretrain.py` (see ``training/README.md``).
+This script is kept in the tree because it implements:
 
-Usage::
+* PPO from scratch with curriculum over flock size + spawn area, and
+* PPO fine-tune of a behavior-cloned policy.
+
+Both ran into stability issues in our setting (long-horizon credit
+assignment for sparse pen reward, BC-degradation under PPO exploration
+noise). The abstractions are reusable for follow-up work — e.g.
+KL-regularised fine-tune with a frozen reference policy — so we leave
+the code in place.
+
+Usage (PPO from scratch)::
 
     python -m training.train_ppo \
         --config training/configs/ppo_default.yaml \
-        --out-dir training/runs/baseline
+        --out-dir training/runs/ppo_scratch
 
-To resume from a checkpoint::
+Usage (PPO fine-tune of BC)::
 
-    python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
+    python -m training.train_ppo \
+        --resume training/runs/bc_flock/policy.zip \
+        --out-dir training/runs/bc_ppo \
+        --no-vecnorm --no-curriculum --imitate-weight 0 \
+        --difficulty 1.0 --log-std -1.5 --learning-rate 5e-5 \
+        --total-timesteps 3000000
 """
 
 from __future__ import annotations
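
The new docstring names a KL-regularised fine-tune with a frozen reference policy as possible follow-up work. A minimal sketch of that idea follows, assuming diagonal-Gaussian action distributions; nothing here is taken from `train_ppo.py`. The function names, the KL direction (current policy against the frozen reference), and the `beta` weight are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def diag_gaussian_kl(
    mean_p: Sequence[float],
    log_std_p: Sequence[float],
    mean_q: Sequence[float],
    log_std_q: Sequence[float],
) -> float:
    """Closed-form KL(p || q) for diagonal Gaussians, summed over action dims.

    Assumption: p is the policy being fine-tuned and q is the frozen
    reference (e.g. the behavior-cloned policy).
    """
    total = 0.0
    for mp, lsp, mq, lsq in zip(mean_p, log_std_p, mean_q, log_std_q):
        var_p = math.exp(2.0 * lsp)
        var_q = math.exp(2.0 * lsq)
        # Per-dimension term: log(sigma_q / sigma_p)
        #   + (sigma_p^2 + (mu_p - mu_q)^2) / (2 sigma_q^2) - 1/2
        total += (lsq - lsp) + (var_p + (mp - mq) ** 2) / (2.0 * var_q) - 0.5
    return total


def kl_regularised_loss(ppo_loss: float, kl: float, beta: float = 0.1) -> float:
    """Add a KL penalty to a PPO loss; beta is a hypothetical weight."""
    return ppo_loss + beta * kl
```

The penalty is zero when the fine-tuned policy matches the reference and grows as the action means or standard deviations drift apart, which is one way to limit the BC degradation the docstring describes. Whether a fixed `beta` suffices, or it needs a schedule, is left open here.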