Checkpoint 7

2026-05-11 12:21:51 +01:00
parent fce0e0c786
commit a01a5c9cef
34 changed files with 1266 additions and 1038 deletions
@@ -6,7 +6,7 @@ Two stages, strictly sequential:
 sim demos (Strömbom on tracker output, K=4 frame stack)
    │
    ▼
-bc_pretrain.py  ──►  runs/bc   (Strömbom-imitated MLP)
+bc/pretrain.py  ──►  runs/bc   (Strömbom-imitated MLP)
    │
    ▼  KL-regularised PPO fine-tune
    │
@@ -17,10 +17,13 @@ runs/rl                        (deployed `rl` mode — beats BC and Strömbom)

 ```
 herding_env.py     — Gymnasium env (LiDAR raycast + tracker by default)
-bc_pretrain.py     — MSE + cosine BC of (obs, action) demos into MlpPolicy
-train_ppo.py       — KL-regularised PPO fine-tune of a BC checkpoint
+bc/pretrain.py     — MSE + cosine BC of (obs, action) demos into MlpPolicy
+rl/train.py        — KL-regularised PPO fine-tune of a BC checkpoint
 eval.py            — multi-seed analytic / learned policy comparison
 runs/              — checkpoints (whitelisted entries in top-level .gitignore)
+
+(Unit + integration tests live in the top-level ``tests/`` directory;
+run with ``python -m pytest tests/``.)
 ```

 ## Setup
@@ -35,18 +38,23 @@ rollout collection, not gradient compute.

 ## End-to-end pipeline

+The simplest way to run everything is the Makefile at the project
+root: ``make`` runs the full chain, and a stage target such as
+``make rl`` rebuilds whatever is needed up to that stage. The individual
+stages are listed below for cases where you want to tune a single step.
+
 ```bash
 # 1. Sim demos with the active-scan + Strömbom teacher under LiDAR
 #    perception. K=4 frame stack so the MLP has temporal context.
-python -m tools.collect_demos --teacher strombom \
-    --out training/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
+python -m training.bc.collect --teacher strombom \
+    --out training/bc/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4

 # 2. Behaviour-clone.
-python -m training.bc_pretrain --demos training/demos.npz \
+python -m training.bc.pretrain --demos training/bc/demos.npz \
    --out training/runs/bc --epochs 60 --net-arch 512,512

 # 3. KL-regularised PPO fine-tune of bc.
-python -m training.train_ppo \
+python -m training.rl.train \
    --bc training/runs/bc --out training/runs/rl \
    --total-timesteps 1000000

@@ -55,11 +63,11 @@ python -m training.eval --policy training/runs/rl \
    --max-flock 10 --max-steps 15000 --n-seeds 10
 ```

-`bc_pretrain.py` saves the **best-val_cos** snapshot, not the final
+`bc/pretrain.py` saves the **best-val_cos** snapshot, not the final
 epoch — multi-modal teachers make training noisy and the last epoch is
 often worse than an earlier one.

-`train_ppo.py` loads BC weights into both a trainable policy and a
+`rl/train.py` loads BC weights into both a trainable policy and a
 frozen reference, fixes `log_std` small, and adds `β · KL(π‖π_ref)` to
 the loss so the policy can only move within a trust region around BC.
 See the file header for hyperparameter rationale.
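
The `β · KL(π‖π_ref)` penalty described above can be illustrated with a minimal sketch. It assumes both the trainable policy and the frozen reference are diagonal Gaussians over actions, which is an assumption and not something the diff confirms. The function names `diag_gaussian_kl` and `kl_regularised_loss` are hypothetical and are not taken from `rl/train.py`; the real script may compute the term differently (for example on batched tensors).

```python
import math
from typing import Sequence


def diag_gaussian_kl(
    mu: Sequence[float],
    log_std: Sequence[float],
    mu_ref: Sequence[float],
    log_std_ref: Sequence[float],
) -> float:
    """KL(pi || pi_ref) for two diagonal Gaussians, summed over action dims.

    Hypothetical helper for illustration only.
    """
    total = 0.0
    for m, ls, mr, lsr in zip(mu, log_std, mu_ref, log_std_ref):
        var = math.exp(2.0 * ls)
        var_ref = math.exp(2.0 * lsr)
        # Closed-form KL between two univariate Gaussians.
        total += (lsr - ls) + (var + (m - mr) ** 2) / (2.0 * var_ref) - 0.5
    return total


def kl_regularised_loss(ppo_loss: float, kl: float, beta: float) -> float:
    """Add the trust-region penalty beta * KL to a base PPO loss value."""
    return ppo_loss + beta * kl


if __name__ == "__main__":
    # With a shared, small, fixed log_std the KL reduces to a scaled
    # squared distance between the two action means.
    small = math.log(0.1)
    kl = diag_gaussian_kl([0.5, 0.0], [small, small], [0.0, 0.0], [small, small])
    print(kl_regularised_loss(ppo_loss=1.0, kl=kl, beta=0.5))
```

When `log_std` is fixed and identical in both policies, the log-ratio and variance terms cancel and the penalty becomes `‖μ − μ_ref‖² / (2σ²)`, so a small `σ` makes even modest departures from the BC action means expensive. This is one way to read the "trust region around BC" wording above.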