Johnny Fernandes 5c2ee4bba5 Checkpoint 8
2026-05-12 22:41:03 +01:00

# Training and Evaluation Details
This file is the command-level companion to the root README. It focuses
on data collection, BC, PPO fine-tuning, evaluation flags, and generated
artifacts; use the root README for the high-level architecture and
Webots demo quick start.
Two stages, strictly sequential:
```
sim demos (Strömbom on tracker output, K=4 frame stack)
      │
      ▼
bc/pretrain.py ──► runs/bc (Strömbom-imitated MLP)
      │
      ▼ KL-regularised PPO fine-tune
runs/rl (deployed `rl` mode — beats BC and Strömbom)
```
## Files
```
herding_env.py — Gymnasium env (LiDAR raycast + tracker by default)
bc/pretrain.py — MSE + cosine BC of (obs, action) demos into MlpPolicy
rl/train.py — KL-regularised PPO fine-tune of a BC checkpoint
eval.py — multi-seed analytic / learned policy comparison
runs/ — checkpoints (whitelisted entries in top-level .gitignore)
```
Unit and integration tests live in the top-level `tests/` directory;
run them with `python -m pytest tests/`.
## End-to-end pipeline
The simplest way to run everything is the Makefile at the project
root: ``make`` runs the full chain, ``make rl`` rebuilds whatever is
needed up to that stage, and so on for the other stages. The individual
stages below are spelled out for cases where you want to tune a single step.
```bash
# 1. Sim demos with the active-scan + Strömbom teacher under LiDAR
#    perception. K=4 frame stack so the MLP has temporal context.
python -m training.bc.collect --teacher strombom \
    --out training/bc/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4

# 2. Behaviour-clone.
python -m training.bc.pretrain --demos training/bc/demos.npz \
    --out training/runs/bc --epochs 60 --net-arch 512,512

# 3. KL-regularised PPO fine-tune of bc.
python -m training.rl.train \
    --bc training/runs/bc --out training/runs/rl \
    --total-timesteps 1000000

# 4. Multi-seed eval (env-side, fast).
python -m training.eval --policy training/runs/rl \
    --max-flock 10 --max-steps 15000 --n-seeds 10
```
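Step 1's `--frame-stack 4` gives the MLP temporal context by feeding it several recent observations at once. A minimal sketch of that idea, assuming flat per-step observations: the `FrameStack` class and the choice to pad the window with the first observation on reset are hypothetical and may not match what `herding_env.py` does.

```python
from collections import deque


class FrameStack:
    """Concatenate the last k flat observations, oldest first (hypothetical helper)."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Assumption: fill the whole window with the first observation.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(list(obs))
        return self.stacked()

    def step(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        # Flatten the window into one vector of length k * len(obs).
        return [x for frame in self.frames for x in frame]


stack = FrameStack(k=4)
first = stack.reset([0.0, 1.0])
second = stack.step([2.0, 3.0])
```

With a 2-value observation and K=4, the policy input has 8 values; after one step the newest frame occupies the last two slots.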
`bc/pretrain.py` saves the snapshot with the **best `val_cos`**, not the
final epoch: multi-modal teachers make training noisy, and the last
epoch is often worse than an earlier one.
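The snapshot selection can be sketched as below, assuming `val_cos` is a higher-is-better validation score logged once per epoch. The function name and the scores are made up for illustration; they are not taken from `bc/pretrain.py` or from any run of this project.

```python
def best_epoch(val_cos_history):
    """Return (epoch_index, score) of the highest validation score."""
    best_idx, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_cos_history):
        # Strict ">" keeps the earliest epoch on ties.
        if score > best_score:
            best_idx, best_score = epoch, score
    return best_idx, best_score


# Made-up scores for illustration only; not results from this project.
history = [0.61, 0.74, 0.79, 0.77, 0.78]
epoch, score = best_epoch(history)
```

Here the final epoch (index 4) is not the best one, so saving only the last weights would discard the better epoch-2 snapshot.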
`rl/train.py` loads BC weights into both a trainable policy and a
frozen reference, fixes `log_std` small, and adds `β · KL(π‖π_ref)` to
the loss so the policy can only move within a trust region around BC.
See the file header for hyperparameter rationale.
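To make the penalty term concrete, here is a small sketch under a simplifying assumption: both the policy and the frozen reference are diagonal Gaussians that share the same fixed `log_std`, in which case the KL reduces to a scaled squared distance between their action means. The function names and the numeric values of `beta` and `log_std` are illustrative, not the ones used in `rl/train.py`.

```python
import math


def gaussian_kl_same_std(mu, mu_ref, log_std):
    """KL(pi || pi_ref) for diagonal Gaussians sharing one fixed log_std."""
    var = math.exp(2.0 * log_std)
    # With equal variances the KL is sum((mu - mu_ref)^2) / (2 * var).
    return sum((m - r) ** 2 for m, r in zip(mu, mu_ref)) / (2.0 * var)


def regularised_loss(ppo_loss, mu, mu_ref, log_std, beta):
    # Total loss = PPO surrogate loss + beta * KL penalty towards the BC reference.
    return ppo_loss + beta * gaussian_kl_same_std(mu, mu_ref, log_std)


# Illustrative numbers only (beta and log_std here are assumptions).
kl = gaussian_kl_same_std(mu=[0.5, -0.2], mu_ref=[0.4, -0.2], log_std=-1.0)
loss = regularised_loss(ppo_loss=0.3, mu=[0.5, -0.2], mu_ref=[0.4, -0.2],
                        log_std=-1.0, beta=0.1)
```

Because the variance sits in the denominator, a small fixed `log_std` makes even modest drifts of the mean away from the BC action expensive, which is what keeps the fine-tuned policy close to BC.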
## Available analytic teachers
| Name | What it does | Notes |
|---|---|---|
| `strombom` | Strömbom 2014: collect when the flock is scattered, drive the CoM otherwise | Default; works for n=1–10 under tight cohesion |
| `sequential` | Pick the sheep closest to the pen and drive only it | Alternative; needs loose-cohesion regime |
Both are wrapped at demo-collection time in
`herding/control/active_scan.py:ActiveScanTeacher`, which adds an
opening in-place rotation, a walk-to-centre when the LiDAR sees
nothing, and near-sheep speed modulation (the same modulation that
`herding/control/modulation.py` applies to every dog mode at
inference).
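Underneath that wrapper, the collect/drive switch of the `strombom` teacher can be sketched as follows. This follows the commonly cited form of the Strömbom 2014 heuristic (collect when the furthest sheep is more than `r_a * n^(2/3)` from the centre of mass, otherwise drive from `r_a * sqrt(n)` behind it). The function name, the `r_a` default, and the 2-D tuple representation are assumptions; the project's teacher may differ in constants and details.

```python
import math


def strombom_target(sheep, goal, r_a=1.0):
    """Return (mode, point) the dog should steer towards.

    sheep: list of (x, y); goal: (x, y); r_a: sheep-sheep interaction radius.
    """
    n = len(sheep)
    cx = sum(x for x, _ in sheep) / n
    cy = sum(y for _, y in sheep) / n
    # Furthest sheep from the flock's centre of mass (CoM).
    far = max(sheep, key=lambda s: math.hypot(s[0] - cx, s[1] - cy))
    far_dist = math.hypot(far[0] - cx, far[1] - cy)

    if far_dist > r_a * n ** (2.0 / 3.0):
        # Collect: stand r_a behind the stray, on the side away from the CoM.
        ux, uy = (far[0] - cx) / far_dist, (far[1] - cy) / far_dist
        return "collect", (far[0] + r_a * ux, far[1] + r_a * uy)

    # Drive: stand r_a * sqrt(n) behind the CoM, on the side away from the goal.
    gd = math.hypot(cx - goal[0], cy - goal[1])
    if gd == 0.0:
        return "drive", (cx, cy)  # CoM already at the goal; nothing to push.
    ux, uy = (cx - goal[0]) / gd, (cy - goal[1]) / gd
    off = r_a * math.sqrt(n)
    return "drive", (cx + off * ux, cy + off * uy)


tight = strombom_target([(0, 0), (1, 0), (0, 1), (1, 1)], goal=(10, 0.5))
scattered = strombom_target([(0, 0), (1, 0), (0, 1), (9, 9)], goal=(10, 0.5))
```

The compact flock yields a `drive` point on the far side of the CoM from the goal, while the flock with one stray yields a `collect` point just beyond that stray.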
## Evaluating analytic teachers directly
```bash
python -m training.eval --policy strombom --max-flock 10 --max-steps 15000 --n-seeds 10
python -m training.eval --policy sequential --max-flock 10 --max-steps 15000 --n-seeds 10
```