TIR_PROJ/training/README.md
Johnny Fernandes 10c01a938e Drop versioning vocabulary, polish docstrings, fix world-aware policy resolution
User-facing pass after the project was decided to be a single
submission with no inner iterations.

* Remove every "v1"/"v2"/"versioning" reference from the docs:
  - README mecanum section trims the "v1 predates the rewrite" prose
    in favour of a self-contained retrain recipe.
  - The 3.2 GB `training/runs/v1_clean/` backup directory is deleted.
* Refresh control-layer docstrings:
  - `sheep_tracker.py` header now describes the three actual pipeline
    stages (consensus, prediction, pen latching) instead of layering
    the consensus stage on top of a stale "predictive mode" preamble.
  - `controllers/shepherd_dog/shepherd_dog.py` mode list is
    up-to-date — adds `universal`, removes outdated single-policy
    default paths, mentions `HERDING_USE_GT=1` as the perception
    ablation.
* Refresh training command examples:
  - `training/bc/collect.py` and `training/bc/pretrain.py` usage
    snippets show the world-suffixed paths the Makefile actually
    uses; the `--out` arg is now required so old "demos.npz"
    invocations error loudly instead of silently overwriting.
  - `training/README.md` rewritten — drops the legacy `runs/bc`
    diagram, documents the per-(drive, world) pipeline, and adds
    the mecanum retraining caveat.
* Fix policy-directory resolution end-to-end:
  - `tools/run_webots.sh` now tries
    `training/runs/{bc,rl}_<drive>_<world>` first, then the drive-
    only path, then the bare-mode legacy path — matching the actual
    on-disk layout. Previously it looked for `bc_<drive>` (no
    world) and silently fell back to `bc`, masking the world
    selection.
  - `controllers/shepherd_dog/shepherd_dog.py:_resolve_policy_dir`
    has the same fix plus a latent NameError unmasked: it referenced
    `DRIVE_MODE` before that variable was set at module load. The
    block is restructured so MODE/DRIVE_MODE/WORLD are resolved
    first, then the function uses them as explicit arguments.

126 pytest cases still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 01:50:54 +00:00

# Training and evaluation details
Command-level companion to the root README. Covers demo collection,
behaviour cloning, PPO fine-tuning, and evaluation flags; use the root
README for the high-level architecture and Webots quick start.
The pipeline runs two strictly sequential training stages (behaviour
cloning, then PPO fine-tuning) per `(drive, world)` combo:
```
sim demos (universal teacher on tracker output, K=4 frame stack)
        │
        ▼
bc/pretrain.py ──► runs/bc_<drive>_<world>   (MLP)
        │
        ▼  KL-regularised PPO fine-tune
runs/rl_<drive>_<world>   (deployed `rl` mode)
```
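
The `bc_<drive>_<world>` and `rl_<drive>_<world>` naming in the diagram can be sketched as follows. This is a hypothetical helper, not code from the repo: `run_dir`, `all_combos`, and the `DRIVES`/`WORLDS` tuples are illustrative names, and the two-by-two grid is an assumption inferred from the "all four combos" Makefile target and the drive and world names used in this README.

```python
from itertools import product

# Assumed grid: the drive modes and worlds that appear in this README.
DRIVES = ("differential", "mecanum")
WORLDS = ("field", "field_round")


def run_dir(stage: str, drive: str, world: str, root: str = "training/runs") -> str:
    """Return the checkpoint directory for one stage of one (drive, world) combo."""
    if stage not in ("bc", "rl"):
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{root}/{stage}_{drive}_{world}"


def all_combos() -> list[tuple[str, str]]:
    """Enumerate every (drive, world) pair in the assumed grid."""
    return list(product(DRIVES, WORLDS))


if __name__ == "__main__":
    for drive, world in all_combos():
        # BC output feeds the RL stage, so both paths share the same suffix.
        print(run_dir("bc", drive, world), "->", run_dir("rl", drive, world))
```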
## Files
```
herding_env.py — Gymnasium env (LiDAR raycast + tracker by default)
bc/collect.py — universal-teacher sim demos
bc/pretrain.py — MSE + cosine BC of (obs, action) demos into MlpPolicy
rl/train.py — KL-regularised PPO fine-tune of a BC checkpoint
rl/train_lstm.py — RecurrentPPO variant (ablation)
eval.py — multi-seed analytic / learned policy comparison
runs/ — checkpoints (gitignored except for policy.zip)
```
Unit + integration tests live in the top-level `tests/`. Run with
`make test` or `python -m pytest tests/`.
## End-to-end pipeline
The simplest way to train one combo is the project-root Makefile:
```bash
make DRIVE=differential WORLD=field # demos → bc → rl → eval
make DRIVE=differential WORLD=field_round
make train_all # all four combos sequentially
```
The individual stages below are kept explicit for cases where you
want to tune a single step.
```bash
# 1. Sim demos with the active-scan + universal teacher under LiDAR
#    perception. K=4 frame stack so the MLP has temporal context.
python -m training.bc.collect \
    --teacher universal --drive-mode differential --world field \
    --out training/bc/demos_differential_field.npz \
    --seeds-per-n 15 --subsample 3 --frame-stack 4

# 2. Behaviour-clone the demos.
python -m training.bc.pretrain \
    --demos training/bc/demos_differential_field.npz \
    --out training/runs/bc_differential_field \
    --epochs 60 --net-arch 512,512

# 3. KL-regularised PPO fine-tune of the BC checkpoint.
python -m training.rl.train \
    --bc training/runs/bc_differential_field \
    --out training/runs/rl_differential_field \
    --drive-mode differential --world field \
    --total-timesteps 1000000

# 4. Multi-seed eval (env-side, fast).
python -m training.eval --policy training/runs/rl_differential_field \
    --drive-mode differential --world field \
    --max-flock 10 --max-steps 15000 --n-seeds 10
```
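
The `--frame-stack 4` option in step 1 gives the feed-forward MLP a short history of observations. A minimal sketch of the idea is shown below, assuming observations are flat lists of floats and that the stack is padded with copies of the first observation at reset; the `FrameStack` class and its padding rule are illustrative assumptions, not the implementation in `herding_env.py`.

```python
from collections import deque


class FrameStack:
    """Keep the last k observations and expose them as one flat vector."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame is dropped automatically

    def reset(self, obs):
        # Assumed padding rule: repeat the first observation k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(list(obs))
        return self.stacked()

    def step(self, obs):
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        # Concatenate oldest-to-newest, so the newest frame sits at the end.
        return [value for frame in self.frames for value in frame]
```

With `k=4` and a 2-element observation, the policy input has 8 elements, and each `step` shifts the oldest frame out.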
`bc/pretrain.py` saves the snapshot with the best validation cosine
score (`val_cos`), not the final epoch: multi-modal teachers make
training noisy, and the last epoch is often worse than an earlier one.
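
The best-snapshot rule can be sketched as below. This is an assumption-laden illustration rather than the code in `bc/pretrain.py`: `mean_cosine` and `select_best_epoch` are hypothetical names, and the score is assumed to be the mean cosine similarity between predicted and teacher actions on a held-out split.

```python
import math


def mean_cosine(pred, target, eps: float = 1e-8) -> float:
    """Mean cosine similarity between paired action vectors."""
    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        total += dot / (norm + eps)  # eps guards against zero-length actions
    return total / len(pred)


def select_best_epoch(val_cos_per_epoch):
    """Return (epoch, score) of the highest validation cosine seen so far."""
    best_epoch, best_cos = -1, -math.inf
    for epoch, cos in enumerate(val_cos_per_epoch):
        if cos > best_cos:
            # A real training loop would write the checkpoint here.
            best_epoch, best_cos = epoch, cos
    return best_epoch, best_cos
```

For a noisy curve such as `[0.70, 0.91, 0.88]`, the rule keeps epoch 1 even though training continued past it.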
`rl/train.py` loads the BC weights into both a trainable policy and a
frozen reference, fixes `log_std` at a small value, and adds
`β · KL(π‖π_ref)` to the loss so the policy can only move within a
trust region around BC. See the file header for hyperparameter rationale.
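
The penalty term can be sketched for diagonal Gaussian policies as below. This is a minimal illustration under stated assumptions, not the code in `rl/train.py`: the function names are hypothetical, and the closed-form KL for diagonal Gaussians is assumed to apply to the action distribution.

```python
import math


def diag_gaussian_kl(mu, log_std, mu_ref, log_std_ref) -> float:
    """KL(pi || pi_ref) for two diagonal Gaussians, summed over action dims."""
    kl = 0.0
    for m, ls, mr, lsr in zip(mu, log_std, mu_ref, log_std_ref):
        var, var_ref = math.exp(2.0 * ls), math.exp(2.0 * lsr)
        kl += (lsr - ls) + (var + (m - mr) ** 2) / (2.0 * var_ref) - 0.5
    return kl


def regularised_loss(ppo_loss: float, kl: float, beta: float) -> float:
    """Add the beta-weighted KL penalty to the usual PPO loss."""
    return ppo_loss + beta * kl
```

When both policies share the same fixed `log_std`, the KL reduces to the squared distance between the means divided by `2σ²`, so a small `log_std` makes any drift from the BC means expensive.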
## Mecanum retraining
For mecanum runs, pass `--use-webots-preset`. With the flag set, both
`collect.py` and `train.py` detect `--drive-mode mecanum` and switch to
the `HERDING_MEC_WEBOTS` preset, which matches the strafe efficiency
(~0.4) and forward bleed (~0.28) of the physical-roller Webots proto.
A policy trained without this preset herds in the textbook gym mecanum
model but not in Webots.
## Analytic teachers
| Name | What it does | Notes |
|---|---|---|
| `strombom` | Strömbom 2014 — collect when flock is scattered, drive CoM otherwise | Round-world aware (radially-inward fallback when natural target lies outside the curved boundary) |
| `sequential` | Three-phase: collect, drive, then single-target push for the last 1–2 stragglers | Alternative to `strombom` |
| `universal` | Strömbom core + mecanum omega + last-straggler recovery | Used as the BC demo teacher |
All three are wrapped at demo-collection time in
`herding/control/active_scan.py:ActiveScanTeacher`, which adds an
opening in-place rotation, a walk to the centre when the LiDAR sees
nothing, and near-sheep speed modulation (the same modulation that
`herding/control/modulation.py` applies to every dog mode at
inference).
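
The collect/drive switch at the core of `strombom` can be sketched from the published heuristic as below. This follows the rule described in Strömbom et al. 2014 (collect when any sheep is farther than `r_a · N^(2/3)` from the flock's centre of mass, drive otherwise); it is not the repo's implementation, and `strombom_target`, its arguments, and the default `r_a` are illustrative assumptions.

```python
import math


def strombom_target(sheep, goal, r_a: float = 2.0):
    """Return (mode, dog_target) for a list of (x, y) sheep positions."""
    n = len(sheep)
    cx = sum(x for x, _ in sheep) / n
    cy = sum(y for _, y in sheep) / n

    # Farthest sheep from the centre of mass (CoM).
    far = max(sheep, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
    far_dist = math.hypot(far[0] - cx, far[1] - cy)

    if far_dist > r_a * n ** (2.0 / 3.0):
        # Collect: stand r_a behind the straggler, on the side away from the CoM.
        ux, uy = (far[0] - cx) / far_dist, (far[1] - cy) / far_dist
        return "collect", (far[0] + r_a * ux, far[1] + r_a * uy)

    # Drive: stand r_a * sqrt(n) behind the CoM, on the side away from the goal.
    dx, dy = cx - goal[0], cy - goal[1]
    dist = math.hypot(dx, dy) or 1.0  # avoid division by zero at the goal
    offset = r_a * math.sqrt(n)
    return "drive", (cx + offset * dx / dist, cy + offset * dy / dist)
```

The round-world fallback and the mecanum and straggler extensions listed in the table are deliberately left out of this sketch.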
## Evaluating analytic teachers directly
```bash
python -m training.eval --policy strombom \
--drive-mode differential --world field \
--max-flock 10 --max-steps 15000 --n-seeds 10
python -m training.eval --policy sequential \
--drive-mode differential --world field_round \
--max-flock 10 --max-steps 15000 --n-seeds 10
```