
# Shepherd Herding — Training & Inference

This directory holds the Gymnasium environment, PPO training script, and
evaluation harness for the RL shepherd-dog policy. The Webots controller
in `controllers/shepherd_dog/` loads the resulting policy at inference
time when launched with `HERDING_MODE=rl`.

## Layout

```
training/
├── herding_env.py        # gymnasium.Env — the dog is the agent
├── train_ppo.py          # SB3 PPO entry point (vec envs, eval, curriculum)
├── eval.py               # rollout success-rate / time-to-pen across flock sizes
├── parity_test.py        # smoke test: shapes, determinism, baseline rollout
├── configs/ppo_default.yaml
├── runs/                 # tensorboard + checkpoints (gitignored)
└── requirements.txt
```

## Setup

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r training/requirements.txt
```

CPU is the default and recommended device. SB3's PPO with an MLP policy
of this size runs faster on CPU than on GPU because the bottleneck is
rollout collection, not gradient computation. The 16 `SubprocVecEnv`
workers saturate roughly 16 CPU cores. To force CUDA anyway, pass
`--device cuda`.

## Train

```bash
# Full curriculum (1 → 10 sheep), ~5M steps.
# The ~2–3h estimate is for a single GPU; the default device is CPU.
python -m training.train_ppo \
    --config training/configs/ppo_default.yaml \
    --out-dir training/runs/baseline
```

Outputs:
- `training/runs/baseline/best/best_model.zip` — best eval checkpoint
- `training/runs/baseline/best/vecnormalize.pkl` — observation stats
- `training/runs/baseline/checkpoints/ppo_*.zip` — periodic checkpoints
- `training/runs/baseline/tb/` — TensorBoard logs (`tensorboard --logdir`)
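
The model zip and the observation statistics have to be loaded together, since the policy was trained on normalized observations. A minimal sketch, assuming the `best/` layout listed above and the standard SB3 `PPO.load` and `VecNormalize.load` calls. The helper names `artifact_paths` and `load_policy` are hypothetical and not part of this repo, and `eval.py` or the Webots controller may load the policy differently:

```python
from pathlib import Path


def artifact_paths(best_dir):
    """Return (model_zip, vecnormalize_pkl) for a `best/` directory."""
    best = Path(best_dir)
    return best / "best_model.zip", best / "vecnormalize.pkl"


def load_policy(best_dir, venv, device="cpu"):
    """Load the PPO model and wrap `venv` with the saved observation stats.

    `venv` is a VecEnv built by the caller, e.g. a DummyVecEnv around HerdingEnv.
    """
    # Imported lazily so the path helper works without SB3 installed.
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import VecNormalize

    model_zip, stats_pkl = artifact_paths(best_dir)
    venv = VecNormalize.load(str(stats_pkl), venv)
    venv.training = False     # freeze the running statistics at inference
    venv.norm_reward = False  # report raw, unnormalized rewards
    model = PPO.load(str(model_zip), env=venv, device=device)
    return model, venv
```

Freezing `training` and `norm_reward` keeps evaluation from updating the statistics collected during training.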

To resume:

```bash
python -m training.train_ppo --resume training/runs/baseline/checkpoints/ppo_500000_steps.zip
```

## Evaluate

```bash
# RL policy
python -m training.eval --policy training/runs/baseline/best

# Strömbom baseline
python -m training.eval --policy strombom
```

Prints success rate, mean steps, and mean penned-count per flock size.
Use the same `--n-seeds` for both to get a fair RL-vs-Strömbom A/B.
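
The three reported metrics can be aggregated per flock size as sketched below. This is an illustration only: the tuple layout and the `summarize` helper are assumptions, not the actual internals of `eval.py`.

```python
from collections import defaultdict
from statistics import mean


def summarize(episodes):
    """Aggregate (n_sheep, success, steps, penned) tuples per flock size."""
    by_size = defaultdict(list)
    for n_sheep, success, steps, penned in episodes:
        by_size[n_sheep].append((success, steps, penned))
    return {
        n: {
            "success_rate": mean(1.0 if s else 0.0 for s, _, _ in rows),
            "mean_steps": mean(st for _, st, _ in rows),
            "mean_penned": mean(p for _, _, p in rows),
        }
        for n, rows in sorted(by_size.items())
    }
```

Running both policies over the same seed list and passing each result set through the same aggregation keeps the comparison paired.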

## Parity / smoke test

```bash
python -m training.parity_test
```

Checks observation/action shapes, deterministic seeding, the curriculum
sampler, and a 400-step Strömbom rollout. Run this before every long
training job; it catches the boring class of bugs in seconds.
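
The deterministic-seeding check boils down to running two fresh environments with the same seed and actions and comparing the traces. A minimal sketch of that idea, assuming only the Gymnasium `reset(seed=...)` / `step(action)` signatures; `ToyEnv` is a hypothetical stand-in so the example runs without the project code, and `parity_test.py` may implement the check differently:

```python
import random


class ToyEnv:
    """Hypothetical stand-in that follows the Gymnasium reset/step signatures."""

    def reset(self, seed=None):
        self._rng = random.Random(seed)
        self._t = 0
        return [self._rng.random(), self._rng.random()], {}

    def step(self, action):
        self._t += 1
        obs = [self._rng.random() + action, self._rng.random()]
        terminated = self._t >= 5
        return obs, -1.0, terminated, False, {}


def rollout(env, seed, actions):
    """Collect the observation sequence for a fixed seed and action list."""
    obs, _ = env.reset(seed=seed)
    trace = [obs]
    for a in actions:
        obs, _, terminated, truncated, _ = env.step(a)
        trace.append(obs)
        if terminated or truncated:
            break
    return trace


def is_deterministic(make_env, seed=0, actions=(0.1, -0.2, 0.3)):
    """Two fresh envs with the same seed and actions must give the same trace."""
    return rollout(make_env(), seed, actions) == rollout(make_env(), seed, actions)
```

With the real environment, `make_env` would be a callable that builds a `HerdingEnv`.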

## Run the policy in Webots

1. Train (above) — produces `training/runs/<name>/best/`.
2. Set the dog controller's environment variables in the shell that launches Webots:

   ```bash
   export HERDING_MODE=rl
   export HERDING_POLICY_DIR=$(pwd)/training/runs/baseline/best
   webots worlds/field.wbt
   ```

   Or set them via Webots' controller args / a `.wbproj` if you prefer.

3. To force the Strömbom baseline (same world, same controller):

   ```bash
   export HERDING_MODE=strombom
   webots worlds/field.wbt
   ```

If `HERDING_MODE=rl` but the policy can't be loaded (SB3 not installed,
zip missing, etc.), the controller logs the error and falls back to
Strömbom automatically.
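
The fallback behaviour can be sketched as follows. This is an illustration under assumptions, not the controller's actual code: `select_mode` and the injected `load_policy` callable are hypothetical names, and treating an unset `HERDING_MODE` as Strömbom is also an assumption.

```python
import logging
import os

log = logging.getLogger("shepherd_dog")


def select_mode(load_policy, env=None):
    """Return (mode, policy), falling back to Strömbom if the RL policy fails to load."""
    env = os.environ if env is None else env
    mode = env.get("HERDING_MODE", "strombom")  # default value is an assumption
    if mode != "rl":
        return "strombom", None
    try:
        policy = load_policy(env.get("HERDING_POLICY_DIR", ""))
        return "rl", policy
    except Exception as exc:  # e.g. ImportError (no SB3) or FileNotFoundError (no zip)
        log.error("RL policy load failed, falling back to Strömbom: %s", exc)
        return "strombom", None
```

Catching the broad `Exception` is deliberate here: any load failure should leave the dog with a working controller rather than crash the simulation.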

## Curriculum knobs

The default schedule in `configs/ppo_default.yaml` widens
`max_n_sheep` over training. Each reset samples
`n_sheep ~ U[1, max_n_sheep]`, so the final policy has seen every flock
size from 1 to 10. Smaller flocks are sampled more often overall because
they are in range from the first stage. To pin a specific size,
instantiate the env with `HerdingEnv(n_sheep=N)` (see `eval.py`).
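
A widening curriculum of this kind can be sketched as a step-indexed cap plus a uniform integer draw. The stage boundaries in `SCHEDULE` and both function names are hypothetical; the real schedule is defined in `configs/ppo_default.yaml`.

```python
import random


def max_n_sheep_at(step, schedule):
    """Return the cap of the latest stage whose start step has been reached.

    `schedule` is a list of (start_step, max_n_sheep) pairs sorted by start_step.
    """
    cap = schedule[0][1]
    for start, value in schedule:
        if step >= start:
            cap = value
    return cap


def sample_n_sheep(step, schedule, rng):
    """Draw n_sheep uniformly from the integers 1..max_n_sheep (inclusive)."""
    return rng.randint(1, max_n_sheep_at(step, schedule))


# Hypothetical stage boundaries, for illustration only.
SCHEDULE = [(0, 1), (1_000_000, 3), (2_500_000, 6), (4_000_000, 10)]
```

Calling `sample_n_sheep(step, SCHEDULE, random.Random(seed))` at each reset yields only single-sheep episodes during the first stage and the full 1 to 10 range in the last.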

## Reward shaping

Weights live in class attributes on `HerdingEnv`. Start tuning on the
1-sheep stage of the curriculum: if the dog can't herd a single sheep
cleanly, raising `W_PROGRESS` or lowering `W_TIME` is usually the fix.
For multi-sheep failure modes (the dog spins between sheep), increase
`W_COMPACT` so that tightening the flock pays off.
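
To make the roles of the three weights concrete, here is one way such a shaped reward could be combined. Only the names `W_PROGRESS`, `W_TIME`, and `W_COMPACT` come from this README; the term definitions, the numeric values, and the `shaped_reward` helper are assumptions and may not match `HerdingEnv`.

```python
class RewardWeights:
    """Placeholder values, not the defaults used by HerdingEnv."""

    W_PROGRESS = 1.0
    W_TIME = 0.01
    W_COMPACT = 0.1


def shaped_reward(prev_dist, dist, prev_spread, spread, w=RewardWeights):
    """Combine per-step terms into a scalar reward.

    prev_dist/dist: flock-centroid distance to the pen before/after the step.
    prev_spread/spread: mean sheep distance from the centroid before/after.
    """
    progress = prev_dist - dist        # > 0 when the flock moved toward the pen
    compaction = prev_spread - spread  # > 0 when the flock tightened
    return w.W_PROGRESS * progress + w.W_COMPACT * compaction - w.W_TIME
```

Under this form, raising `W_PROGRESS` or lowering `W_TIME` makes slow but steady herding more attractive, while a larger `W_COMPACT` rewards gathering the flock before driving it.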