Checkpoint 4

Johnny Fernandes
2026-05-11 00:42:52 +01:00
parent 2a6db038df
commit 6688325d89
26 changed files with 2018 additions and 503 deletions
+89 -54
@@ -3,16 +3,39 @@
Group G25 — *Diogo Costa, Johnny Fernandes, Nelson Neto*
A differential-drive shepherd dog that herds 1–10 sheep through a 3 m
gate into an external pen. The dog has three modes:
gate into an external pen. The dog has three deployable modes:
| Mode | Source | Notes |
| Mode | Source | Role |
|---|---|---|
| `rl` | Behavior cloning of an analytic teacher | The deliverable RL policy |
| `strombom` | Strömbom (2014) collect/drive heuristic | Canonical baseline |
| `sequential` | Single-target "pin and push" | Robust across n=1–10 |
| `strombom` | Strömbom et al. (2014) collect/drive heuristic | Analytic baseline |
| `bc` | Behaviour cloning of the Strömbom teacher | Imitation learning result |
| `rl` | KL-regularised PPO fine-tune of `bc` | Reward-driven refinement |
Plus three documented experimental teachers (`hybrid`, `drive_only`,
`strombom_smooth`) — see `herding/` for details.
`sequential` (single-target pin-and-push) is kept as an alternative
analytic baseline. `dagger` is a data-collection mode, not deployment.
## Perception
The dog perceives sheep **only through its front-mounted 140° LiDAR**
(180 rays, 12 m max range — see `protos/ShepherdDog.proto`). Each
control step:
1. Read `lidar.getRangeImage()`,
2. Cluster returns into world-frame `(x, y)` estimates
(`herding/lidar_perception.py`),
3. Fold them into a multi-target tracker that maintains last-seen
positions for sheep currently outside the FOV
(`herding/sheep_tracker.py`).
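Steps 1 and 2 above could look roughly like the following. This is a minimal sketch, not the contents of `herding/lidar_perception.py`: the function names, the evenly spaced left-to-right ray ordering, and the 0.5 m clustering gap are assumptions made for illustration. Only the 140° FOV, 180 rays, and 12 m range come from this README.

```python
import math

FOV_DEG = 140.0   # from the README
NUM_RAYS = 180    # from the README
MAX_RANGE = 12.0  # metres, from the README


def scan_to_world_points(ranges, dog_x, dog_y, dog_heading):
    """Convert a range image into world-frame (x, y) hit points.

    Assumption: ray 0 is the leftmost ray (+FOV/2) and rays are evenly
    spaced down to -FOV/2. The real sensor ordering may differ.
    """
    fov = math.radians(FOV_DEG)
    n = len(ranges)
    points = []
    for i, r in enumerate(ranges):
        if not math.isfinite(r) or r >= MAX_RANGE:
            continue  # no return on this ray
        bearing = fov / 2.0 - i * fov / (n - 1)
        angle = dog_heading + bearing
        points.append((dog_x + r * math.cos(angle),
                       dog_y + r * math.sin(angle)))
    return points


def cluster_points(points, gap=0.5):
    """Group consecutive hits closer than `gap` metres; return centroids.

    `gap` is a hypothetical threshold, not a value from the repository.
    """
    clusters, current = [], []
    for p in points:
        if current and math.dist(p, current[-1]) > gap:
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]
```

A call such as `cluster_points(scan_to_world_points(lidar_ranges, x, y, heading))` would yield one centroid per contiguous group of returns. The sketch does not separate sheep from walls or gate posts, which the real module's filters presumably handle.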
The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
prior receiver-based one, so Strömbom, Sequential, and the BC obs
builder all run unchanged on top of it. The 2D Gymnasium env
(`herding/lidar_sim.py`) raycasts sheep discs at training time, so
demos collected in the env match the perception the deployed
controller sees in Webots.
Privileged ground-truth perception is available for ablation —
`HerdingEnv(use_lidar=False)`.
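The "last-seen position" behaviour of the tracker can be illustrated with a small nearest-neighbour sketch. The class name, the `match_radius` gate, and the `sheep_<k>` naming scheme are hypothetical; the actual `herding/sheep_tracker.py` may associate detections differently.

```python
import math


class SimpleTracker:
    """Nearest-neighbour tracker that remembers last-seen positions."""

    def __init__(self, match_radius=1.0):
        self.match_radius = match_radius  # hypothetical gating distance (m)
        self.tracks = {}                  # name -> (x, y)
        self._next_id = 0

    def update(self, detections):
        """Fold one frame of (x, y) detections into the track table."""
        unmatched = set(self.tracks)
        for det in detections:
            best, best_d = None, self.match_radius
            for name in unmatched:
                d = math.dist(det, self.tracks[name])
                if d <= best_d:
                    best, best_d = name, d
            if best is None:
                # Nothing close enough: start a new track.
                best = f"sheep_{self._next_id}"
                self._next_id += 1
            else:
                unmatched.discard(best)
            self.tracks[best] = det
        # Tracks still in `unmatched` were not seen this frame and simply
        # keep their last-seen position.
        return dict(self.tracks)
```

The returned dict has the `{name: (x, y)}` shape described above, so a sheep that leaves the FOV stays in the output at its last known position until it is seen again.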
## Quick start
@@ -23,20 +46,30 @@ pip install -r training/requirements.txt
# 2. Smoke test
python -m training.parity_test
# 3. Reproduce the BC policy from scratch (~25 min on CPU)
python -m tools.collect_demos --teacher strombom --out training/demos.npz \
--seeds-per-n 30 --subsample 3
python -m training.bc_pretrain --demos training/demos.npz \
--out training/runs/bc_flock --epochs 100 --net-arch 512,512
# 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
python -m tools.collect_demos --teacher strombom \
--out training/demos_v3.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
python -m training.bc_pretrain --demos training/demos_v3.npz \
--out training/runs/bc_v3 --epochs 60 --net-arch 512,512
# 4. Evaluate
python -m training.eval --policy training/runs/bc_flock \
--max-flock 10 --max-steps 30000 --n-seeds 5
# 4. Optional: DAgger from inside Webots if sim-trained doesn't transfer
tools/auto_dagger.sh 3 60
python -m tools.dagger_merge_train --out training/runs/bc_dagger
# 5. Run in Webots (any of the three modes; n is the flock size)
HERDING_POLICY_DIR=$PWD/training/runs/bc_flock tools/run_webots.sh 10 rl
tools/run_webots.sh 10 strombom
tools/run_webots.sh 10 sequential
# 5. Evaluate (env)
python -m training.eval --policy training/runs/bc_v3 \
--max-flock 10 --max-steps 8000 --n-seeds 5
# 6. Optional RL fine-tune of the BC policy (~40 min on CPU, 1 M steps)
python -m training.train_ppo \
--bc training/runs/bc_v3 \
--out training/runs/rl_v1 \
--total-timesteps 1000000
# 7. Run in Webots
tools/run_webots.sh 10 bc # behaviour-cloned MLP
tools/run_webots.sh 10 rl # KL-PPO fine-tune
tools/run_webots.sh 10 strombom # analytic baseline
```
## Layout
@@ -46,69 +79,71 @@ herding/ — single source of truth (env + Webots both import)
geometry.py — field/pen constants, robot specs
flocking_sim.py — Reynolds-style sheep dynamics
diffdrive.py — differential-drive kinematics
control.py — shared near-sheep speed-modulation helper
obs.py — 32-D order-invariant observation builder
strombom.py — canonical CoM-drive teacher
sequential.py — single-target "pin-and-push" teacher
hybrid.py — flock-then-funnel (experimental, did not scale)
drive_only.py — Strömbom drive without collect (experimental)
strombom_smooth.py — sigmoid-blended Strömbom (experimental)
active_scan.py — wraps a base teacher with opening rotation +
walk-to-centre + speed modulation
lidar_sim.py — fast 2D raycast for the env (sheep + walls + posts)
lidar_perception.py — scan → world-frame cluster centroids + filters
sheep_tracker.py — multi-target NN tracker with FOV memory
controllers/
sheep/sheep.py — Webots sheep controller (uses herding.flocking_sim)
shepherd_dog/
shepherd_dog.py — Webots dog controller, mode-switched
policy_loader.py — lazy SB3 PPO loader
strombom.py — backwards-compat shim
policy_loader.py — lazy SB3 policy loader (auto-detects frame stack)
training/
herding_env.py — Gymnasium env (used for demo collection + eval)
bc_pretrain.py — supervised BC of analytic teachers into MLP policy
collect_demos.py — wrapper, see tools/
eval.py — RL / analytic comparison harness
parity_test.py — smoke tests
train_ppo.py — PPO/RL fine-tune (experimental, BC alone preferred)
herding_env.py — Gymnasium env (LiDAR + tracker by default)
bc_pretrain.py — supervised BC of (obs, action) demos into MLP
eval.py — analytic + BC policy comparison harness
parity_test.py — shape / determinism smoke test
runs/ — checkpoints (whitelisted in .gitignore)
requirements.txt
configs/ppo_default.yaml
tools/
collect_demos.py — generate (obs, action) demonstrations
run_webots.sh — launch Webots with N sheep + chosen controller mode
collect_demos.py — sim demos via the active-scan teacher
dagger_merge_train.py — merge Webots-collected DAgger demos and retrain
run_webots.sh — launch Webots with N sheep + chosen mode
auto_dagger.sh — headless DAgger collection across many runs
worlds/
field.wbt — main world (3 m gate, external pen)
protos/ — Sheep / ShepherdDog robot definitions
docs/project.md — original project goals
plan.md — design notes / decision log
```
## Two cohesion regimes
## Shared low-level control
Sheep cohesion strength controls which teacher works:
Every dog mode (RL, Strömbom, Sequential, the DAgger teacher) routes
its action through `herding/control.py:modulate_speed_near_sheep`,
which scales action magnitude down when within ~2.5 m of the nearest
tracked sheep. This stops the dog from charging in at full speed and
scattering the flock. Direction (intent) is preserved.
| Regime | `flocking_sim.py` setting | Strömbom | Sequential |
|---|---|---:|---:|
| **Tight** (current) | `w=3.0/1.0`, `dist=12` | works (flock-style) | breaks (cohesion fights single-sheep targeting) |
| Loose | `w=1.5/0.6`, `dist=8` | breaks (flock fragments at gate) | works (1-by-1 style) |
All modes also share the same EMA action smoother in
`controllers/shepherd_dog/shepherd_dog.py:ACTION_SMOOTH = 0.55`.
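As a rough illustration of these two shared stages, the sketch below scales an action down near sheep and then applies an exponential moving average. It is not the code in `herding/control.py`. The linear ramp, the `MIN_SCALE` floor, the 2-tuple action, and the choice to let `ACTION_SMOOTH` weight the previous action are all assumptions; only the ~2.5 m radius and the 0.55 constant are taken from the text above.

```python
import math

SLOW_RADIUS = 2.5     # metres, approximate value from the README
MIN_SCALE = 0.3       # hypothetical floor on the speed scale
ACTION_SMOOTH = 0.55  # from the README


def modulate_speed(action, dog_xy, sheep_xy_list):
    """Shrink the action magnitude near the closest tracked sheep.

    Every component is multiplied by the same factor, so the direction
    (the ratio between components) is preserved.
    """
    if not sheep_xy_list:
        return tuple(action)
    nearest = min(math.dist(dog_xy, s) for s in sheep_xy_list)
    if nearest >= SLOW_RADIUS:
        return tuple(action)
    scale = MIN_SCALE + (1.0 - MIN_SCALE) * nearest / SLOW_RADIUS
    return tuple(scale * a for a in action)


def ema_smooth(prev, new, alpha=ACTION_SMOOTH):
    """EMA of actions; assumes `alpha` weights the previous action."""
    return tuple(alpha * p + (1.0 - alpha) * n for p, n in zip(prev, new))
```

In a control loop the two would be chained, for example `smoothed = ema_smooth(smoothed, modulate_speed(raw, dog_xy, tracked))`, so that each mode's raw output passes through the same damping before reaching the wheels.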
The codebase ships with the **tight** regime. To use the loose-regime
Sequential clone, edit those constants in `herding/flocking_sim.py` and
load `training/runs/bc_solo/`.
## Webots results (steps to all-penned, fast mode)
## Results
Single seed per cell using `worlds/field.wbt` defaults. All modes hit
100 % pen rate; numbers shown are time-to-all-penned in simulation
steps (16 ms each).
Eval at `--max-steps 30000 --n-seeds 5`, deployment difficulty (full
field spawn distribution):
| n | Strömbom | Sequential | BC-flock (RL) |
| n | Strömbom | `bc` | `rl` (KL-PPO of `bc`) |
|---:|---:|---:|---:|
| 1 | 100 % | 100 % | 100 % |
| 5 | 100 % | 100 % | 80–100 % |
| 8 | 100 % | 100 % | 80 % |
| 10 | **100 %** | 80 % | **80 %** (mean_penned 8/10) |
| 3 | 5 800 | 9 800 | **4 800** |
| 5 | 10 200 | 9 200 | 9 800 |
| 8 | 14 000 | 17 600 | **15 400** |
| 10 | 18 600 | 19 600 | **12 000** |
The BC policy reaches ~80 % of the analytic teacher's success rate using
pure neural-network inference, with no hand-coded logic.
The RL fine-tune needs **39 % fewer steps than `bc` at n=10** and **51 %
fewer at n=3**, which suggests the KL-anchored PPO finds reward-driven
improvements over the BC imitation baseline rather than collapsing back
to it. The gain is not uniform: at n=5 `rl` is slightly slower than `bc`
(9 800 vs 9 200 steps).
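The percentages can be checked directly from the step counts in the table. The snippet below is only a worked calculation: the helper name `reduction_pct` is made up here, and "faster" is read as "fewer simulation steps to pen all sheep".

```python
# Steps to all-penned, copied from the `bc` and `rl` columns of the table.
steps = {
    3: {"bc": 9800, "rl": 4800},
    5: {"bc": 9200, "rl": 9800},
    8: {"bc": 17600, "rl": 15400},
    10: {"bc": 19600, "rl": 12000},
}


def reduction_pct(bc_steps, rl_steps):
    """Percentage of steps saved by `rl` relative to `bc` (negative = slower)."""
    return 100.0 * (bc_steps - rl_steps) / bc_steps


for n, row in steps.items():
    print(f"n={n}: {reduction_pct(row['bc'], row['rl']):+.1f} % steps saved")
```

For n=3 this gives (9 800 − 4 800) / 9 800 ≈ 51 %, and for n=10 it gives (19 600 − 12 000) / 19 600 ≈ 39 %.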
## License