Checkpoint 4

2026-05-11 00:42:52 +01:00
parent 2a6db038df
commit 6688325d89
26 changed files with 2018 additions and 503 deletions
@@ -3,16 +3,39 @@
 Group G25 — *Diogo Costa, Johnny Fernandes, Nelson Neto*

 A differential-drive shepherd dog that herds 1–10 sheep through a 3 m
-gate into an external pen. The dog has three modes:
+gate into an external pen. The dog has three deployable modes:

-| Mode | Source | Notes |
+| Mode | Source | Role |
 |---|---|---|
-| `rl` | Behavior cloning of an analytic teacher | The deliverable RL policy |
-| `strombom` | Strömbom (2014) collect/drive heuristic | Canonical baseline |
-| `sequential` | Single-target "pin and push" | Robust across n=1–10 |
+| `strombom` | Strömbom et al. (2014) collect/drive heuristic | Analytic baseline |
+| `bc` | Behaviour cloning of the Strömbom teacher | Imitation learning result |
+| `rl` | KL-regularised PPO fine-tune of `bc` | Reward-driven refinement |

-Plus three documented experimental teachers (`hybrid`, `drive_only`,
-`strombom_smooth`) — see `herding/` for details.
+`sequential` (single-target pin-and-push) is kept as an alternative
+analytic baseline. `dagger` is a data-collection mode, not a deployment mode.
+
+## Perception
+
+The dog perceives sheep **only through its front-mounted 140° LiDAR**
+(180 rays, 12 m max range — see `protos/ShepherdDog.proto`). Each
+control step:
+
+1. Read `lidar.getRangeImage()`,
+2. Cluster returns into world-frame `(x, y)` estimates
+   (`herding/lidar_perception.py`),
+3. Fold them into a multi-target tracker that maintains last-seen
+   positions for sheep currently outside the FOV
+   (`herding/sheep_tracker.py`).
+
+The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
+prior receiver-based one, so Strömbom, Sequential, and the BC obs
+builder all run unchanged on top of it. The 2D Gymnasium env
+(`herding/lidar_sim.py`) raycasts sheep discs at training time, so
+demos collected in the env match the perception the deployed
+controller sees in Webots.
+
+Privileged ground-truth perception is available for ablation —
+`HerdingEnv(use_lidar=False)`.

 ## Quick start

@@ -23,20 +46,30 @@ pip install -r training/requirements.txt
 # 2. Smoke test
 python -m training.parity_test

-# 3. Reproduce the BC policy from scratch (~25 min on CPU)
-python -m tools.collect_demos --teacher strombom --out training/demos.npz \
-    --seeds-per-n 30 --subsample 3
-python -m training.bc_pretrain --demos training/demos.npz \
-    --out training/runs/bc_flock --epochs 100 --net-arch 512,512
+# 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
+python -m tools.collect_demos --teacher strombom \
+    --out training/demos_v3.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
+python -m training.bc_pretrain --demos training/demos_v3.npz \
+    --out training/runs/bc_v3 --epochs 60 --net-arch 512,512

-# 4. Evaluate
-python -m training.eval --policy training/runs/bc_flock \
-    --max-flock 10 --max-steps 30000 --n-seeds 5
+# 4. Optional: DAgger from inside Webots if the sim-trained policy doesn't transfer
+tools/auto_dagger.sh 3 60
+python -m tools.dagger_merge_train --out training/runs/bc_dagger

-# 5. Run in Webots (any of the three modes; n is the flock size)
-HERDING_POLICY_DIR=$PWD/training/runs/bc_flock tools/run_webots.sh 10 rl
-tools/run_webots.sh 10 strombom
-tools/run_webots.sh 10 sequential
+# 5. Evaluate (env)
+python -m training.eval --policy training/runs/bc_v3 \
+    --max-flock 10 --max-steps 8000 --n-seeds 5
+
+# 6. Optional RL fine-tune of the BC policy (~40 min on CPU, 1 M steps)
+python -m training.train_ppo \
+    --bc training/runs/bc_v3 \
+    --out training/runs/rl_v1 \
+    --total-timesteps 1000000
+
+# 7. Run in Webots
+tools/run_webots.sh 10 bc          # behaviour-cloned MLP
+tools/run_webots.sh 10 rl          # KL-PPO fine-tune
+tools/run_webots.sh 10 strombom    # analytic baseline
 ```

 ## Layout
@@ -46,69 +79,71 @@ herding/                  — single source of truth (env + Webots both import)
  geometry.py             — field/pen constants, robot specs
  flocking_sim.py         — Reynolds-style sheep dynamics
  diffdrive.py            — differential-drive kinematics
+  control.py              — shared near-sheep speed-modulation helper
  obs.py                  — 32-D order-invariant observation builder
  strombom.py             — canonical CoM-drive teacher
  sequential.py           — single-target "pin-and-push" teacher
-  hybrid.py               — flock-then-funnel (experimental, did not scale)
-  drive_only.py           — Strömbom drive without collect (experimental)
-  strombom_smooth.py      — sigmoid-blended Strömbom (experimental)
+  active_scan.py          — wraps a base teacher with opening rotation +
+                            walk-to-centre + speed modulation
+  lidar_sim.py            — fast 2D raycast for the env (sheep + walls + posts)
+  lidar_perception.py     — scan → world-frame cluster centroids + filters
+  sheep_tracker.py        — multi-target NN tracker with FOV memory

 controllers/
  sheep/sheep.py          — Webots sheep controller (uses herding.flocking_sim)
  shepherd_dog/
    shepherd_dog.py       — Webots dog controller, mode-switched
-    policy_loader.py      — lazy SB3 PPO loader
-    strombom.py           — backwards-compat shim
+    policy_loader.py      — lazy SB3 policy loader (auto-detects frame stack)

 training/
-  herding_env.py          — Gymnasium env (used for demo collection + eval)
-  bc_pretrain.py          — supervised BC of analytic teachers into MLP policy
-  collect_demos.py — wrapper, see tools/
-  eval.py                 — RL / analytic comparison harness
-  parity_test.py          — smoke tests
-  train_ppo.py            — PPO/RL fine-tune (experimental, BC alone preferred)
+  herding_env.py          — Gymnasium env (LiDAR + tracker by default)
+  bc_pretrain.py          — supervised BC of (obs, action) demos into MLP
+  eval.py                 — analytic + BC policy comparison harness
+  parity_test.py          — shape / determinism smoke test
+  runs/                   — checkpoints (whitelisted in .gitignore)
  requirements.txt
-  configs/ppo_default.yaml

 tools/
-  collect_demos.py        — generate (obs, action) demonstrations
-  run_webots.sh           — launch Webots with N sheep + chosen controller mode
+  collect_demos.py        — sim demos via the active-scan teacher
+  dagger_merge_train.py   — merge Webots-collected DAgger demos and retrain
+  run_webots.sh           — launch Webots with N sheep + chosen mode
+  auto_dagger.sh          — headless DAgger collection across many runs

 worlds/
  field.wbt               — main world (3 m gate, external pen)

 protos/                   — Sheep / ShepherdDog robot definitions
 docs/project.md           — original project goals
-plan.md                   — design notes / decision log
 ```

-## Two cohesion regimes
+## Shared low-level control

-Sheep cohesion strength controls which teacher works:
+Every dog mode (RL, Strömbom, Sequential, the DAgger teacher) routes
+its action through `herding/control.py:modulate_speed_near_sheep`,
+which scales action magnitude down when within ~2.5 m of the nearest
+tracked sheep. This stops the dog from charging in at full speed and
+scattering the flock. Direction (intent) is preserved.

-| Regime | `flocking_sim.py` setting | Strömbom | Sequential |
-|---|---|---:|---:|
-| **Tight** (current) | `w=3.0/1.0`, `dist=12` | works (flock-style) | breaks (cohesion fights single-sheep targeting) |
-| Loose | `w=1.5/0.6`, `dist=8` | breaks (flock fragments at gate) | works (1-by-1 style) |
+All modes also share the same EMA action smoother, `ACTION_SMOOTH = 0.55`
+in `controllers/shepherd_dog/shepherd_dog.py`.

-The codebase ships with the **tight** regime. To use the loose-regime
-Sequential clone, edit those constants in `herding/flocking_sim.py` and
-load `training/runs/bc_solo/`.
+## Webots results (steps to all-penned, fast mode)

-## Results
+Single seed per cell using `worlds/field.wbt` defaults. All modes hit
+100 % pen rate; cells show time-to-all-penned in simulation steps
+(16 ms each). Bold marks cells where `rl` beats `bc`.

-Eval at `--max-steps 30000 --n-seeds 5`, deployment difficulty (full
-field spawn distribution):
-
-| n | Strömbom | Sequential | BC-flock (RL) |
+| n  | Strömbom | `bc` | `rl` (KL-PPO of `bc`) |
 |---:|---:|---:|---:|
-| 1 | 100 % | 100 % | 100 % |
-| 5 | 100 % | 100 % | 80–100 % |
-| 8 | 100 % | 100 % | 80 % |
-| 10 | **100 %** | 80 % | **80 %** (mean_penned 8/10) |
+|  3 | 5 800 | 9 800 | **4 800** |
+|  5 | 10 200 | 9 200 | 9 800 |
+|  8 | 14 000 | 17 600 | **15 400** |
+| 10 | 18 600 | 19 600 | **12 000** |

-The BC policy hits ~80 % of the analytic teacher's success rate in 100 %
-neural-network inference, with no hand-coded logic.
+The RL fine-tune needs **39 % fewer steps than `bc` on n=10** and **51 %
+fewer on n=3** (but about 7 % more on n=5). With a single seed per cell,
+this suggests rather than confirms that the KL-anchored PPO finds
+reward-driven improvements over `bc` instead of collapsing back to it.

 ## License