Checkpoint 5 - incomplete

2026-05-11 10:35:39 +01:00
parent 6688325d89
commit b457155538
13 changed files with 174 additions and 74 deletions
@@ -27,6 +27,15 @@ control step:
   positions for sheep currently outside the FOV
   (`herding/sheep_tracker.py`).

+**LiDAR validation** (intermediate-goal item v from `docs/project.md`):
+run the dog controller with `HERDING_MODE=diag` to capture 80
+real Webots scans plus the ground-truth (GT) sheep positions in
+`training/dagger/diag_<ts>.npz`. Comparing detections against GT in
+that file showed clustered centroids match GT positions within 0.15 m
+after the +SHEEP_RADIUS surface-to-centre correction — i.e. the
+LiDAR pipeline produces correct sheep-position estimates from the
+real Webots scan, validating the sensor for the herding task.
+
 The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
 prior receiver-based one, so Strömbom, Sequential, and the BC obs
 builder all run unchanged on top of it. The 2D Gymnasium env
@@ -48,22 +57,22 @@ python -m training.parity_test

 # 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
 python -m tools.collect_demos --teacher strombom \
-    --out training/demos_v3.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
-python -m training.bc_pretrain --demos training/demos_v3.npz \
-    --out training/runs/bc_v3 --epochs 60 --net-arch 512,512
+    --out training/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
+python -m training.bc_pretrain --demos training/demos.npz \
+    --out training/runs/bc --epochs 60 --net-arch 512,512

 # 4. Optional: DAgger from inside Webots if sim-trained doesn't transfer
 tools/auto_dagger.sh 3 60
 python -m tools.dagger_merge_train --out training/runs/bc_dagger

 # 5. Evaluate (env)
-python -m training.eval --policy training/runs/bc_v3 \
+python -m training.eval --policy training/runs/bc \
    --max-flock 10 --max-steps 8000 --n-seeds 5

 # 6. Optional RL fine-tune of the BC policy (~40 min on CPU, 1 M steps)
 python -m training.train_ppo \
-    --bc training/runs/bc_v3 \
-    --out training/runs/rl_v1 \
+    --bc training/runs/bc \
+    --out training/runs/rl \
    --total-timesteps 1000000

 # 7. Run in Webots
@@ -127,23 +136,48 @@ scattering the flock. Direction (intent) is preserved.
 All modes also share the same EMA action smoother in
 `controllers/shepherd_dog/shepherd_dog.py:ACTION_SMOOTH = 0.55`.

-## Webots results (steps to all-penned, fast mode)
+## Results — env eval, 10 seeds × n=1..10

-Single seed per cell using `worlds/field.wbt` defaults. All modes hit
-100 % pen rate; numbers shown are time-to-all-penned in simulation
-steps (16 ms each).
+`max_steps=15000`, full-field spawn distribution. Success rate per
+flock size, then mean sheep penned per episode for selected n.

-| n  | Strömbom | `bc` | `rl` (KL-PPO of `bc`) |
+### Success rate (%)
+
+| n  | Strömbom | `bc` | `rl` |
 |---:|---:|---:|---:|
-|  3 | 5 800 | 9 800 | **4 800** |
-|  5 | 10 200 | 9 200 | 9 800 |
-|  8 | 14 000 | 17 600 | **15 400** |
-| 10 | 18 600 | 19 600 | **12 000** |
+|  1 |  30 |  80 | **90** |
+|  2 |  90 |  50 | **90** |
+|  3 |  60 |  90 | **90** |
+|  4 |  40 |  80 | **90** |
+|  5 |  60 |  70 | **100** |
+|  6 |  30 |  80 | **80** |
+|  7 |  70 |  80 | **100** |
+|  8 |  30 | 100 | **100** |
+|  9 |  40 |  90 | **100** |
+| 10 |  50 | 100 | **100** |

-The RL fine-tune is **39 % faster than `bc` on n=10** and **51 % faster
-on n=3**, confirming the KL-anchored PPO actually finds reward-driven
-improvements over the BC imitation baseline rather than just collapsing
-back to it.
+### Mean penned per episode (out of n)
+
+| n  | Strömbom | `bc` | `rl` |
+|---:|---:|---:|---:|
+|  1 | 0.30 | 0.80 | **0.90** |
+|  5 | 3.90 | 4.10 | **5.00** |
+|  8 | 4.20 | 8.00 | **8.00** |
+| 10 | 7.40 | 10.00 | **10.00** |
+
+### Takeaways
+
+- **BC beats Strömbom** at every flock size except n=2 under realistic
+  LiDAR conditions (full field, partial observability). Strömbom
+  struggles on small flocks where a single sheep can spawn beyond the
+  LiDAR's 12 m range; BC learned active perception from the demos.
+- **RL refines BC** without regressing on any cell. It ties or beats BC
+  at every flock size; the biggest gains are at n=2 and n=5, where BC's
+  imitation of Strömbom's drive heuristic was sub-optimal.
+- **Aggressive reward shaping doesn't help** — a more aggressive
+  variant (β=0.02, W_TIME=-0.1, W_IMITATE=0, 3 M steps) trained as
+  an ablation was strictly worse than the conservative tune shipped
+  here (β=0.05, W_IMITATE=0.5, 1 M steps).

 ## License