Checkpoint 5 - incomplete

Johnny Fernandes
2026-05-11 10:35:39 +01:00
parent 6688325d89
commit b457155538
13 changed files with 174 additions and 74 deletions
+53 -19
@@ -27,6 +27,15 @@ control step:
positions for sheep currently outside the FOV
(`herding/sheep_tracker.py`).
**LiDAR validation** (intermediate-goal item v from `docs/project.md`):
running the dog controller with `HERDING_MODE=diag` captures 80
real Webots scans plus the ground-truth (GT) sheep positions in
`training/dagger/diag_<ts>.npz`. Comparing detections against GT in
that file showed that clustered centroids match GT positions to within
0.15 m after the +SHEEP_RADIUS surface-to-centre correction. The
LiDAR pipeline therefore produces correct sheep-position estimates
from real Webots scans, which validates the sensor for the herding task.
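The +SHEEP_RADIUS correction can be illustrated with a minimal sketch. It assumes each cluster is summarised by a world-frame bearing and a range to the sheep's surface; the function names and the radius value are hypothetical and are not taken from `herding/sheep_tracker.py`.

```python
import math

SHEEP_RADIUS = 0.25  # hypothetical value, for illustration only


def surface_to_centre(dog_xy, bearing, rng, radius=SHEEP_RADIUS):
    """Estimate a sheep's centre from a LiDAR hit on its surface.

    The beam stops at the near surface, so the centre lies roughly one
    radius further along the same ray.
    """
    d = rng + radius
    return (dog_xy[0] + d * math.cos(bearing),
            dog_xy[1] + d * math.sin(bearing))


def nearest_gt_errors(detections, ground_truth):
    """Distance from each detection to its nearest ground-truth position.

    `ground_truth` is a {name: (x, y)} dict, the same shape the tracker emits.
    """
    return [min(math.dist(det, gt) for gt in ground_truth.values())
            for det in detections]
```

With a check like this, the validation criterion reads as `max(nearest_gt_errors(...)) <= 0.15`.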
The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
prior receiver-based one, so Strömbom, Sequential, and the BC obs
builder all run unchanged on top of it. The 2D Gymnasium env
@@ -48,22 +57,22 @@ python -m training.parity_test
# 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
python -m tools.collect_demos --teacher strombom \
--out training/demos_v3.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
python -m training.bc_pretrain --demos training/demos_v3.npz \
--out training/runs/bc_v3 --epochs 60 --net-arch 512,512
--out training/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
python -m training.bc_pretrain --demos training/demos.npz \
--out training/runs/bc --epochs 60 --net-arch 512,512
# 4. Optional: DAgger from inside Webots if sim-trained doesn't transfer
tools/auto_dagger.sh 3 60
python -m tools.dagger_merge_train --out training/runs/bc_dagger
# 5. Evaluate (env)
python -m training.eval --policy training/runs/bc_v3 \
python -m training.eval --policy training/runs/bc \
--max-flock 10 --max-steps 8000 --n-seeds 5
# 6. Optional RL fine-tune of the BC policy (~40 min on CPU, 1 M steps)
python -m training.train_ppo \
--bc training/runs/bc_v3 \
--out training/runs/rl_v1 \
--bc training/runs/bc \
--out training/runs/rl \
--total-timesteps 1000000
# 7. Run in Webots
@@ -127,23 +136,48 @@ scattering the flock. Direction (intent) is preserved.
All modes also share the same EMA action smoother in
`controllers/shepherd_dog/shepherd_dog.py:ACTION_SMOOTH = 0.55`.
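As a rough sketch of what an EMA action smoother does: the class below assumes `ACTION_SMOOTH` weights the previous smoothed action, which is one common convention; the actual controller may use the opposite weighting. The class name is hypothetical.

```python
ACTION_SMOOTH = 0.55  # value quoted above; its role as the "previous" weight is an assumption


class EmaSmoother:
    """Exponential moving average over successive action vectors."""

    def __init__(self, alpha=ACTION_SMOOTH):
        self.alpha = alpha
        self.prev = None

    def __call__(self, action):
        if self.prev is None:
            # First call: nothing to blend with, pass the action through.
            self.prev = tuple(action)
        else:
            self.prev = tuple(self.alpha * p + (1.0 - self.alpha) * a
                              for p, a in zip(self.prev, action))
        return self.prev
```

Under this convention a larger `alpha` gives smoother but laggier motion.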
## Webots results (steps to all-penned, fast mode)
## Results — env eval, 10 seeds × n=1..10
Single seed per cell using `worlds/field.wbt` defaults. All modes hit
100 % pen rate; numbers shown are time-to-all-penned in simulation
steps (16 ms each).
`max_steps=15000`, full-field spawn distribution. Success rate per
flock size, then mean number of sheep penned per episode.
| n | Strömbom | `bc` | `rl` (KL-PPO of `bc`) |
### Success rate (%)
| n | Strömbom | `bc` | `rl` |
|---:|---:|---:|---:|
| 3 | 5 800 | 9 800 | **4 800** |
| 5 | 10 200 | 9 200 | 9 800 |
| 8 | 14 000 | 17 600 | **15 400** |
| 10 | 18 600 | 19 600 | **12 000** |
| 1 | 30 | 80 | **90** |
| 2 | 90 | 50 | **90** |
| 3 | 60 | 90 | **90** |
| 4 | 40 | 80 | **90** |
| 5 | 60 | 70 | **100** |
| 6 | 30 | 80 | 80 |
| 7 | 70 | 80 | **100** |
| 8 | 30 | 100 | **100** |
| 9 | 40 | 90 | **100** |
| 10 | 50 | 100 | **100** |
The RL fine-tune is **39 % faster than `bc` on n=10** and **51 % faster
on n=3**, confirming the KL-anchored PPO actually finds reward-driven
improvements over the BC imitation baseline rather than just collapsing
back to it.
### Mean penned per episode (out of n)
| n | Strömbom | `bc` | `rl` |
|---:|---:|---:|---:|
| 1 | 0.30 | 0.80 | **0.90** |
| 5 | 3.90 | 4.10 | **5.00** |
| 8 | 4.20 | 8.00 | **8.00** |
| 10 | 7.40 | 10.00 | **10.00** |
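The two tables above report a success rate and a mean penned count per flock size. A minimal sketch of how such figures could be aggregated from per-episode results is shown below; the `(penned, steps)` record layout and the function name are assumptions and do not reflect the actual output of `training.eval`.

```python
from statistics import mean


def summarise(episodes, n):
    """Aggregate per-seed results for one flock size.

    `episodes` is a list of (penned_count, steps) tuples; an episode
    counts as a success only if all n sheep were penned.
    """
    success_steps = [steps for penned, steps in episodes if penned == n]
    return {
        "success_rate_pct": 100.0 * len(success_steps) / len(episodes),
        "mean_penned": mean(penned for penned, _ in episodes),
        "mean_steps_success": mean(success_steps) if success_steps else None,
    }
```

For n=1 the mean penned count is simply the success rate divided by 100, which gives a quick consistency check between the two tables.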
### Takeaways
- **BC beats Strömbom at every flock size except n=2** (50 % vs 90 %)
under realistic LiDAR conditions (full field, partial observability).
Strömbom struggles on small flocks where a single sheep can spawn
beyond the LiDAR's 12 m range; BC learned active perception from the
demos.
- **RL refines BC** without regressing on any cell. It ties or beats BC
at every flock size; the biggest success-rate gains are at n=2
(+40 points) and n=5 (+30 points), where BC's imitation of Strömbom's
drive heuristic was sub-optimal.
- **Aggressive reward shaping doesn't help.** A variant with stronger
shaping (β=0.02, W_TIME=-0.1, W_IMITATE=0, 3 M steps), trained as
an ablation, was strictly worse than the conservative tune shipped
here (β=0.05, W_IMITATE=0.5, 1 M steps).
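
The β values above are consistent with a KL-penalty coefficient, although the text does not define β. As a hedged sketch of what KL-anchored fine-tuning usually looks like, assuming diagonal-Gaussian policies and β as the penalty weight (neither is confirmed by this README), the anchor term could be computed as follows. Both function names are hypothetical.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q)
    )


def anchored_loss(ppo_loss, kl_to_bc, beta=0.05):
    """PPO loss plus a penalty for drifting away from the frozen BC policy.

    Treating beta as the KL weight is an assumption made for illustration.
    """
    return ppo_loss + beta * kl_to_bc
```

Under this reading, a smaller β lets the fine-tuned policy move further from the BC baseline, which matches the "more aggressive" description of the β=0.02 ablation.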
## License