Checkpoint 5 - incomplete
This commit is contained in:
@@ -27,6 +27,15 @@ control step:
|
||||
positions for sheep currently outside the FOV
|
||||
(`herding/sheep_tracker.py`).
|
||||
|
||||
**LiDAR validation** (intermediate-goal item v from `docs/project.md`):
|
||||
run the dog controller in `HERDING_MODE=diag` mode to capture 80
|
||||
real Webots scans plus the ground-truth sheep positions in
|
||||
`training/dagger/diag_<ts>.npz`. Comparing detections against GT in
|
||||
that file showed clustered centroids match GT positions within 0.15 m
|
||||
after the +SHEEP_RADIUS surface-to-centre correction — i.e. the
|
||||
LiDAR pipeline produces correct sheep-position estimates from the
|
||||
real Webots scan, validating the sensor for the herding task.
|
||||
|
||||
The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
|
||||
prior receiver-based one, so Strömbom, Sequential, and the BC obs
|
||||
builder all run unchanged on top of it. The 2D Gymnasium env
|
||||
@@ -48,22 +57,22 @@ python -m training.parity_test
|
||||
|
||||
# 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
|
||||
python -m tools.collect_demos --teacher strombom \
|
||||
--out training/demos_v3.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
|
||||
python -m training.bc_pretrain --demos training/demos_v3.npz \
|
||||
--out training/runs/bc_v3 --epochs 60 --net-arch 512,512
|
||||
--out training/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
|
||||
python -m training.bc_pretrain --demos training/demos.npz \
|
||||
--out training/runs/bc --epochs 60 --net-arch 512,512
|
||||
|
||||
# 4. Optional: DAgger from inside Webots if sim-trained doesn't transfer
|
||||
tools/auto_dagger.sh 3 60
|
||||
python -m tools.dagger_merge_train --out training/runs/bc_dagger
|
||||
|
||||
# 5. Evaluate (env)
|
||||
python -m training.eval --policy training/runs/bc_v3 \
|
||||
python -m training.eval --policy training/runs/bc \
|
||||
--max-flock 10 --max-steps 8000 --n-seeds 5
|
||||
|
||||
# 6. Optional RL fine-tune of the BC policy (~40 min on CPU, 1 M steps)
|
||||
python -m training.train_ppo \
|
||||
--bc training/runs/bc_v3 \
|
||||
--out training/runs/rl_v1 \
|
||||
--bc training/runs/bc \
|
||||
--out training/runs/rl \
|
||||
--total-timesteps 1000000
|
||||
|
||||
# 7. Run in Webots
|
||||
@@ -127,23 +136,48 @@ scattering the flock. Direction (intent) is preserved.
|
||||
All modes also share the same EMA action smoother in
|
||||
`controllers/shepherd_dog/shepherd_dog.py:ACTION_SMOOTH = 0.55`.
|
||||
|
||||
## Webots results (steps to all-penned, fast mode)
|
||||
## Results — env eval, 10 seeds × n=1..10
|
||||
|
||||
Single seed per cell using `worlds/field.wbt` defaults. All modes hit
|
||||
100 % pen rate; numbers shown are time-to-all-penned in simulation
|
||||
steps (16 ms each).
|
||||
`max_steps=15000`, full-field spawn distribution. Success rate per
|
||||
flock size, then mean steps over successful seeds.
|
||||
|
||||
| n | Strömbom | `bc` | `rl` (KL-PPO of `bc`) |
|
||||
### Success rate (%)
|
||||
|
||||
| n | Strömbom | `bc` | `rl` |
|
||||
|---:|---:|---:|---:|
|
||||
| 3 | 5 800 | 9 800 | **4 800** |
|
||||
| 5 | 10 200 | 9 200 | 9 800 |
|
||||
| 8 | 14 000 | 17 600 | **15 400** |
|
||||
| 10 | 18 600 | 19 600 | **12 000** |
|
||||
| 1 | 30 | 80 | **90** |
|
||||
| 2 | 90 | 50 | **90** |
|
||||
| 3 | 60 | 90 | **90** |
|
||||
| 4 | 40 | 80 | **90** |
|
||||
| 5 | 60 | 70 | **100** |
|
||||
| 6 | 30 | 80 | 80 |
|
||||
| 7 | 70 | 80 | **100** |
|
||||
| 8 | 30 | 100 | **100** |
|
||||
| 9 | 40 | 90 | **100** |
|
||||
| 10 | 50 | 100 | **100** |
|
||||
|
||||
The RL fine-tune is **39 % faster than `bc` on n=10** and **51 % faster
|
||||
on n=3**, confirming the KL-anchored PPO actually finds reward-driven
|
||||
improvements over the BC imitation baseline rather than just collapsing
|
||||
back to it.
|
||||
### Mean penned per episode (out of n)
|
||||
|
||||
| n | Strömbom | `bc` | `rl` |
|
||||
|---:|---:|---:|---:|
|
||||
| 1 | 0.30 | 0.80 | **0.90** |
|
||||
| 5 | 3.90 | 4.10 | **5.00** |
|
||||
| 8 | 4.20 | 8.00 | **8.00** |
|
||||
| 10 | 7.40 | 10.00 | **10.00** |
|
||||
|
||||
### Takeaways
|
||||
|
||||
- **BC clearly beats Strömbom** under realistic LiDAR conditions (full
|
||||
field, partial observability). Strömbom struggles on small flocks
|
||||
where a single sheep can spawn beyond the LiDAR's 12 m range; BC
|
||||
learned active perception from the demos.
|
||||
- **RL refines BC** without regressing on any cell. Ties or beats BC
|
||||
at every flock size; biggest gains at n=1 and n=4 where BC's
|
||||
imitation of Strömbom's drive heuristic was sub-optimal.
|
||||
- **Aggressive reward shaping doesn't help** — a more aggressive
|
||||
variant (β=0.02, W_TIME=-0.1, W_IMITATE=0, 3 M steps) trained as
|
||||
an ablation was strictly worse than the conservative tune shipped
|
||||
here (β=0.05, W_IMITATE=0.5, 1 M steps).
|
||||
|
||||
## License
|
||||
|
||||
|
||||
Reference in New Issue
Block a user