# Autonomous Shepherd-Dog Herding (Webots + RL)

Group G25: *Diogo Costa, Johnny Fernandes, Nelson Neto*

A differential-drive shepherd dog that herds 1–10 sheep through a 3 m
gate into an external pen. The dog has three deployable modes:

| Mode | Source | Role |
|---|---|---|
| `strombom` | Strömbom et al. (2014) collect/drive heuristic | Analytic baseline |
| `bc` | Behaviour cloning of the Strömbom teacher | Imitation learning result |
| `rl` | KL-regularised PPO fine-tune of `bc` | Reward-driven refinement |

`sequential` (single-target pin-and-push) is kept as an alternative
analytic baseline. `dagger` is a data-collection mode, not deployment.

## Perception

The dog perceives sheep **only through its front-mounted 140° LiDAR**
(180 rays, 12 m max range; see `protos/ShepherdDog.proto`). Each
control step:

1. Read `lidar.getRangeImage()`,
2. Cluster returns into world-frame `(x, y)` estimates
   (`herding/lidar_perception.py`),
3. Fold them into a multi-target tracker that maintains last-seen
   positions for sheep currently outside the FOV
   (`herding/sheep_tracker.py`).

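Steps 1 and 2 can be sketched roughly as below. This is an illustrative sketch, not the code in `herding/lidar_perception.py`: the function name `scan_to_world`, the `SHEEP_RADIUS` and `CLUSTER_GAP` values, the left-to-right ray ordering, and the LiDAR sitting at the dog's centre are all assumptions. Only the 140° FOV, the 180 rays, and the 12 m range come from the text above.

```python
import math

FOV = math.radians(140.0)  # field of view, from the README
MAX_RANGE = 12.0           # metres, from the README
SHEEP_RADIUS = 0.3         # metres, assumed value for illustration
CLUSTER_GAP = 0.4          # metres, assumed: split clusters at larger jumps


def scan_to_world(ranges, dog_x, dog_y, dog_yaw):
    """Return one world-frame (x, y) sheep-centre estimate per cluster of hits."""
    n = len(ranges)
    clusters, current = [], []
    for i, r in enumerate(ranges):
        if not (math.isfinite(r) and r < MAX_RANGE):
            # A miss ends the cluster that was being built, if any.
            if current:
                clusters.append(current)
                current = []
            continue
        # Assumed ray order: index 0 is the leftmost ray, n-1 the rightmost.
        bearing = FOV / 2 - i * FOV / (n - 1)
        point = (r * math.cos(bearing), r * math.sin(bearing))
        if current and math.dist(point, current[-1][2]) > CLUSTER_GAP:
            clusters.append(current)
            current = []
        current.append((r, bearing, point))
    if current:
        clusters.append(current)

    estimates = []
    for cluster in clusters:
        # The closest hit lies close to the line from the sensor to the disc
        # centre, so stepping SHEEP_RADIUS further along that ray approximates
        # the centre (the surface-to-centre correction).
        r, bearing, _ = min(cluster, key=lambda hit: hit[0])
        cx = (r + SHEEP_RADIUS) * math.cos(bearing)
        cy = (r + SHEEP_RADIUS) * math.sin(bearing)
        # Rotate by the dog's yaw and translate into the world frame.
        wx = dog_x + cx * math.cos(dog_yaw) - cy * math.sin(dog_yaw)
        wy = dog_y + cx * math.sin(dog_yaw) + cy * math.cos(dog_yaw)
        estimates.append((wx, wy))
    return estimates
```

In the controller, `ranges` would be the list returned by `lidar.getRangeImage()`, and the dog pose would come from whatever localisation the controller uses. Walls and gate posts would also produce clusters, so a real pipeline needs additional filtering that this sketch leaves out.
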
**LiDAR validation** (intermediate-goal item v from `docs/project.md`):
run the dog controller with `HERDING_MODE=diag` to capture 80
real Webots scans plus the ground-truth (GT) sheep positions in
`training/dagger/diag_<ts>.npz`. Comparing detections against GT in
that file showed that clustered centroids match GT positions to within
0.15 m after the `+SHEEP_RADIUS` surface-to-centre correction. In other
words, the LiDAR pipeline produces correct sheep-position estimates from
real Webots scans, which validates the sensor for the herding task.

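The comparison amounts to a nearest-neighbour error check per scan. The sketch below shows the idea on made-up coordinates; the helper name `detection_errors` is hypothetical and the layout of the `.npz` file is not assumed here.

```python
import math

TOLERANCE = 0.15  # metres, the bound quoted above


def detection_errors(detections, ground_truth):
    """Distance from each detected centroid to its nearest ground-truth sheep."""
    return [min(math.dist(d, g) for g in ground_truth) for d in detections]


# Made-up example scan, not taken from a real diag file.
gt = [(2.0, 1.0), (5.0, -3.0)]
det = [(2.05, 0.95), (4.9, -3.05)]
errors = detection_errors(det, gt)
within_tolerance = max(errors) <= TOLERANCE
print(within_tolerance)
```
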
The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
prior receiver-based one, so Strömbom, Sequential, and the BC obs
builder all run unchanged on top of it. The 2D Gymnasium env
(`training/herding_env.py`) raycasts sheep discs at training time
using `herding/lidar_sim.py`, so demos collected in the env match the
perception the deployed controller sees in Webots.

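A minimal sketch of a nearest-neighbour tracker with last-seen memory is given below. It is not the implementation in `herding/sheep_tracker.py`: the class layout, the `sheep_<i>` naming scheme, the greedy matching, and the 1 m association gate are assumptions chosen for illustration.

```python
import math

GATE = 1.0  # metres, assumed maximum distance for matching a detection to a track


class SheepTracker:
    """Greedy nearest-neighbour tracker that remembers unseen sheep."""

    def __init__(self):
        self.tracks = {}   # name -> last known (x, y)
        self._next_id = 0

    def update(self, detections):
        unmatched = set(self.tracks)
        for det in detections:
            # Closest track that has not been claimed in this step.
            best = min(unmatched,
                       key=lambda name: math.dist(self.tracks[name], det),
                       default=None)
            if best is not None and math.dist(self.tracks[best], det) <= GATE:
                self.tracks[best] = det
                unmatched.discard(best)
            else:
                # No track close enough: start a new one.
                self.tracks[f"sheep_{self._next_id}"] = det
                self._next_id += 1
        # Tracks left in `unmatched` keep their last-seen position (FOV memory).
        return dict(self.tracks)
```

Because `update` returns a plain `{name: (x, y)}` dict, downstream consumers do not need to know whether a position is fresh or remembered.
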
Privileged ground-truth perception is available for ablation via
`HerdingEnv(use_lidar=False)`.

## Quick start

```bash
# 1. Set up the Python env (any venv with PyTorch + SB3)
pip install -r training/requirements.txt

# 2. Smoke test
python -m training.parity_test

# 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
python -m tools.collect_demos --teacher strombom \
    --out training/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
python -m training.bc_pretrain --demos training/demos.npz \
    --out training/runs/bc --epochs 60 --net-arch 512,512

# 4. Optional: DAgger from inside Webots if sim-trained doesn't transfer
tools/auto_dagger.sh 3 60
python -m tools.dagger_merge_train --out training/runs/bc_dagger

# 5. Evaluate (env)
python -m training.eval --policy training/runs/bc \
    --max-flock 10 --max-steps 8000 --n-seeds 5

# 6. Optional RL fine-tune of the BC policy (~40 min on CPU, 1 M steps)
python -m training.train_ppo \
    --bc training/runs/bc \
    --out training/runs/rl \
    --total-timesteps 1000000

# 7. Run in Webots
tools/run_webots.sh 10 bc        # behaviour-cloned MLP
tools/run_webots.sh 10 rl        # KL-PPO fine-tune
tools/run_webots.sh 10 strombom  # analytic baseline
```

## Layout

```
herding/                    - single source of truth (env + Webots both import)
  geometry.py               - field/pen constants, robot specs
  flocking_sim.py           - Reynolds-style sheep dynamics
  diffdrive.py              - differential-drive kinematics
  control.py                - shared near-sheep speed-modulation helper
  obs.py                    - 32-D order-invariant observation builder
  strombom.py               - canonical CoM-drive teacher
  sequential.py             - single-target "pin-and-push" teacher
  active_scan.py            - wraps a base teacher with opening rotation +
                              walk-to-centre + speed modulation
  lidar_sim.py              - fast 2D raycast for the env (sheep + walls + posts)
  lidar_perception.py       - scan → world-frame cluster centroids + filters
  sheep_tracker.py          - multi-target NN tracker with FOV memory

controllers/
  sheep/sheep.py            - Webots sheep controller (uses herding.flocking_sim)
  shepherd_dog/
    shepherd_dog.py         - Webots dog controller, mode-switched
    policy_loader.py        - lazy SB3 policy loader (auto-detects frame stack)

training/
  herding_env.py            - Gymnasium env (LiDAR + tracker by default)
  bc_pretrain.py            - supervised BC of (obs, action) demos into MLP
  eval.py                   - analytic + BC policy comparison harness
  parity_test.py            - shape / determinism smoke test
  runs/                     - checkpoints (whitelisted in .gitignore)
  requirements.txt

tools/
  collect_demos.py          - sim demos via the active-scan teacher
  dagger_merge_train.py     - merge Webots-collected DAgger demos and retrain
  run_webots.sh             - launch Webots with N sheep + chosen mode
  auto_dagger.sh            - headless DAgger collection across many runs

worlds/
  field.wbt                 - main world (3 m gate, external pen)

protos/                     - Sheep / ShepherdDog robot definitions
docs/project.md             - original project goals
```

## Shared low-level control

Every dog mode (RL, Strömbom, Sequential, the DAgger teacher) routes
its action through `herding/control.py:modulate_speed_near_sheep`,
which scales the action magnitude down when the dog is within ~2.5 m of
the nearest tracked sheep. This stops the dog from charging in at full
speed and scattering the flock. The action's direction (intent) is
preserved.

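One way such a helper could work is sketched below. The function name `scale_action_near_sheep`, the linear ramp, and the `MIN_SCALE` floor are assumptions, as is treating the action as a tuple of floats; only the ~2.5 m radius comes from the text. The actual `modulate_speed_near_sheep` may use a different profile.

```python
import math

SLOW_RADIUS = 2.5  # metres, the "~2.5 m" quoted above
MIN_SCALE = 0.3    # assumed lower bound on the speed scale


def scale_action_near_sheep(action, dog_xy, sheep_positions):
    """Shrink the action as the dog approaches the nearest tracked sheep.

    `sheep_positions` is a {name: (x, y)} dict like the tracker output.
    """
    if not sheep_positions:
        return tuple(action)
    nearest = min(math.dist(dog_xy, p) for p in sheep_positions.values())
    if nearest >= SLOW_RADIUS:
        return tuple(action)
    # Linear ramp: MIN_SCALE at contact, 1.0 at the edge of the slow zone.
    scale = MIN_SCALE + (1.0 - MIN_SCALE) * nearest / SLOW_RADIUS
    # Every component gets the same factor, so the direction is unchanged.
    return tuple(scale * a for a in action)
```
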
All modes also share the same EMA (exponential moving average) action
smoother, configured by `ACTION_SMOOTH = 0.55` in
`controllers/shepherd_dog/shepherd_dog.py`.

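For reference, an EMA action smoother can be written as follows. This sketch assumes that `ACTION_SMOOTH` is the weight on the previous smoothed action, which the text does not confirm, and the class name `ActionSmoother` is hypothetical.

```python
ACTION_SMOOTH = 0.55  # value from the README; its role as "weight on the past" is assumed


class ActionSmoother:
    """Exponential moving average over successive actions."""

    def __init__(self, alpha=ACTION_SMOOTH):
        self.alpha = alpha
        self.prev = None

    def __call__(self, action):
        if self.prev is None:
            # First action passes through unchanged.
            self.prev = tuple(action)
        else:
            self.prev = tuple(
                self.alpha * p + (1.0 - self.alpha) * a
                for p, a in zip(self.prev, action)
            )
        return self.prev
```

Under this assumption, a sudden switch from `(1.0, 0.0)` to `(0.0, 1.0)` yields roughly `(0.55, 0.45)` on the next step, so abrupt policy outputs are damped before they reach the wheels.
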
## Results: env eval, 10 seeds × n=1..10

`max_steps=15000`, full-field spawn distribution. Success rate per
flock size, then mean number of sheep penned per episode for selected
flock sizes.

### Success rate (%)

| n | Strömbom | `bc` | `rl` |
|---:|---:|---:|---:|
| 1 | 30 | 80 | **90** |
| 2 | 90 | 50 | **90** |
| 3 | 60 | 90 | **90** |
| 4 | 40 | 80 | **90** |
| 5 | 60 | 70 | **100** |
| 6 | 30 | 80 | 80 |
| 7 | 70 | 80 | **100** |
| 8 | 30 | 100 | **100** |
| 9 | 40 | 90 | **100** |
| 10 | 50 | 100 | **100** |

### Mean penned per episode (out of n)

| n | Strömbom | `bc` | `rl` |
|---:|---:|---:|---:|
| 1 | 0.30 | 0.80 | **0.90** |
| 5 | 3.90 | 4.10 | **5.00** |
| 8 | 4.20 | 8.00 | **8.00** |
| 10 | 7.40 | 10.00 | **10.00** |

### Takeaways

- **BC beats Strömbom at every flock size except n=2** under realistic
  LiDAR conditions (full field, partial observability). Strömbom
  struggles on small flocks where a single sheep can spawn beyond the
  LiDAR's 12 m range; BC learned active perception from the demos.
- **RL refines BC** without regressing on any cell. It ties or beats BC
  at every flock size; in the success-rate table the biggest gains are
  at n=2 (50 → 90) and n=5 (70 → 100), where BC's imitation of
  Strömbom's drive heuristic was sub-optimal.
- **Aggressive reward shaping doesn't help.** A more aggressive
  variant (β=0.02, W_TIME=-0.1, W_IMITATE=0, 3 M steps) trained as
  an ablation was strictly worse than the conservative tune shipped
  here (β=0.05, W_IMITATE=0.5, 1 M steps).

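The "no regression" claim and the overall ranking can be checked directly against the success-rate table. The snippet below only re-enters the table values and does arithmetic on them; it does not rerun any evaluation.

```python
# Success rates (%) copied from the table above, for flock sizes n = 1..10.
strombom = [30, 90, 60, 40, 60, 30, 70, 30, 40, 50]
bc = [80, 50, 90, 80, 70, 80, 80, 100, 90, 100]
rl = [90, 90, 90, 90, 100, 80, 100, 100, 100, 100]


def mean(values):
    return sum(values) / len(values)


# True if RL is at least as good as BC for every flock size.
rl_never_regresses = all(r >= b for r, b in zip(rl, bc))

print(mean(strombom), mean(bc), mean(rl))  # 50.0 82.0 94.0
print(rl_never_regresses)                  # True
```
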
## License

Educational project for the *Topics in Intelligent Robotics* course.