# Autonomous Shepherd-Dog Herding (Webots + RL)

Group G25: *Diogo Costa, Johnny Fernandes, Nelson Neto*

A differential-drive shepherd dog that herds 1–10 sheep through a 3 m
gate into an external pen. The dog has three deployable modes:

| Mode | Source | Role |
|---|---|---|
| `strombom` | Strömbom et al. (2014) collect/drive heuristic | Analytic baseline |
| `bc` | Behaviour cloning of the Strömbom teacher | Imitation learning result |
| `rl` | KL-regularised PPO fine-tune of `bc` | Reward-driven refinement |

`sequential` (single-target pin-and-push) is kept as an alternative
analytic baseline.

## Perception

The dog perceives sheep **only through its front-mounted 140° LiDAR**
(180 rays, 12 m max range; see `protos/ShepherdDog.proto`). On each
control step, the controller:

1. Reads `lidar.getRangeImage()`.
2. Clusters the returns into world-frame `(x, y)` estimates
   (`herding/lidar_perception.py`).
3. Folds them into a multi-target tracker that maintains last-seen
   positions for sheep currently outside the FOV
   (`herding/sheep_tracker.py`).

**LiDAR validation** (intermediate-goal item v from `docs/project.md`):
during development, a diagnostic-dump controller captured 80 real
Webots scans together with the ground-truth (GT) sheep positions.
After the `+SHEEP_RADIUS` surface-to-centre correction, the clustered
centroids matched the GT positions to within 0.15 m. The LiDAR
pipeline therefore produces correct sheep-position estimates from
real Webots scans, which validates the sensor for the herding task.

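The clustering step and the surface-to-centre correction can be illustrated with a minimal sketch. This is not the code in `herding/lidar_perception.py`: the `SHEEP_RADIUS` value, the `GAP` threshold, the left-to-right ray ordering, and the function name `cluster_scan` are all assumptions made for illustration.

```python
import math

FOV_DEG = 140.0      # from the README
MAX_RANGE = 12.0     # metres, from the README
SHEEP_RADIUS = 0.3   # ASSUMED value; only the name appears in the README
GAP = 0.5            # ASSUMED range jump (m) that splits two clusters


def cluster_scan(ranges, dog_x, dog_y, dog_heading):
    """Group adjacent in-range rays and return world-frame centre estimates."""
    clusters, current = [], []
    for i, r in enumerate(ranges):
        hit = math.isfinite(r) and r < MAX_RANGE
        # A large jump in range between neighbouring hits starts a new cluster.
        if hit and current and abs(r - current[-1][1]) > GAP:
            clusters.append(current)
            current = []
        if hit:
            current.append((i, r))
        elif current:
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)

    fov = math.radians(FOV_DEG)
    centres = []
    for cluster in clusters:
        mean_index = sum(i for i, _ in cluster) / len(cluster)
        nearest = min(r for _, r in cluster)
        # ASSUMPTION: ray 0 is the left-most ray and rays sweep clockwise.
        bearing = fov / 2 - mean_index * fov / (len(ranges) - 1)
        # Surface-to-centre correction: the ray hits the near surface of the
        # sheep, so push the estimate back by one body radius.
        centre_range = nearest + SHEEP_RADIUS
        angle = dog_heading + bearing
        centres.append((dog_x + centre_range * math.cos(angle),
                        dog_y + centre_range * math.sin(angle)))
    return centres
```

With a 180-ray scan that is empty except for two central rays at 4.7 m, this sketch returns a single centre about 5.0 m straight ahead of the dog.
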
The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
earlier receiver-based one, so Strömbom, Sequential, and the BC
observation builder all run unchanged on top of it. The 2D Gymnasium
env raycasts sheep discs at training time (`herding/lidar_sim.py`), so
demos collected in the env match the perception the deployed
controller sees in Webots.

Privileged ground-truth perception is available for ablation via
`HerdingEnv(use_lidar=False)`.

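The tracker's last-seen memory can be pictured with a small greedy nearest-neighbour sketch. It is only an illustration of the idea, not the implementation in `herding/sheep_tracker.py`: the class name, the `GATE` distance, and the `sheep_N` naming scheme are hypothetical.

```python
import math

GATE = 1.0  # ASSUMED association gate in metres


class LastSeenTracker:
    """Greedy nearest-neighbour tracker that remembers out-of-view sheep."""

    def __init__(self):
        self.tracks = {}    # name -> (x, y), last-seen position
        self._next_id = 0

    def update(self, detections):
        # Tracks not yet claimed by a detection during this step.
        unmatched = dict(self.tracks)
        for det in detections:
            best = min(unmatched,
                       key=lambda name: math.dist(unmatched[name], det),
                       default=None)
            if best is not None and math.dist(unmatched[best], det) <= GATE:
                self.tracks[best] = det          # refresh an existing track
                del unmatched[best]
            else:
                name = f"sheep_{self._next_id}"  # hypothetical naming scheme
                self._next_id += 1
                self.tracks[name] = det
        # Tracks without a detection keep their last-seen position.
        return dict(self.tracks)
```

Calling `update()` with an empty detection list returns every known sheep at its last-seen position. That property is what lets a controller written for the `{name: (x, y)}` interface keep working when part of the flock is outside the FOV.
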
## Quick start

```bash
# 1. Set up the Python env (any venv with PyTorch + SB3)
pip install -r training/requirements.txt

# 2. Smoke test
python -m tests.parity_test

# 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
python -m tools.collect_demos --teacher strombom \
    --out training/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
python -m training.bc_pretrain --demos training/demos.npz \
    --out training/runs/bc --epochs 60 --net-arch 512,512

# 4. KL-PPO fine-tune of the BC policy (~30 min on CPU, 1 M steps)
python -m training.train_ppo \
    --bc training/runs/bc \
    --out training/runs/rl \
    --total-timesteps 1000000

# 5. Evaluate (env)
python -m training.eval --policy training/runs/rl \
    --max-flock 10 --max-steps 15000 --n-seeds 10

# 6. Run in Webots
tools/run_webots.sh 10 bc         # behaviour-cloned MLP
tools/run_webots.sh 10 rl         # KL-PPO fine-tune
tools/run_webots.sh 10 strombom   # analytic baseline
```

## Layout

```
herding/                      perception / control / world primitives
  obs.py                      32-D order-invariant observation builder
  world/                      environment-side physics & geometry
    geometry.py               field/pen constants, robot specs
    diffdrive.py              differential-drive kinematics
    flocking_sim.py           Reynolds + Strömbom 2014 sheep dynamics
  perception/                 LiDAR → tracked-sheep pipeline
    lidar_sim.py              fast 2D raycast for the env
    lidar_perception.py       scan → world-frame cluster centroids + filters
    sheep_tracker.py          multi-target NN tracker with FOV memory
  control/                    every dog mode's action source
    strombom.py               canonical CoM collect/drive heuristic
    sequential.py             single-target "pin-and-push" alternative
    active_scan.py            wraps a base teacher with opening rotation +
                              walk-to-centre fallback
    modulation.py             shared near-sheep speed-modulation helper

controllers/
  sheep/sheep.py              Webots sheep controller (uses herding.world.flocking_sim)
  shepherd_dog/
    shepherd_dog.py           Webots dog controller, mode-switched
    policy_loader.py          lazy SB3 policy loader (auto-detects frame stack)

training/
  herding_env.py              Gymnasium env (LiDAR + tracker by default)
  bc_pretrain.py              supervised BC of (obs, action) demos into MLP
  train_ppo.py                KL-regularised PPO fine-tune of BC
  eval.py                     analytic + learned policy comparison harness
  runs/                       checkpoints (whitelisted in .gitignore)
  requirements.txt

tests/
  parity_test.py              shape / determinism / baseline smoke test

tools/
  collect_demos.py            sim demos via the active-scan teacher
  run_webots.sh               launch Webots with N sheep + chosen mode

worlds/
  field.wbt                   main world (3 m gate, external pen)

protos/                       Sheep / ShepherdDog robot definitions
docs/project.md               original project goals
```

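The layout mentions a 32-D observation and a policy loader that auto-detects the frame stack, and the demos are collected with `--frame-stack 4`. How the real loader does this is not stated; one plausible approach, sketched here with hypothetical names, is to divide the policy's input size by the per-frame observation size and then keep a rolling window of the most recent frames.

```python
from collections import deque

OBS_DIM = 32  # per-frame observation size, from the layout above


def infer_frame_stack(policy_input_dim, obs_dim=OBS_DIM):
    """Guess how many frames a policy expects from its input size."""
    if policy_input_dim % obs_dim != 0:
        raise ValueError("input size is not a multiple of the observation size")
    return policy_input_dim // obs_dim


class FrameStacker:
    """Rolling window of the last n observations, flattened oldest-first."""

    def __init__(self, n_frames):
        self.frames = deque(maxlen=n_frames)

    def reset(self, obs):
        # ASSUMPTION: at episode start the window is filled with the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(list(obs))
        return self.stacked()

    def push(self, obs):
        self.frames.append(list(obs))  # the deque drops the oldest frame
        return self.stacked()

    def stacked(self):
        return [value for frame in self.frames for value in frame]
```

Under these assumptions a policy with a 128-D input would be treated as a 4-frame stack, matching the `--frame-stack 4` setting used when collecting demos.
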
## Shared low-level control

Every dog mode (Strömbom, Sequential, BC, RL) routes its action
through `herding/control/modulation.py:modulate_speed_near_sheep`,
which scales the action magnitude down when the dog is within ~2.5 m
of the nearest tracked sheep. This stops the dog from charging in at
full speed and scattering the flock. The action's direction (its
intent) is preserved.

All modes also share the same EMA action smoother, configured by
`ACTION_SMOOTH = 0.55` in `controllers/shepherd_dog/shepherd_dog.py`.

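As a rough illustration of these two shared stages, the sketch below scales a 2-D action uniformly (so its direction is unchanged) and then applies an exponential moving average. The linear ramp, the `MIN_SCALE` floor, the function names, and the choice to weight the previous action by `ACTION_SMOOTH` are assumptions; only the ~2.5 m radius and the 0.55 constant come from the text above.

```python
import math

SLOWDOWN_RADIUS = 2.5  # metres, approximate value from the README
MIN_SCALE = 0.3        # ASSUMED lower bound on the speed scale
ACTION_SMOOTH = 0.55   # from the README; weighting convention is assumed


def scale_near_sheep(action, dog_xy, sheep_positions):
    """Shrink the action magnitude near the closest sheep, keeping direction."""
    positions = list(sheep_positions)
    if not positions:
        return tuple(action)
    nearest = min(math.dist(dog_xy, p) for p in positions)
    if nearest >= SLOWDOWN_RADIUS:
        return tuple(action)
    # ASSUMED linear ramp from MIN_SCALE (touching) to 1.0 (at the radius).
    scale = MIN_SCALE + (1.0 - MIN_SCALE) * nearest / SLOWDOWN_RADIUS
    return tuple(scale * a for a in action)


def smooth_action(previous, target, alpha=ACTION_SMOOTH):
    """EMA smoother; alpha is ASSUMED to weight the previous action."""
    return tuple(alpha * p + (1.0 - alpha) * t for p, t in zip(previous, target))
```

Because every component is multiplied by the same factor, the ratio between components stays fixed, which is one simple way to preserve the action's direction while reducing its magnitude.
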
## Results: env eval, 10 seeds × n=1..10

`max_steps=15000`, full-field spawn distribution. The first table
gives the success rate per flock size; the second gives the mean
number of sheep penned per episode for selected flock sizes.

### Success rate (%)

| n | Strömbom | `bc` | `rl` |
|---:|---:|---:|---:|
| 1 | 30 | 80 | **90** |
| 2 | 90 | 50 | **90** |
| 3 | 60 | 90 | **90** |
| 4 | 40 | 80 | **90** |
| 5 | 60 | 70 | **100** |
| 6 | 30 | 80 | 80 |
| 7 | 70 | 80 | **100** |
| 8 | 30 | 100 | **100** |
| 9 | 40 | 90 | **100** |
| 10 | 50 | 100 | **100** |

### Mean penned per episode (out of n)

| n | Strömbom | `bc` | `rl` |
|---:|---:|---:|---:|
| 1 | 0.30 | 0.80 | **0.90** |
| 5 | 3.90 | 4.10 | **5.00** |
| 8 | 4.20 | 8.00 | **8.00** |
| 10 | 7.40 | 10.00 | **10.00** |

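The success-rate table can be summarised with a few lines of arithmetic. The numbers below are copied from the table; the averages and per-size differences are derived from them, not taken from a separate experiment.

```python
# Success rate (%) for n = 1..10, copied from the table above.
success = {
    "strombom": [30, 90, 60, 40, 60, 30, 70, 30, 40, 50],
    "bc":       [80, 50, 90, 80, 70, 80, 80, 100, 90, 100],
    "rl":       [90, 90, 90, 90, 100, 80, 100, 100, 100, 100],
}

# Mean success rate across the ten flock sizes.
means = {mode: sum(rates) / len(rates) for mode, rates in success.items()}
print(means)  # {'strombom': 50.0, 'bc': 82.0, 'rl': 94.0}

# Per-flock-size gain of rl over bc, in percentage points.
rl_gain = [r - b for r, b in zip(success["rl"], success["bc"])]
print(rl_gain)  # [10, 40, 0, 10, 30, 0, 20, 0, 10, 0]

# Per-flock-size gain of bc over strombom, in percentage points.
bc_gain = [b - s for b, s in zip(success["bc"], success["strombom"])]
print(bc_gain)  # [50, -40, 30, 40, 10, 50, 10, 70, 50, 50]
```
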
### Takeaways

- **BC beats Strömbom at every flock size except n=2** under realistic
  LiDAR conditions (full field, partial observability). Strömbom
  struggles on small flocks where a single sheep can spawn beyond the
  LiDAR's 12 m range; BC learned active perception from the demos.
- **RL refines BC** without regressing in any cell. It ties or beats
  BC at every flock size; the biggest gains are at n=2 (50 → 90) and
  n=5 (70 → 100), where BC's imitation of Strömbom's drive heuristic
  was sub-optimal.
- **Aggressive reward shaping doesn't help.** A more aggressive
  variant (β=0.02, W_TIME=-0.1, W_IMITATE=0, 3 M steps) trained as
  an ablation was strictly worse than the conservative tune shipped
  here (β=0.05, W_IMITATE=0.5, 1 M steps).

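For readers unfamiliar with KL-regularised fine-tuning, the sketch below shows one common form of the idea: add a penalty, weighted by β, on the divergence between the fine-tuned policy and the frozen BC policy. Whether `training/train_ppo.py` uses this exact form, this KL direction, or Gaussian action heads is an assumption, and the function names are hypothetical.

```python
import math


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += (math.log(sq / sp)
                  + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2)
                  - 0.5)
    return total


def kl_regularised_loss(ppo_loss, mu_pi, std_pi, mu_bc, std_bc, beta=0.05):
    """PPO loss plus a beta-weighted penalty for drifting away from BC.

    ASSUMPTION: beta plays the role of the KL coefficient and the penalty
    is KL(pi || pi_bc); the real training script may differ.
    """
    return ppo_loss + beta * gaussian_kl(mu_pi, std_pi, mu_bc, std_bc)
```

Under this reading, a smaller β (such as the 0.02 used in the ablation) lets the policy move further from the BC prior, which would be consistent with that variant being described as the more aggressive one.
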
## License

Educational project for the *Topics in Intelligent Robotics* course.