Checkpoint 4
@@ -3,16 +3,39 @@
Group G25 — *Diogo Costa, Johnny Fernandes, Nelson Neto*

A differential-drive shepherd dog that herds 1–10 sheep through a 3 m
gate into an external pen. The dog has three modes:
gate into an external pen. The dog has three deployable modes:

| Mode | Source | Notes |
| Mode | Source | Role |
|---|---|---|
| `rl` | Behavior cloning of an analytic teacher | The deliverable RL policy |
| `strombom` | Strömbom (2014) collect/drive heuristic | Canonical baseline |
| `sequential` | Single-target "pin and push" | Robust across n=1–10 |
| `strombom` | Strömbom et al. (2014) collect/drive heuristic | Analytic baseline |
| `bc` | Behaviour cloning of the Strömbom teacher | Imitation learning result |
| `rl` | KL-regularised PPO fine-tune of `bc` | Reward-driven refinement |

Plus three documented experimental teachers (`hybrid`, `drive_only`,
`strombom_smooth`) — see `herding/` for details.
`sequential` (single-target pin-and-push) is kept as an alternative
analytic baseline. `dagger` is a data-collection mode, not deployment.

## Perception

The dog perceives sheep **only through its front-mounted 140° LiDAR**
(180 rays, 12 m max range — see `protos/ShepherdDog.proto`). Each
control step:

1. Read `lidar.getRangeImage()`,
2. Cluster returns into world-frame `(x, y)` estimates
   (`herding/lidar_perception.py`),
3. Fold them into a multi-target tracker that maintains last-seen
   positions for sheep currently outside the FOV
   (`herding/sheep_tracker.py`).
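
Steps 1 and 2 could look roughly like the sketch below. It is not the code in `herding/lidar_perception.py`: the function name `scan_to_clusters`, the `gap` threshold, and the left-to-right ray ordering are assumptions made for illustration. Only the 140° field of view and the 12 m range come from the text above.

```python
import math


def scan_to_clusters(ranges, pose, fov_deg=140.0, max_range=12.0, gap=0.5):
    """Group consecutive LiDAR hits into world-frame (x, y) centroids.

    ranges: distances per ray, assumed ordered left-to-right across the FOV.
    pose:   (x, y, heading) of the dog in the world frame, heading in radians.
    gap:    hypothetical threshold in metres; a larger jump between
            neighbouring hits starts a new cluster.
    """
    x0, y0, heading = pose
    n = len(ranges)
    fov = math.radians(fov_deg)
    clusters, current = [], []
    for i, r in enumerate(ranges):
        if math.isfinite(r) and r < max_range:
            # Assumption: ray 0 points at the left edge of the FOV.
            bearing = heading + fov / 2 - fov * i / (n - 1)
            point = (x0 + r * math.cos(bearing), y0 + r * math.sin(bearing))
            if current and math.dist(current[-1], point) > gap:
                clusters.append(current)
                current = []
            current.append(point)
        elif current:
            # A miss (no return) closes the cluster being built.
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    # One centroid per cluster.
    return [
        (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
        for c in clusters
    ]
```

The centroid of the returns lies on the near surface of the object rather than at its centre, so a real pipeline would probably correct for the body radius and filter out walls and posts.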

The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
prior receiver-based one, so Strömbom, Sequential, and the BC obs
builder all run unchanged on top of it. The 2D Gymnasium env
(`herding/lidar_sim.py`) raycasts sheep discs at training time, so
demos collected in the env match the perception the deployed
controller sees in Webots.
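
A minimal sketch of a nearest-neighbour tracker with last-seen memory, assuming greedy association and a hypothetical `gate` radius. The class name, the `sheep_<i>` naming scheme, and the parameters are illustrative and are not taken from `herding/sheep_tracker.py`.

```python
import math


class LastSeenTracker:
    """Greedy nearest-neighbour tracker that remembers unseen targets."""

    def __init__(self, gate=1.5):
        self.gate = gate   # hypothetical association radius in metres
        self.tracks = {}   # name -> last known (x, y)

    def update(self, detections):
        """Fold this step's detections in and return {name: (x, y)}."""
        unmatched = list(detections)
        for name, last in list(self.tracks.items()):
            if not unmatched:
                break
            nearest = min(unmatched, key=lambda d: math.dist(d, last))
            if math.dist(nearest, last) <= self.gate:
                self.tracks[name] = nearest
                unmatched.remove(nearest)
            # Otherwise the track keeps its last-seen position.
        for d in unmatched:
            # Anything left over is treated as a newly seen sheep.
            self.tracks[f"sheep_{len(self.tracks)}"] = d
        return dict(self.tracks)
```

Because a track that finds no detection inside the gate is left untouched, a sheep that leaves the field of view stays in the output dict at its last observed position.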

Privileged ground-truth perception is available for ablation —
`HerdingEnv(use_lidar=False)`.

## Quick start

@@ -23,20 +46,30 @@ pip install -r training/requirements.txt
# 2. Smoke test
python -m training.parity_test

# 3. Reproduce the BC policy from scratch (~25 min on CPU)
python -m tools.collect_demos --teacher strombom --out training/demos.npz \
    --seeds-per-n 30 --subsample 3
python -m training.bc_pretrain --demos training/demos.npz \
    --out training/runs/bc_flock --epochs 100 --net-arch 512,512
# 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
python -m tools.collect_demos --teacher strombom \
    --out training/demos_v3.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
python -m training.bc_pretrain --demos training/demos_v3.npz \
    --out training/runs/bc_v3 --epochs 60 --net-arch 512,512

# 4. Evaluate
python -m training.eval --policy training/runs/bc_flock \
    --max-flock 10 --max-steps 30000 --n-seeds 5
# 4. Optional: DAgger from inside Webots if sim-trained doesn't transfer
tools/auto_dagger.sh 3 60
python -m tools.dagger_merge_train --out training/runs/bc_dagger

# 5. Run in Webots (any of the three modes; n is the flock size)
HERDING_POLICY_DIR=$PWD/training/runs/bc_flock tools/run_webots.sh 10 rl
tools/run_webots.sh 10 strombom
tools/run_webots.sh 10 sequential
# 5. Evaluate (env)
python -m training.eval --policy training/runs/bc_v3 \
    --max-flock 10 --max-steps 8000 --n-seeds 5

# 6. Optional RL fine-tune of the BC policy (~40 min on CPU, 1 M steps)
python -m training.train_ppo \
    --bc training/runs/bc_v3 \
    --out training/runs/rl_v1 \
    --total-timesteps 1000000

# 7. Run in Webots
tools/run_webots.sh 10 bc # behaviour-cloned MLP
tools/run_webots.sh 10 rl # KL-PPO fine-tune
tools/run_webots.sh 10 strombom # analytic baseline
```

## Layout
@@ -46,69 +79,71 @@ herding/ — single source of truth (env + Webots both import)
  geometry.py — field/pen constants, robot specs
  flocking_sim.py — Reynolds-style sheep dynamics
  diffdrive.py — differential-drive kinematics
  control.py — shared near-sheep speed-modulation helper
  obs.py — 32-D order-invariant observation builder
  strombom.py — canonical CoM-drive teacher
  sequential.py — single-target "pin-and-push" teacher
  hybrid.py — flock-then-funnel (experimental, did not scale)
  drive_only.py — Strömbom drive without collect (experimental)
  strombom_smooth.py — sigmoid-blended Strömbom (experimental)
  active_scan.py — wraps a base teacher with opening rotation +
                   walk-to-centre + speed modulation
  lidar_sim.py — fast 2D raycast for the env (sheep + walls + posts)
  lidar_perception.py — scan → world-frame cluster centroids + filters
  sheep_tracker.py — multi-target NN tracker with FOV memory

controllers/
  sheep/sheep.py — Webots sheep controller (uses herding.flocking_sim)
  shepherd_dog/
    shepherd_dog.py — Webots dog controller, mode-switched
    policy_loader.py — lazy SB3 PPO loader
    strombom.py — backwards-compat shim
    policy_loader.py — lazy SB3 policy loader (auto-detects frame stack)

training/
  herding_env.py — Gymnasium env (used for demo collection + eval)
  bc_pretrain.py — supervised BC of analytic teachers into MLP policy
  collect_demos.py — wrapper, see tools/
  eval.py — RL / analytic comparison harness
  parity_test.py — smoke tests
  train_ppo.py — PPO/RL fine-tune (experimental, BC alone preferred)
  herding_env.py — Gymnasium env (LiDAR + tracker by default)
  bc_pretrain.py — supervised BC of (obs, action) demos into MLP
  eval.py — analytic + BC policy comparison harness
  parity_test.py — shape / determinism smoke test
  runs/ — checkpoints (whitelisted in .gitignore)
  requirements.txt
  configs/ppo_default.yaml

tools/
  collect_demos.py — generate (obs, action) demonstrations
  run_webots.sh — launch Webots with N sheep + chosen controller mode
  collect_demos.py — sim demos via the active-scan teacher
  dagger_merge_train.py — merge Webots-collected DAgger demos and retrain
  run_webots.sh — launch Webots with N sheep + chosen mode
  auto_dagger.sh — headless DAgger collection across many runs

worlds/
  field.wbt — main world (3 m gate, external pen)

protos/ — Sheep / ShepherdDog robot definitions
docs/project.md — original project goals
plan.md — design notes / decision log
```

## Two cohesion regimes
## Shared low-level control

Sheep cohesion strength controls which teacher works:
Every dog mode (RL, Strömbom, Sequential, the DAgger teacher) routes
its action through `herding/control.py:modulate_speed_near_sheep`,
which scales action magnitude down when within ~2.5 m of the nearest
tracked sheep. This stops the dog from charging in at full speed and
scattering the flock. Direction (intent) is preserved.
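
The idea can be sketched as a uniform scaling of the action vector. This is an illustration only, not the body of `modulate_speed_near_sheep`: the linear ramp, the `min_scale` floor, and the argument layout are assumptions. Only the ~2.5 m radius and the "magnitude down, direction kept" behaviour come from the description above.

```python
import math


def modulate_speed_sketch(action, dog_xy, sheep_positions,
                          slow_radius=2.5, min_scale=0.3):
    """Shrink the action when the nearest tracked sheep is close.

    action:          tuple of floats (layout is an assumption).
    sheep_positions: {name: (x, y)} dict as produced by the tracker.
    min_scale:       hypothetical floor so the dog never stops completely.
    """
    if not sheep_positions:
        return tuple(action)
    nearest = min(math.dist(dog_xy, p) for p in sheep_positions.values())
    if nearest >= slow_radius:
        return tuple(action)
    # Assumed linear ramp: min_scale at contact, 1.0 at slow_radius.
    scale = min_scale + (1.0 - min_scale) * nearest / slow_radius
    # Every component gets the same factor, so the direction is unchanged.
    return tuple(scale * a for a in action)
```

Multiplying every component by the same factor is what keeps the ratio between components, and therefore the intended direction, intact.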

| Regime | `flocking_sim.py` setting | Strömbom | Sequential |
|---|---|---:|---:|
| **Tight** (current) | `w=3.0/1.0`, `dist=12` | works (flock-style) | breaks (cohesion fights single-sheep targeting) |
| Loose | `w=1.5/0.6`, `dist=8` | breaks (flock fragments at gate) | works (1-by-1 style) |
All modes also share the same EMA action smoother in
`controllers/shepherd_dog/shepherd_dog.py:ACTION_SMOOTH = 0.55`.
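
For reference, an exponential moving average over actions can be written as below. The constant 0.55 is taken from the line above; whether it weights the previous action or the new one in `shepherd_dog.py` is not stated here, so the convention in this sketch is an assumption, as is the helper name `smooth_action`.

```python
ACTION_SMOOTH = 0.55


def smooth_action(previous, raw, alpha=ACTION_SMOOTH):
    """Blend the previous smoothed action with the new raw action.

    Assumed convention: alpha weights the previous smoothed action,
    so a larger alpha means heavier smoothing.
    """
    return tuple(alpha * p + (1.0 - alpha) * r for p, r in zip(previous, raw))


# Step response: starting at rest and commanding full speed on both channels.
state = (0.0, 0.0)
for _ in range(2):
    state = smooth_action(state, (1.0, 1.0))
```

Under this convention the smoothed action reaches 1 - 0.55^k of a constant command after k control steps, which is about 70 % after two steps.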

The codebase ships with the **tight** regime. To use the loose-regime
Sequential clone, edit those constants in `herding/flocking_sim.py` and
load `training/runs/bc_solo/`.
## Webots results (steps to all-penned, fast mode)

## Results
Single seed per cell using `worlds/field.wbt` defaults. All modes hit
100 % pen rate; numbers shown are time-to-all-penned in simulation
steps (16 ms each).

Eval at `--max-steps 30000 --n-seeds 5`, deployment difficulty (full
field spawn distribution):

| n | Strömbom | Sequential | BC-flock (RL) |
| n | Strömbom | `bc` | `rl` (KL-PPO of `bc`) |
|---:|---:|---:|---:|
| 1 | 100 % | 100 % | 100 % |
| 5 | 100 % | 100 % | 80–100 % |
| 8 | 100 % | 100 % | 80 % |
| 10 | **100 %** | 80 % | **80 %** (mean_penned 8/10) |
| 3 | 5 800 | 9 800 | **4 800** |
| 5 | 10 200 | 9 200 | 9 800 |
| 8 | 14 000 | 17 600 | **15 400** |
| 10 | 18 600 | 19 600 | **12 000** |

The BC policy hits ~80 % of the analytic teacher's success rate in 100 %
neural-network inference, with no hand-coded logic.
The RL fine-tune is **39 % faster than `bc` on n=10** and **51 % faster
on n=3**, confirming the KL-anchored PPO actually finds reward-driven
improvements over the BC imitation baseline rather than just collapsing
back to it.
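
The percentages follow from the step counts in the Webots table, reading "faster" as "fewer steps to all-penned". A short check, with the `bc` and `rl` columns copied from that table and a helper name (`reduction_pct`) chosen here for illustration:

```python
# Steps to all-penned from the Webots results table: n -> (bc, rl).
steps = {3: (9800, 4800), 5: (9200, 9800), 8: (17600, 15400), 10: (19600, 12000)}


def reduction_pct(bc_steps, rl_steps):
    """Percentage of steps saved by rl relative to bc (negative = slower)."""
    return 100.0 * (bc_steps - rl_steps) / bc_steps


for n, (bc, rl) in steps.items():
    print(f"n={n}: {reduction_pct(bc, rl):+.1f} %")
```

This prints +51.0 % for n=3 and +38.8 % for n=10, matching the rounded figures quoted above. On the same single-seed numbers the fine-tune saves 12.5 % at n=8 and takes about 6.5 % more steps than `bc` at n=5.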

## License