Johnny Fernandes 1bb9415414 Checkpoint 2
2026-05-07 22:00:10 +01:00

# RL-Driven Shepherd Herding — Implementation Plan
This plan turns the existing Strömbom-only Webots project into a dual-mode
shepherd controller (RL primary, Strömbom fallback), with a fast Gymnasium
training environment that mirrors the Webots dynamics tightly enough for
sim-to-sim transfer. Stable-Baselines3 PPO is the learner.

---
## 1. Current state (audit)
### World geometry — `worlds/field.wbt`
- Field bounded by stone walls at **x,y ∈ [-15, +15]**. Inside-usable area is
~[-14.5, 14.5] (`X_MIN/MAX` in `flocking.py`).
- **Pen is *inside* the field**: x ∈ [10, 13], y ∈ [-15, -8], with the
opening on its **north** side at y = -8 (post-and-rail fence W/E; open N).
- South stone wall has a **gate at x ∈ [10, 13], y = -15** (split wall +
gate posts at x=10 and x=13). So sheep that get penned end up between the
fence (N side at y=-8) and the south stone wall (with the wooden gate at
y=-15 currently slightly ajar). The pen is effectively an L-shape inside
the field, not external.
- Spawns: dog at origin (0, 0), 3 sheep around (3, ±2) and (4, 0). Two more
sheep are commented out.
### Robots — protos
- **Sheep** (`protos/Sheep.proto`): differential drive, wheel radius 0.031 m,
axle half-width 0.10 m → wheel base 0.20 m. `maxVelocity = 25 rad/s` →
max linear ≈ **0.78 m/s**. Sensors: GPS, Compass, Emitter+Receiver on
channel 1. `supervisor = TRUE` (used to repaint wool pink on pen entry).
- **ShepherdDog** (`protos/ShepherdDog.proto`): differential drive, wheel
radius 0.038 m, axle half-width 0.14 m → wheel base 0.28 m.
`maxVelocity = 70 rad/s` → max linear ≈ **2.66 m/s**. Sensors: GPS,
Compass, Gyro, Accelerometer, **Lidar** (front-only, FOV 2.44 rad ≈ 140°,
180 rays, range 0.10 to 12 m, noise 0.005), Emitter+Receiver on channel 1,
cosmetic ear/tail motors.
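The quoted top speeds follow directly from wheel radius times maximum wheel velocity. A quick check of that arithmetic, using only the numbers listed above (the constant and function names are illustrative, not taken from the project):

```python
# Wheel radii (m) and motor limits (rad/s) as quoted from the proto files above.
SHEEP_WHEEL_RADIUS = 0.031
SHEEP_MAX_WHEEL_SPEED = 25.0
DOG_WHEEL_RADIUS = 0.038
DOG_MAX_WHEEL_SPEED = 70.0


def max_linear_speed(wheel_radius: float, max_wheel_speed: float) -> float:
    """Straight-line speed of a diff-drive robot with both wheels at the limit."""
    return wheel_radius * max_wheel_speed


sheep_v_max = max_linear_speed(SHEEP_WHEEL_RADIUS, SHEEP_MAX_WHEEL_SPEED)
dog_v_max = max_linear_speed(DOG_WHEEL_RADIUS, DOG_MAX_WHEEL_SPEED)
print(f"sheep: {sheep_v_max:.3f} m/s, dog: {dog_v_max:.2f} m/s")
```

The dog is roughly 3.4 times faster than a sheep, which is what makes outflanking the flock feasible.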
### Sheep controller — `controllers/sheep/{sheep.py,flocking.py}`
- Reynolds-style boid stack: flee (quadratic ramp inside FLEE_DIST=7 m),
cohesion (within 8 m), separation (within 2.5 m), wall soft repulsion
(margin 5 m), wall hard escape (margin 1 m, gain 50), wander.
- Pen-aware: sheep below the gate line but outside the gate corridor get a
northward "deadzone" assist; on first entry into the pen rectangle,
sheep latches `penned=True`, repaints pink, and switches to in-pen
containment + jitter.
- Driver: heading-error PD on diff-drive (k=4), forward velocity scaled by
`cos(err)`, MAX_SPEED=22 (motor units, capped by proto's 25 rad/s).
- Stuck detector: if displacement < 0.05 m for 20 steps, drives toward
field origin to escape wall-pin (a known differential-drive failure mode).
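To make the driver bullet concrete, here is a minimal sketch of a heading-error controller with `cos(err)` forward scaling. It is proportional-only (the derivative term of the real PD loop is omitted), and the function names, the sign convention, and the clamp of `cos(err)` at zero are assumptions rather than a copy of `sheep.py`.

```python
import math

K_TURN = 4.0      # heading gain k quoted above
MAX_SPEED = 22.0  # motor units, below the proto cap of 25 rad/s


def wrap_angle(angle: float) -> float:
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def heading_drive(heading: float, target_heading: float) -> tuple[float, float]:
    """Return (left, right) wheel speeds that steer toward target_heading."""
    err = wrap_angle(target_heading - heading)
    # Slow down when badly misaligned; assumed to stop (not reverse) past 90 degrees.
    forward = MAX_SPEED * max(0.0, math.cos(err))
    turn = K_TURN * err
    left = max(-MAX_SPEED, min(MAX_SPEED, forward - turn))
    right = max(-MAX_SPEED, min(MAX_SPEED, forward + turn))
    return left, right
```

With zero heading error both wheels run at `MAX_SPEED`; at 90° error the forward term vanishes and the robot turns in place.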
### Dog controller — `controllers/shepherd_dog/{shepherd_dog.py,strombom.py}`
- Strömbom collect/drive heuristic. CoM-radius gating
`radius > F·√n` with F=2 selects collect (push furthest sheep inward) vs
drive (push CoM toward the pen entry point at (11.5, -8.0)).
- Deadzone rescue: when a sheep is below the gate line and outside the
pen's x-corridor, the dog repositions to a "behind the sheep, opposite
the pen" stand-off so the sheep's flee vector points back through the
gate. Variants 0/1 alternate lateral offset to break corner cycles.
- Stuck-rescue, EMA action smoothing, target-deadband, RESCUE_SPEED_CAP,
cooldown — all empirical fixes for diff-drive oscillation.
- Logs full per-step debug to `dog_behavior_log.csv` (currently 7 MB —
add to `.gitignore`).
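The collect-versus-drive gate can be sketched in a few lines. This assumes "radius" means the largest sheep-to-CoM distance; `select_mode` and `F_GATE` are illustrative names, not the identifiers used in `strombom.py`.

```python
import numpy as np

F_GATE = 2.0  # gating factor F quoted above


def select_mode(sheep_xy: np.ndarray) -> str:
    """Return "collect" when the flock is too spread out, otherwise "drive"."""
    com = sheep_xy.mean(axis=0)
    # Assumed definition: radius is the distance of the furthest sheep from the CoM.
    radius = np.linalg.norm(sheep_xy - com, axis=1).max()
    n = len(sheep_xy)
    return "collect" if radius > F_GATE * np.sqrt(n) else "drive"


tight_flock = np.array([[3.0, 2.0], [3.0, -2.0], [4.0, 0.0]])
spread_flock = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
print(select_mode(tight_flock), select_mode(spread_flock))
```

For three sheep the threshold is 2·√3 ≈ 3.46 m, so the default spawn layout already counts as a compact flock.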
### Deleted training scaffolding (per `git status`)
- `controllers/shepherd_dog_rl/{shepherd_dog_rl.py, final_model.zip, vecnorm.pkl, plot_debug.py}`
- `training/{config.json, herding_env.py, parity_test.py, requirements.txt, train.py, train_at.py, viz.py, runs/.gitkeep}`

A previous attempt existed; we'll redesign rather than resurrect, keeping
only the lessons (parity-tested env, VecNormalize wrapper, eval cadence).

---
## 2. Design decisions
### 2.1 Pen location — keep inside-field with N gate
The user offered moving the pen *external* (through a wall hole). Tradeoffs:

| Option | Pros | Cons |
|---|---|---|
| **(A) Keep inside-field** (current) | World already built; Strömbom logic already tuned; gate corridor is short | Dog must navigate around three pen walls; adds geometric clutter |
| (B) External pen via wall hole | Cleaner field: dog only sees sheep + outer walls; pen as goal region beyond a 3 m hole at y=-15 | Requires editing `field.wbt` (split south wall, add external pen walls beyond y<-15); existing rescue/deadzone logic must be retuned; outside-field flocking constants don't currently apply |

**Recommendation: keep (A)** for parity with the working Strömbom controller,
but add a **simplification**: widen the pen entrance from 3 m (x ∈ [10, 13])
to 4 m (x ∈ [9.5, 13.5]) and raise the entrance line from y=-8 to y=-7.5
to give the dog more turning room. Optional later: option (B) as a curriculum
extension (Section 7).
### 2.2 Where to train
PPO on Webots directly is too slow (real-time stepping, single env, slow
reset). The previous training scaffolding used a Python 2D sim — that is
the right approach. Constraints for sim-to-sim transfer:
1. **Use the exact same flocking math**: import `controllers/sheep/flocking.py`
from the env, do not reimplement.
2. **Use the same world constants**: import `controllers/shepherd_dog/strombom.py`
for pen geometry and Strömbom baseline.
3. **Model differential drive faithfully**: match wheel-radius, base, and
max wheel-velocity from the proto files. Heading update from
`(ω_R - ω_L)·r / b`, position from `(ω_R + ω_L)·r / 2`.
4. **Match Webots step**: `basicTimeStep = 16 ms`. The sheep controller runs
at every basic step; the env will use the same `dt = 0.016 s`.
5. **Lidar deferred**: dog policy will use a *symbolic* observation
(positions of dog + sheep, plus pen geometry) — not raw lidar — for the
first iteration. Lidar-from-pixels is a much harder learning problem
and isn't required for the herding task. (See Section 7 for an
optional later upgrade.)
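Constraints 3 and 4 amount to a small kinematic integrator. The sketch below uses explicit Euler with the dog's proto values; the state layout, the function name, and the convention that heading 0 points along +x are assumptions and would need checking against the Webots compass convention.

```python
import math

DOG_WHEEL_RADIUS = 0.038  # m, from ShepherdDog.proto
DOG_WHEEL_BASE = 0.28     # m, from ShepherdDog.proto
DT = 0.016                # s, matches basicTimeStep = 16 ms


def integrate_diffdrive(x, y, heading, w_left, w_right,
                        r=DOG_WHEEL_RADIUS, b=DOG_WHEEL_BASE, dt=DT):
    """Advance a diff-drive pose by one step (wheel speeds in rad/s)."""
    v = (w_right + w_left) * r / 2.0   # forward speed, m/s
    omega = (w_right - w_left) * r / b  # yaw rate, rad/s
    x += v * math.cos(heading) * dt
    y += v * math.sin(heading) * dt
    heading += omega * dt
    # Keep the heading wrapped to [-pi, pi].
    heading = math.atan2(math.sin(heading), math.cos(heading))
    return x, y, heading
```

Running both wheels at 70 rad/s moves the dog about 4.3 cm per step; equal and opposite wheel speeds rotate it in place.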
### 2.3 Action space for the dog
Two viable choices:
- **(a) High-level velocity vector** `(vx, vy) ∈ [-1, 1]²`. The same
representation Strömbom emits today; the existing
`drive_action(vx, vy, ...)` function in `shepherd_dog.py` converts this
to wheel speeds. Decouples the policy from low-level diff-drive
oscillations and enables direct A/B against Strömbom.
- (b) Direct wheel speeds `(ω_L, ω_R) ∈ [-1, 1]²`. More expressive, but the
policy must learn diff-drive control from scratch, which is exactly
the source of the wall-stuck and oscillation pain we're trying to
avoid.

**Recommendation: (a)**, the high-level `(vx, vy)` vector. It reuses the well-tuned
`drive_action` controller, which already handles `cos(err)` clamping and
turn gain. RL focuses on *strategy*, not actuation.
### 2.4 Observation space for the dog
Symbolic, fixed-size, normalized to [-1, 1]:

| Field | Dim | Notes |
|---|---|---|
| Dog (x, y, cos h, sin h) | 4 | Position normalized by 15 |
| Sheep CoM (x, y) | 2 | Of *active* (not-penned) sheep |
| Sheep dispersion (radius, std-x, std-y) | 3 | Strömbom collect-vs-drive features |
| Vector dog→CoM (dx, dy, dist) | 3 | Helps the value function |
| Vector dog→pen-entry (dx, dy, dist) | 3 | |
| Vector furthest-sheep→CoM (dx, dy) | 2 | Strömbom collect target hint |
| Min sheep-to-wall distance + min dog-to-wall | 2 | Safety signal |
| Active sheep count / N_max | 1 | |
| 8-bin polar histogram of sheep around dog | 8 | Order-invariant flock shape |

Total: **28 features**. Order-invariant by construction (histogram + summary
stats), so the policy generalizes across flock sizes 1..N_max.
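The polar histogram is the least obvious feature in the table, so here is one possible construction. It assumes bearings are measured relative to the dog's heading and that counts are divided by `N_MAX` (set to 10 here for illustration); neither choice is confirmed by the plan.

```python
import numpy as np

N_BINS = 8
N_MAX = 10  # assumed maximum flock size, used only for normalization


def polar_histogram(dog_xy, dog_heading, sheep_xy, n_bins=N_BINS, n_max=N_MAX):
    """Fraction of N_MAX sheep falling in each angular sector around the dog."""
    sheep_xy = np.asarray(sheep_xy, dtype=np.float64).reshape(-1, 2)
    if len(sheep_xy) == 0:
        return np.zeros(n_bins, dtype=np.float32)
    rel = sheep_xy - np.asarray(dog_xy, dtype=np.float64)
    # Bearing of each sheep in the dog's frame, wrapped to [0, 2*pi).
    bearing = np.mod(np.arctan2(rel[:, 1], rel[:, 0]) - dog_heading, 2.0 * np.pi)
    bins = np.floor(bearing / (2.0 * np.pi / n_bins)).astype(int) % n_bins
    counts = np.bincount(bins, minlength=n_bins).astype(np.float32)
    return counts / n_max
```

Because the output only counts sheep per sector, shuffling the rows of `sheep_xy` leaves it unchanged, which is the order-invariance property the table relies on.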
### 2.5 Reward
Sparse-only is too hard at flock scale; we shape conservatively.
```
r_t = w_pen      · ΔN_penned                          # +1 per newly penned sheep
    + w_progress · (d_CoM_pen[t-1] - d_CoM_pen[t])    # closer-to-pen progress
    + w_compact  · (R[t-1] - R[t])                    # tighter flock progress
    - w_time     · 1                                  # constant time penalty
    - w_wall     · I(min_wall_dist < 1.0 m)           # dog too close to wall
    - w_collide  · I(dog within 0.3 m of any sheep)   # avoid contact
    + w_done     · I(all sheep penned)                # terminal bonus
```
Initial weights: `w_pen=2.0, w_progress=0.5, w_compact=0.2, w_time=0.005,
w_wall=0.01, w_collide=0.05, w_done=10.0`. Tune via 1-sheep curriculum
first — if the dog learns 1-sheep cleanly, the weights are sane.
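As a sketch, the shaped reward maps onto a small pure function. The weights are the initial values listed above; the argument names are illustrative, and the three penalty terms are assumed to enter with a negative sign, as their descriptions imply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    pen: float = 2.0
    progress: float = 0.5
    compact: float = 0.2
    time: float = 0.005
    wall: float = 0.01
    collide: float = 0.05
    done: float = 10.0


def shaped_reward(w, newly_penned, d_prev, d_curr, radius_prev, radius_curr,
                  dog_wall_dist, min_dog_sheep_dist, all_penned):
    """One-step reward; distances in metres, newly_penned is a count."""
    r = w.pen * newly_penned
    r += w.progress * (d_prev - d_curr)            # CoM moved closer to the pen
    r += w.compact * (radius_prev - radius_curr)   # flock got tighter
    r -= w.time                                    # constant time penalty
    r -= w.wall * float(dog_wall_dist < 1.0)       # dog too close to a wall
    r -= w.collide * float(min_dog_sheep_dist < 0.3)  # dog touching a sheep
    r += w.done * float(all_penned)                # terminal bonus
    return r
```

An idle step away from walls and sheep costs exactly `w_time`, so the per-step penalty stays small next to the +2 for a penned sheep.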
### 2.6 Episode
- Max steps: 3000 (≈ 48 s at dt=16 ms — generous).
- Termination: all sheep penned (success), dog/sheep stuck > 600 steps with
no progress (failure), step limit (timeout).
- Reset: domain-randomized — sheep count ∈ {1..N_max}, sheep positions
uniform in field minus pen+gate corridor, dog at origin + U(-2, 2).
### 2.7 Curriculum
| Stage | N_sheep | Duration (steps) | Pass criterion |
|---|---|---|---|
| 0 | 1 | 0.5 M | success ≥ 90 % |
| 1 | 2 | 1.0 M | success ≥ 80 % |
| 2 | 3 | 1.5 M | success ≥ 70 % |
| 3 | 1..3 mixed | 2.0 M | mean reward stable |
| 4 (optional) | 5 | 2.0 M | success ≥ 60 % |

Implemented by changing only `n_sheep` in the env reset.

---
## 3. Repository layout (new)
```
project/
├── controllers/
│   ├── sheep/                  # unchanged
│   ├── shepherd_dog/           # Strömbom controller (renamed entry)
│   │   ├── shepherd_dog.py     # mode-switch wrapper: RL | strombom
│   │   ├── strombom.py         # unchanged (canonical Strömbom)
│   │   └── policy_loader.py    # NEW: loads SB3 zip + VecNormalize
│   └── ...
├── herding/                    # NEW: Python package, importable from env + controller
│   ├── __init__.py
│   ├── geometry.py             # field/pen constants, in_pen(), wall helpers (single source of truth)
│   ├── flocking_sim.py         # vectorised numpy port of flocking.py for fast batched sheep
│   ├── diffdrive.py            # diff-drive integrator matching the proto specs
│   └── obs.py                  # observation builder shared by env and Webots controller
├── training/                   # NEW
│   ├── herding_env.py          # gymnasium.Env, single-agent (the dog)
│   ├── parity_test.py          # asserts env trajectory ≈ Webots trajectory for fixed seeds
│   ├── train_ppo.py            # SB3 PPO entry point
│   ├── eval.py                 # rollout + metrics (success rate, time-to-pen)
│   ├── configs/
│   │   ├── ppo_default.yaml
│   │   └── curriculum.yaml
│   ├── runs/                   # tensorboard + checkpoints (.gitignored)
│   └── requirements.txt
├── docs/
│   └── project.md              # unchanged
├── plan.md                     # this file
└── ...
```
`herding/` becomes the **single source of truth** for geometry and dynamics.
The Webots controllers and the training env both import from it, so when a
constant changes in one place it changes everywhere — eliminating the
sim/Webots-drift class of bugs.
This means the existing `controllers/sheep/flocking.py` and
`controllers/shepherd_dog/strombom.py` become thin shims that re-export
from `herding/`. Webots controllers can import `herding/` because Webots
adds the project root to `sys.path` at controller startup; we'll verify.

---
## 4. The Gymnasium environment — `training/herding_env.py`
```python
import gymnasium
import numpy as np
from gymnasium.spaces import Box


class HerdingEnv(gymnasium.Env):
    metadata = {"render_modes": ["rgb_array", "human"]}

    def __init__(self, n_sheep=3, max_steps=3000, dt=0.016, seed=None):
        self.n_sheep, self.max_steps, self.dt = n_sheep, max_steps, dt
        self.action_space = Box(low=-1, high=1, shape=(2,), dtype=np.float32)
        self.observation_space = Box(low=-1, high=1, shape=(28,), dtype=np.float32)
        ...

    def reset(self, *, seed=None, options=None):
        # Random sheep positions in field \ pen corridor, dog near origin.
        # Optional curriculum: options["n_sheep"] overrides.
        ...

    def step(self, action):
        vx, vy = action  # high-level velocity intent
        # Convert to wheel speeds via the same drive_action inverse used in Webots
        wL, wR = self._diffdrive_inverse(vx, vy, self.dog_state)
        self.dog_state = self._integrate_diffdrive(self.dog_state, wL, wR, self.dt)
        # Step every sheep one boid step (vectorized in flocking_sim.py)
        self.sheep_state = self._step_sheep(self.sheep_state, self.dog_state)
        # Update penned set, compute reward, observation, done flags
        ...
```
Key points:
- **Vectorised sheep update**: re-implements `flocking.py` in numpy so 100
parallel envs with 5 sheep each take ms, not seconds. Numerical parity
with the scalar version is asserted in `parity_test.py`.
- **Same diff-drive integrator** for the dog as Webots will see at
inference. Wall + pen-fence collisions clamp position (a Webots-realistic
no-pass-through approximation).
- **Domain randomization** in reset: sheep count, spawn positions, sheep
flock-parameter jitter (±10 % on FLEE_DIST, COHESION_DIST, etc.) for
robustness.
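The ±10 % flock-parameter jitter could look like the following. The base values come from Section 1; `SEPARATION_DIST` and the function name are hypothetical, and drawing each parameter independently and uniformly is an assumption.

```python
import numpy as np

# Nominal flocking distances in metres (Section 1). SEPARATION_DIST is a
# hypothetical name; FLEE_DIST and COHESION_DIST appear in the plan.
BASE_PARAMS = {"FLEE_DIST": 7.0, "COHESION_DIST": 8.0, "SEPARATION_DIST": 2.5}


def jitter_params(rng: np.random.Generator, base=BASE_PARAMS, rel=0.10):
    """Scale every parameter by an independent factor in [1 - rel, 1 + rel)."""
    return {k: float(v * rng.uniform(1.0 - rel, 1.0 + rel)) for k, v in base.items()}


rng = np.random.default_rng(0)
episode_params = jitter_params(rng)
```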
---
## 5. Training pipeline — `training/train_ppo.py`
- **Algorithm**: SB3 `PPO` with `MlpPolicy`, `n_steps=2048`, `batch_size=256`,
`n_epochs=10`, `gamma=0.995`, `gae_lambda=0.95`, `clip_range=0.2`,
`ent_coef=0.005`, `vf_coef=0.5`, `learning_rate=3e-4`.
- **Vec envs**: `SubprocVecEnv` × 16 parallel envs (the env is pure numpy
so subprocs are CPU-cheap).
- **Normalization**: `VecNormalize(norm_obs=True, norm_reward=True,
clip_obs=10.0)`. Pickled alongside the policy zip — both required at
inference.
- **Callbacks**:
- `CheckpointCallback` every 100 k steps.
- `EvalCallback` on a separate eval env (no normalization-update) every
50 k steps; logs success rate and time-to-pen to TensorBoard.
- Custom `CurriculumCallback`: bumps `n_sheep` when eval success rate
crosses the stage threshold for 3 consecutive evals.
- **Determinism for debugging**: seed-pinned eval env so regressions are
catchable.
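The promotion rule behind `CurriculumCallback` is simple enough to isolate from SB3. The sketch below keeps only the bookkeeping (thresholds from Section 2.7, three consecutive passing evals); the class name is illustrative, and wiring it into an actual SB3 callback is left out.

```python
class CurriculumTracker:
    """Advance one stage after `patience` consecutive evals above threshold."""

    def __init__(self, thresholds=(0.90, 0.80, 0.70), patience=3):
        self.thresholds = thresholds
        self.patience = patience
        self.stage = 0
        self.streak = 0

    def update(self, success_rate: float) -> bool:
        """Record one eval result; return True if the stage was just bumped."""
        if self.stage >= len(self.thresholds):
            return False  # already past the last thresholded stage
        if success_rate >= self.thresholds[self.stage]:
            self.streak += 1
        else:
            self.streak = 0  # a single failed eval resets the count
        if self.streak >= self.patience:
            self.stage += 1
            self.streak = 0
            return True
        return False
```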
---
## 6. Webots integration — RL inference path
`controllers/shepherd_dog/shepherd_dog.py` becomes a thin wrapper:
```python
import os

MODE = os.environ.get("HERDING_MODE", "rl")  # "rl" | "strombom"

if MODE == "rl":
    policy = policy_loader.load("training/runs/best/policy.zip",
                                "training/runs/best/vecnormalize.pkl")
    obs_fn = build_obs  # from herding/obs.py
else:
    obs_fn = None  # strombom path uses sheep_positions directly

while robot.step(timestep) != -1:
    receive_messages()
    if MODE == "rl":
        obs = obs_fn(dog_xy, dog_heading, sheep_positions, ...)
        action, _ = policy.predict(obs, deterministic=True)
        vx, vy = action.tolist()
    else:
        vx, vy, mode, dbg = compute_action_debug(dog_xy, sheep_positions, PEN_ENTRY)
        # plus existing rescue/cooldown/EMA layer
    drive_action(vx, vy, ...)
```
A **safety supervisor** wraps the RL output: if `obs` indicates the dog is
< 0.6 m from a wall, override with the existing wall-escape behavior
(reverse + turn). This is a hard guarantee diff-drive needs because PPO
may not discover wall-escape reliably from on-policy data.
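A minimal sketch of such a supervisor follows. It measures distance to the outer walls only (pen fences are ignored), uses the 14.5 m usable half-extent from Section 1, and takes the escape action as a parameter because the real reverse-and-turn routine is not reproduced here; all names are illustrative.

```python
FIELD_HALF_WIDTH = 14.5  # usable half-extent of the field (m), from Section 1
WALL_MARGIN = 0.6        # override threshold quoted above (m)


def wall_distance(x: float, y: float) -> float:
    """Distance from (x, y) to the nearest outer wall of the square field."""
    return min(FIELD_HALF_WIDTH - abs(x), FIELD_HALF_WIDTH - abs(y))


def supervise(action, dog_xy, escape_action):
    """Return the policy action unless the dog is inside the wall margin."""
    if wall_distance(*dog_xy) < WALL_MARGIN:
        return escape_action
    return action


# Example: near the east wall the policy's eastward push is overridden.
print(supervise((1.0, 0.0), (14.2, 0.0), escape_action=(-1.0, 0.0)))
```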
`policy_loader.py` handles the SB3 import lazily so the controller still
works with `MODE=strombom` even if SB3 is not installed in the Webots
Python environment.

---
## 7. Optional extensions (post-baseline)
- **External pen** (Section 2.1 option B): edit `field.wbt` to extend the
south wall hole into an external L-shaped pen with its own walls; update
`herding/geometry.py`; retrain stage 3 only.
- **Lidar observation**: replace symbolic obs with 36-bin downsampled
lidar + ego state; train end-to-end. Useful as the "extra merit"
dimension in the project doc.
- **Two-dog mode**: make env multi-agent, train with `MAPPO`-style shared
critic or independent PPO. The proto already supports multiple dog
instances; world only needs a second `ShepherdDog` node.
- **Mecanum comparison**: swap the dog proto for a mecanum variant; same
policy, different `_integrate_diffdrive` (becomes holonomic).
- **Sheep flock size scaling**: 5, 10, 20 — the obs is order-invariant so
the same policy generalises; just curriculum further.
---
## 8. Risks & mitigations
| Risk | Mitigation |
|---|---|
| Sim-to-Webots gap (sheep dynamics, wall friction) | `parity_test.py` asserts trajectory match within tolerance for fixed seeds; if it fails, fix the env, not the policy |
| Dog learns to wall-pin sheep against fence | Add `w_collide` penalty + min-sheep-to-wall term in obs; curriculum from 1 sheep first |
| PPO oscillation collapses into spinning | Action smoothing in env step (EMA on `(vx, vy)`, mirroring `ACTION_SMOOTH=0.35` from Strömbom controller); add a small `‖a_t - a_{t-1}‖` penalty to the reward |
| Pen approach failures (sheep refuse gate) | Reuse the existing `deadzone_rescue` as a *scripted fallback* triggered when a sheep has been deadzoned > 200 steps — RL handles the common case, scripted handles the corner |
| Gym version mismatch (gymnasium vs gym) | Lock to `gymnasium>=0.29`, `stable-baselines3>=2.3` in requirements |
---
## 9. Milestones (suggested order of implementation)
1. **M0 — Refactor** (no behavior change): create `herding/` package, move
constants out of `flocking.py`/`strombom.py`, leave shims; verify
Webots still runs Strömbom unchanged. Add `dog_behavior_log.csv` to
`.gitignore`.
2. **M1 — Env & parity**: `herding_env.py`, `parity_test.py`. Asserts
sheep + dog trajectories match Webots within tolerance for 5 fixed
seeds. *Done when parity test green.*
3. **M2 — PPO baseline**: train Stage 0 (1 sheep) for 0.5 M steps; eval
in env at ≥ 90 % success.
4. **M3 — Webots inference**: load Stage 0 policy in `shepherd_dog.py`
with `HERDING_MODE=rl`; verify the dog herds 1 sheep into the pen in
the actual Webots world. *This is the sim-to-sim transfer gate.*
5. **M4 — Curriculum**: stages 1-3, ~5 M steps total, with checkpoints
and eval logs.
6. **M5 — Strömbom comparison**: run both controllers on a fixed eval
suite (same seeds, 1/2/3 sheep), log success rate and time-to-pen.
This is a deliverable for the project's "quantitative evaluation"
goal.
7. **M6 — Documentation**: a short README in `training/` showing how to
train, evaluate, and switch modes in Webots.

Each milestone is independently demoable. M0-M3 is the critical path to
"RL works in Webots"; M4-M6 polishes it for the project deliverable.

---
## 10. Decisions (locked in by implementation)
- **Pen layout**: option B (external pen). The pen sits south of the
field at x ∈ [10, 13], y ∈ [-22, -15] and is reached through the
existing 3 m gap in the south stone wall. The old in-field
quarantine fence is gone and the wooden gate is modeled as
swung-open and parked on the west gate post so the corridor is
unobstructed. This kills the deadzone class entirely.
- **Flock size**: 1..10 sheep, sampled uniformly each reset. The order-
invariant observation (CoM, dispersion, polar histogram) lets a
single policy generalise across the whole range. A curriculum widens
`max_n_sheep` from 1 to 10 over training to keep early exploration
tractable.
- **Single-sheep mode**: handled by the same policy (n_sheep=1 is the
first stage of the curriculum and stays in the training distribution
throughout). No separate model.
- **Hardware**: GPU for training. SubprocVecEnv × 16 on CPU feeds an
MlpPolicy on GPU; ~2-3 h for the full curriculum.
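For reference, the external-pen rectangle from the first decision translates into a containment test like the one below. The constant names and the closed-interval comparison are assumptions; the actual helper in `herding/geometry.py` may differ.

```python
# External pen bounds from the "Pen layout" decision above (metres).
PEN_X_MIN, PEN_X_MAX = 10.0, 13.0
PEN_Y_MIN, PEN_Y_MAX = -22.0, -15.0


def in_pen(x: float, y: float) -> bool:
    """True when (x, y) lies inside the external pen rectangle."""
    return PEN_X_MIN <= x <= PEN_X_MAX and PEN_Y_MIN <= y <= PEN_Y_MAX


print(in_pen(11.5, -18.0), in_pen(0.0, 0.0))
```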
## 11. What was built
```
herding/                        # single source of truth, importable from both
  geometry.py                   # field/pen constants, latch helpers, robot specs
  flocking_sim.py               # Reynolds boid step (matches Webots controller)
  diffdrive.py                  # diff-drive kinematics + velocity↔wheels
  obs.py                        # 28-D order-invariant observation builder
  strombom.py                   # collect/drive heuristic (baseline + fallback)
worlds/field.wbt                # external pen south of field, 10 sheep slots,
                                # gate parked open, in-field fence removed
controllers/sheep/sheep.py      # imports from herding/, latches on
                                # is_penned_position
controllers/shepherd_dog/
  shepherd_dog.py               # mode switch (HERDING_MODE=rl|strombom),
                                # safety supervisor for DOG_SOUTH_LIMIT
  policy_loader.py              # lazy SB3 zip + VecNormalize loader
  strombom.py                   # shim re-exporting herding.strombom
training/
  herding_env.py                # gymnasium.Env, action smoothing, reward shaping
  train_ppo.py                  # SB3 PPO with VecNormalize, eval, checkpoints,
                                # curriculum callback
  eval.py                       # success-rate / time-to-pen across n_sheep
  parity_test.py                # shape, determinism, baseline-rollout smoke test
  configs/ppo_default.yaml
  requirements.txt
  README.md                     # how to train, evaluate, switch modes in Webots
```
## 12. To run
```bash
# 1. Install deps (CUDA-enabled torch wheel for GPU)
pip install -r training/requirements.txt
# 2. Smoke test
python -m training.parity_test
# 3. Train (5 M steps, ~2-3 h on a single GPU)
python -m training.train_ppo --out-dir training/runs/baseline
# 4. Evaluate vs Strömbom
python -m training.eval --policy training/runs/baseline/best
python -m training.eval --policy strombom
# 5. Run in Webots
export HERDING_MODE=rl
export HERDING_POLICY_DIR=$PWD/training/runs/baseline/best
webots worlds/field.wbt
```