Checkpoint 2

2026-05-07 22:00:10 +01:00
parent 90aa3bbcb4
commit 1bb9415414
37 changed files with 3068 additions and 2912 deletions
@@ -0,0 +1,458 @@
+# RL-Driven Shepherd Herding — Implementation Plan
+
+This plan turns the existing Strömbom-only Webots project into a dual-mode
+shepherd controller (RL primary, Strömbom fallback), with a fast Gymnasium
+training environment that mirrors the Webots dynamics tightly enough for
+sim-to-sim transfer. Stable-Baselines3 PPO is the learner.
+
+---
+
+## 1. Current state (audit)
+
+### World geometry — `worlds/field.wbt`
+- Field bounded by stone walls at **x,y ∈ [−15, +15]**. Inside-usable area is
+  ~[−14.5, 14.5] (`X_MIN/MAX` in `flocking.py`).
+- **Pen is *inside* the field**: x ∈ [10, 13], y ∈ [−15, −8], with the
+  opening on its **north** side at y = −8 (post-and-rail fence W/E; open N).
+- South stone wall has a **gate at x ∈ [10, 13], y = −15** (split wall +
+  gate posts at x=10 and x=13). So sheep that get penned end up between the
+  fence (N side at y=−8) and the south stone wall (with the wooden gate at
+  y=−15 currently slightly ajar). The pen is effectively an L-shape inside
+  the field, not external.
+- Spawns: dog at origin (0, 0), 3 sheep around (3, ±2) and (4, 0). Two more
+  sheep are commented out.
+
+### Robots — protos
+- **Sheep** (`protos/Sheep.proto`): differential drive, wheel radius 0.031 m,
+  axle half-width 0.10 m → wheel base 0.20 m. `maxVelocity = 25 rad/s` →
+  max linear ≈ **0.78 m/s**. Sensors: GPS, Compass, Emitter+Receiver on
+  channel 1. `supervisor = TRUE` (used to repaint wool pink on pen entry).
+- **ShepherdDog** (`protos/ShepherdDog.proto`): differential drive, wheel
+  radius 0.038 m, axle half-width 0.14 m → wheel base 0.28 m.
+  `maxVelocity = 70 rad/s` → max linear ≈ **2.66 m/s**. Sensors: GPS,
+  Compass, Gyro, Accelerometer, **Lidar** (front-only, FOV 2.44 rad ≈ 140°,
+  180 rays, range 0.10–12 m, noise 0.005), Emitter+Receiver on channel 1,
+  cosmetic ear/tail motors.
+
+### Sheep controller — `controllers/sheep/{sheep.py,flocking.py}`
+- Reynolds-style boid stack: flee (quadratic ramp inside FLEE_DIST=7 m),
+  cohesion (within 8 m), separation (within 2.5 m), wall soft repulsion
+  (margin 5 m), wall hard escape (margin 1 m, gain 50), wander.
+- Pen-aware: sheep below the gate line but outside the gate corridor get a
+  northward "deadzone" assist; on first entry into the pen rectangle,
+  sheep latches `penned=True`, repaints pink, and switches to in-pen
+  containment + jitter.
+- Driver: heading-error PD on diff-drive (k=4), forward velocity scaled by
+  `cos(err)`, MAX_SPEED=22 (motor units, capped by proto's 25 rad/s).
+- Stuck detector: if displacement < 0.05 m for 20 steps, drives toward
+  field origin to escape wall-pin (a known differential-drive failure mode).
+
+### Dog controller — `controllers/shepherd_dog/{shepherd_dog.py,strombom.py}`
+- Strömbom collect/drive heuristic. CoM-radius gating
+  `radius > F·√n` with F=2 selects collect (push furthest sheep inward) vs
+  drive (push CoM toward the pen entry point at (11.5, −8.0)).
+- Deadzone rescue: when a sheep is below the gate line and outside the
+  pen's x-corridor, the dog repositions to a "behind the sheep, opposite
+  the pen" stand-off so the sheep's flee vector points back through the
+  gate. Variants 0/1 alternate lateral offset to break corner cycles.
+- Stuck-rescue, EMA action smoothing, target-deadband, RESCUE_SPEED_CAP,
+  cooldown — all empirical fixes for diff-drive oscillation.
+- Logs full per-step debug to `dog_behavior_log.csv` (currently 7 MB —
+  add to `.gitignore`).
+
+### Deleted training scaffolding (per `git status`)
+- `controllers/shepherd_dog_rl/{shepherd_dog_rl.py, final_model.zip, vecnorm.pkl, plot_debug.py}`
+- `training/{config.json, herding_env.py, parity_test.py, requirements.txt, train.py, train_at.py, viz.py, runs/.gitkeep}`
+
+A previous attempt existed; we'll redesign rather than resurrect, keeping
+only the lessons (parity-tested env, VecNormalize wrapper, eval cadence).
+
+---
+
+## 2. Design decisions
+
+### 2.1 Pen location — keep inside-field with N gate
+The user offered moving the pen *external* (through a wall hole). Tradeoffs:
+
+| Option | Pros | Cons |
+|---|---|---|
+| **(A) Keep inside-field** (current) | World already built; Strömbom logic already tuned; gate corridor is short | Dog must navigate around three pen walls; adds geometric clutter |
+| (B) External pen via wall hole | Cleaner field — dog only sees sheep + outer walls; pen as goal region beyond a 3 m hole at y=−15 | Requires editing `field.wbt` (split south wall, add external pen walls beyond y<−15); existing rescue/deadzone logic must be retuned; outside-field flocking constants don't currently apply |
+
+**Recommendation: keep (A)** for parity with the working Strömbom controller,
+but add a **simplification**: widen the pen entrance from 3 m (x ∈ [10, 13])
+to 4 m (x ∈ [9.5, 13.5]) and raise the entrance line from y=−8 to y=−7.5
+to give the dog more turning room. Optional later: gate B as a curriculum
+extension (Section 7).
+
+### 2.2 Where to train
+
+PPO on Webots directly is too slow (real-time stepping, single env, slow
+reset). The previous training scaffolding used a Python 2D sim — that is
+the right approach. Constraints for sim-to-sim transfer:
+
+1. **Use the exact same flocking math**: import `controllers/sheep/flocking.py`
+   from the env, do not reimplement.
+2. **Use the same world constants**: import `controllers/shepherd_dog/strombom.py`
+   for pen geometry and Strömbom baseline.
+3. **Model differential drive faithfully**: match wheel-radius, base, and
+   max wheel-velocity from the proto files. Heading update from
+   `(ω_R − ω_L)·r / b`, position from `(ω_R + ω_L)·r / 2`.
+4. **Match Webots step**: `basicTimeStep = 16 ms`. The sheep controller runs
+   at every basic step; the env will use the same `dt = 0.016 s`.
+5. **Lidar deferred**: dog policy will use a *symbolic* observation
+   (positions of dog + sheep, plus pen geometry) — not raw lidar — for the
+   first iteration. Lidar-from-pixels is a much harder learning problem
+   and isn't required for the herding task. (See Section 7 for an
+   optional later upgrade.)
+
+### 2.3 Action space for the dog
+
+Two viable choices:
+
+- **(a) High-level velocity vector** `(vx, vy) ∈ [−1, 1]²`. The same
+  representation Strömbom emits today; the existing
+  `drive_action(vx, vy, ...)` function in `shepherd_dog.py` converts this
+  to wheel speeds. Decouples the policy from low-level diff-drive
+  oscillations and enables direct A/B against Strömbom.
+- (b) Direct wheel speeds `(ω_L, ω_R) ∈ [−1, 1]²`. More expressive but the
+  policy must learn diff-drive control from scratch — which is exactly
+  the source of the wall-stuck and oscillation pain we're trying to
+  avoid.
+
+**Recommendation: (a)** — high-level `(vx, vy)`. Reuses the well-tuned
+`drive_action` controller, which already handles `cos(err)` clamping and
+turn gain. RL focuses on *strategy*, not actuation.
+
+### 2.4 Observation space for the dog
+
+Symbolic, fixed-size, normalized to [−1, 1]:
+
+| Field | Dim | Notes |
+|---|---|---|
+| Dog (x, y, cos h, sin h) | 4 | Position normalized by 15 |
+| Sheep CoM (x, y) | 2 | Of *active* (not-penned) sheep |
+| Sheep dispersion (radius, std-x, std-y) | 3 | Strömbom collect-vs-drive features |
+| Vector dog→CoM (dx, dy, dist) | 3 | Helps the value function |
+| Vector dog→pen-entry (dx, dy, dist) | 3 | |
+| Vector furthest-sheep→CoM (dx, dy) | 2 | Strömbom collect target hint |
+| Min sheep-to-wall distance + min dog-to-wall | 2 | Safety signal |
+| Active sheep count / N_max | 1 | |
+| 8-bin polar histogram of sheep around dog | 8 | Order-invariant flock shape |
+
+Total: **28 features**. Order-invariant by construction (histogram + summary
+stats), so the policy generalizes across flock sizes 1..N_max.
+
+### 2.5 Reward
+
+Sparse-only is too hard at flock scale; we shape conservatively.
+
+```
+r_t = w_pen     · ΔN_penned                       # +1 per newly penned sheep
+    + w_progress· (d_CoM_pen[t-1] − d_CoM_pen[t]) # closer-to-pen progress
+    + w_compact· (R[t-1] − R[t])                  # tighter flock progress
+    − w_time   · 1                                 # constant time penalty
+    − w_wall   · I(min_wall_dist < 1.0 m)         # dog too close to wall
+    − w_collide· I(dog within 0.3 m of any sheep) # avoid contact
+    + w_done   · I(all sheep penned)              # terminal bonus
+```
+
+Initial weights: `w_pen=2.0, w_progress=0.5, w_compact=0.2, w_time=0.005,
+w_wall=0.01, w_collide=0.05, w_done=10.0`. Tune via 1-sheep curriculum
+first — if the dog learns 1-sheep cleanly, the weights are sane.
+
+### 2.6 Episode
+
+- Max steps: 3000 (≈ 48 s at dt=16 ms — generous).
+- Termination: all sheep penned (success), dog/sheep stuck > 600 steps with
+  no progress (failure), step limit (timeout).
+- Reset: domain-randomized — sheep count ∈ {1..N_max}, sheep positions
+  uniform in field minus pen+gate corridor, dog at origin ± U(−2, 2).
+
+### 2.7 Curriculum
+
+| Stage | N_sheep | Duration (steps) | Pass criterion |
+|---|---|---|---|
+| 0 | 1 | 0.5 M | success ≥ 90 % |
+| 1 | 2 | 1.0 M | success ≥ 80 % |
+| 2 | 3 | 1.5 M | success ≥ 70 % |
+| 3 | 1..3 mixed | 2.0 M | mean reward stable |
+| 4 (optional) | 5 | 2.0 M | success ≥ 60 % |
+
+Implemented by changing only `n_sheep` in the env reset.
+
+---
+
+## 3. Repository layout (new)
+
+```
+project/
+├── controllers/
+│   ├── sheep/                      # unchanged
+│   ├── shepherd_dog/               # Strömbom controller (renamed entry)
+│   │   ├── shepherd_dog.py         # mode-switch wrapper: RL | strombom
+│   │   ├── strombom.py             # unchanged (canonical Strömbom)
+│   │   └── policy_loader.py        # NEW: loads SB3 zip + VecNormalize
+│   └── ...
+├── herding/                        # NEW: Python package, importable from env + controller
+│   ├── __init__.py
+│   ├── geometry.py                 # field/pen constants, in_pen(), wall helpers (single source of truth)
+│   ├── flocking_sim.py             # vectorised numpy port of flocking.py for fast batched sheep
+│   ├── diffdrive.py                # diff-drive integrator matching the proto specs
+│   └── obs.py                      # observation builder shared by env and Webots controller
+├── training/                       # NEW
+│   ├── herding_env.py              # gymnasium.Env, single-agent (the dog)
+│   ├── parity_test.py              # asserts env trajectory ≈ Webots trajectory for fixed seeds
+│   ├── train_ppo.py                # SB3 PPO entry point
+│   ├── eval.py                     # rollout + metrics (success rate, time-to-pen)
+│   ├── configs/
+│   │   ├── ppo_default.yaml
+│   │   └── curriculum.yaml
+│   ├── runs/                       # tensorboard + checkpoints (.gitignored)
+│   └── requirements.txt
+├── docs/
+│   └── project.md                  # unchanged
+├── plan.md                         # this file
+└── ...
+```
+
+`herding/` becomes the **single source of truth** for geometry and dynamics.
+The Webots controllers and the training env both import from it, so when a
+constant changes in one place it changes everywhere — eliminating the
+sim/Webots-drift class of bugs.
+
+This means the existing `controllers/sheep/flocking.py` and
+`controllers/shepherd_dog/strombom.py` become thin shims that re-export
+from `herding/`. Webots controllers can import `herding/` because Webots
+adds the project root to `sys.path` at controller startup; we'll verify.
+
+---
+
+## 4. The Gymnasium environment — `training/herding_env.py`
+
+```python
+class HerdingEnv(gymnasium.Env):
+    metadata = {"render_modes": ["rgb_array", "human"]}
+
+    def __init__(self, n_sheep=3, max_steps=3000, dt=0.016, seed=None):
+        self.action_space      = Box(low=-1, high=1, shape=(2,), dtype=np.float32)
+        self.observation_space = Box(low=-1, high=1, shape=(28,), dtype=np.float32)
+        ...
+
+    def reset(self, *, seed=None, options=None):
+        # Random sheep positions in field \ pen corridor, dog near origin.
+        # Optional curriculum: options["n_sheep"] overrides.
+        ...
+
+    def step(self, action):
+        vx, vy = action  # high-level velocity intent
+        # Convert to wheel speeds via the same drive_action inverse used in Webots
+        wL, wR = self._diffdrive_inverse(vx, vy, self.dog_state)
+        self.dog_state = self._integrate_diffdrive(self.dog_state, wL, wR, self.dt)
+        # Step every sheep one boid step (vectorized in flocking_sim.py)
+        self.sheep_state = self._step_sheep(self.sheep_state, self.dog_state)
+        # Update penned set, compute reward, observation, done flags
+        ...
+```
+
+Key points:
+- **Vectorised sheep update**: re-implements `flocking.py` in numpy so 100
+  parallel envs with 5 sheep each take ms, not seconds. Numerical parity
+  with the scalar version is asserted in `parity_test.py`.
+- **Same diff-drive integrator** for the dog as Webots will see at
+  inference. Wall + pen-fence collisions clamp position (a Webots-realistic
+  no-pass-through approximation).
+- **Domain randomization** in reset: sheep count, spawn positions, sheep
+  flock-parameter jitter (±10 % on FLEE_DIST, COHESION_DIST, etc.) for
+  robustness.
+
+---
+
+## 5. Training pipeline — `training/train_ppo.py`
+
+- **Algorithm**: SB3 `PPO` with `MlpPolicy`, `n_steps=2048`, `batch_size=256`,
+  `n_epochs=10`, `gamma=0.995`, `gae_lambda=0.95`, `clip_range=0.2`,
+  `ent_coef=0.005`, `vf_coef=0.5`, `learning_rate=3e-4`.
+- **Vec envs**: `SubprocVecEnv` × 16 parallel envs (the env is pure numpy
+  so subprocs are CPU-cheap).
+- **Normalization**: `VecNormalize(norm_obs=True, norm_reward=True,
+  clip_obs=10.0)`. Pickled alongside the policy zip — both required at
+  inference.
+- **Callbacks**:
+  - `CheckpointCallback` every 100 k steps.
+  - `EvalCallback` on a separate eval env (no normalization-update) every
+    50 k steps; logs success rate and time-to-pen to TensorBoard.
+  - Custom `CurriculumCallback`: bumps `n_sheep` when eval success rate
+    crosses the stage threshold for 3 consecutive evals.
+- **Determinism for debugging**: seed-pinned eval env so regressions are
+  catchable.
+
+---
+
+## 6. Webots integration — RL inference path
+
+`controllers/shepherd_dog/shepherd_dog.py` becomes a thin wrapper:
+
+```python
+MODE = os.environ.get("HERDING_MODE", "rl")  # "rl" | "strombom"
+
+if MODE == "rl":
+    policy = policy_loader.load("training/runs/best/policy.zip",
+                                "training/runs/best/vecnormalize.pkl")
+    obs_fn = build_obs   # from herding/obs.py
+else:
+    obs_fn = None        # strombom path uses sheep_positions directly
+
+while robot.step(timestep) != -1:
+    receive_messages()
+    if MODE == "rl":
+        obs = obs_fn(dog_xy, dog_heading, sheep_positions, ...)
+        action, _ = policy.predict(obs, deterministic=True)
+        vx, vy = action.tolist()
+    else:
+        vx, vy, mode, dbg = compute_action_debug(dog_xy, sheep_positions, PEN_ENTRY)
+        # plus existing rescue/cooldown/EMA layer
+    drive_action(vx, vy, ...)
+```
+
+A **safety supervisor** wraps the RL output: if `obs` indicates the dog is
+< 0.6 m from a wall, override with the existing wall-escape behavior
+(reverse + turn). This is a hard guarantee diff-drive needs because PPO
+may not discover wall-escape reliably from on-policy data.
+
+`policy_loader.py` handles the SB3 import lazily so the controller still
+works with `MODE=strombom` even if SB3 is not installed in the Webots
+Python environment.
+
+---
+
+## 7. Optional extensions (post-baseline)
+
+- **External pen** (Section 2.1 option B): edit `field.wbt` to extend the
+  south wall hole into an external L-shaped pen with its own walls; update
+  `herding/geometry.py`; retrain stage 3 only.
+- **Lidar observation**: replace symbolic obs with 36-bin downsampled
+  lidar + ego state; train end-to-end. Useful as the "extra merit"
+  dimension in the project doc.
+- **Two-dog mode**: make env multi-agent, train with `MAPPO`-style shared
+  critic or independent PPO. The proto already supports multiple dog
+  instances; world only needs a second `ShepherdDog` node.
+- **Mecanum comparison**: swap the dog proto for a mecanum variant; same
+  policy, different `_integrate_diffdrive` (becomes holonomic).
+- **Sheep flock size scaling**: 5, 10, 20 — the obs is order-invariant so
+  the same policy generalises; just curriculum further.
+
+---
+
+## 8. Risks & mitigations
+
+| Risk | Mitigation |
+|---|---|
+| Sim-to-Webots gap (sheep dynamics, wall friction) | `parity_test.py` asserts trajectory match within tolerance for fixed seeds; if it fails, fix the env, not the policy |
+| Dog learns to wall-pin sheep against fence | Add `w_collide` penalty + min-sheep-to-wall term in obs; curriculum from 1 sheep first |
+| PPO oscillation collapses into spinning | Action smoothing in env step (EMA on `(vx, vy)`, mirroring `ACTION_SMOOTH=0.35` from Strömbom controller); reward small `‖a_t − a_{t-1}‖` penalty |
+| Pen approach failures (sheep refuse gate) | Reuse the existing `deadzone_rescue` as a *scripted fallback* triggered when a sheep has been deadzoned > 200 steps — RL handles the common case, scripted handles the corner |
+| Gym version mismatch (gymnasium vs gym) | Lock to `gymnasium>=0.29`, `stable-baselines3>=2.3` in requirements |
+
+---
+
+## 9. Milestones (suggested order of implementation)
+
+1. **M0 — Refactor** (no behavior change): create `herding/` package, move
+   constants out of `flocking.py`/`strombom.py`, leave shims; verify
+   Webots still runs Strömbom unchanged. Add `dog_behavior_log.csv` to
+   `.gitignore`.
+2. **M1 — Env & parity**: `herding_env.py`, `parity_test.py`. Asserts
+   sheep + dog trajectories match Webots within tolerance for 5 fixed
+   seeds. *Done when parity test green.*
+3. **M2 — PPO baseline**: train Stage 0 (1 sheep) for 0.5 M steps; eval
+   in env at ≥ 90 % success.
+4. **M3 — Webots inference**: load Stage 0 policy in `shepherd_dog.py`
+   with `HERDING_MODE=rl`; verify the dog herds 1 sheep into the pen in
+   the actual Webots world. *This is the sim-to-sim transfer gate.*
+5. **M4 — Curriculum**: stages 1–3, ~5 M steps total, with checkpoints
+   and eval logs.
+6. **M5 — Strömbom comparison**: run both controllers on a fixed eval
+   suite (same seeds, 1/2/3 sheep), log success rate and time-to-pen.
+   This is a deliverable for the project's "quantitative evaluation"
+   goal.
+7. **M6 — Documentation**: a short README in `training/` showing how to
+   train, evaluate, and switch modes in Webots.
+
+Each milestone is independently demoable. M0–M3 is the critical path to
+"RL works in Webots"; M4–M6 polishes it for the project deliverable.
+
+---
+
+## 10. Decisions (locked in by implementation)
+
+- **Pen layout**: option B (external pen). The pen sits south of the
+  field at x ∈ [10, 13], y ∈ [-22, -15] and is reached through the
+  existing 3 m gap in the south stone wall. The old in-field
+  quarantine fence is gone and the wooden gate is modeled as
+  swung-open and parked on the west gate post so the corridor is
+  unobstructed. This kills the deadzone class entirely.
+- **Flock size**: 1..10 sheep, sampled uniformly each reset. The order-
+  invariant observation (CoM, dispersion, polar histogram) lets a
+  single policy generalise across the whole range. A curriculum widens
+  ``max_n_sheep`` from 1 to 10 over training to keep early exploration
+  tractable.
+- **Single-sheep mode**: handled by the same policy (n_sheep=1 is the
+  first stage of the curriculum and stays in the training distribution
+  throughout). No separate model.
+- **Hardware**: GPU for training. SubprocVecEnv × 16 on CPU feeds an
+  MlpPolicy on GPU; ~2–3 h for the full curriculum.
+
+## 11. What was built
+
+```
+herding/                    # single source of truth, importable from both
+  geometry.py               # field/pen constants, latch helpers, robot specs
+  flocking_sim.py           # Reynolds boid step (matches Webots controller)
+  diffdrive.py              # diff-drive kinematics + velocity↔wheels
+  obs.py                    # 28-D order-invariant observation builder
+  strombom.py               # collect/drive heuristic (baseline + fallback)
+
+worlds/field.wbt            # external pen south of field, 10 sheep slots,
+                            # gate parked open, in-field fence removed
+
+controllers/sheep/sheep.py            # imports from herding/, latches on
+                                      # is_penned_position
+controllers/shepherd_dog/
+  shepherd_dog.py           # mode switch (HERDING_MODE=rl|strombom),
+                            # safety supervisor for DOG_SOUTH_LIMIT
+  policy_loader.py          # lazy SB3 zip + VecNormalize loader
+  strombom.py               # shim re-exporting herding.strombom
+
+training/
+  herding_env.py            # gymnasium.Env, action smoothing, reward shaping
+  train_ppo.py              # SB3 PPO with VecNormalize, eval, checkpoints,
+                            # curriculum callback
+  eval.py                   # success-rate / time-to-pen across n_sheep
+  parity_test.py            # shape, determinism, baseline-rollout smoke test
+  configs/ppo_default.yaml
+  requirements.txt
+  README.md                 # how to train, evaluate, switch modes in Webots
+```
+
+## 12. To run
+
+```bash
+# 1. Install deps (CUDA-enabled torch wheel for GPU)
+pip install -r training/requirements.txt
+
+# 2. Smoke test
+python -m training.parity_test
+
+# 3. Train (5 M steps, ~2–3 h on a single GPU)
+python -m training.train_ppo --out-dir training/runs/baseline
+
+# 4. Evaluate vs Strömbom
+python -m training.eval --policy training/runs/baseline/best
+python -m training.eval --policy strombom
+
+# 5. Run in Webots
+export HERDING_MODE=rl
+export HERDING_POLICY_DIR=$PWD/training/runs/baseline/best
+webots worlds/field.wbt
+```