Files
TIR_PROJ/plan.md
T
Johnny Fernandes 1bb9415414 Checkpoint 2
2026-05-07 22:00:10 +01:00

21 KiB
Raw Blame History

RL-Driven Shepherd Herding — Implementation Plan

This plan turns the existing Strömbom-only Webots project into a dual-mode shepherd controller (RL primary, Strömbom fallback), with a fast Gymnasium training environment that mirrors the Webots dynamics tightly enough for sim-to-sim transfer. Stable-Baselines3 PPO is the learner.


1. Current state (audit)

World geometry — worlds/field.wbt

  • Field bounded by stone walls at x,y ∈ [15, +15]. Inside-usable area is ~[14.5, 14.5] (X_MIN/MAX in flocking.py).
  • Pen is inside the field: x ∈ [10, 13], y ∈ [15, 8], with the opening on its north side at y = 8 (post-and-rail fence W/E; open N).
  • South stone wall has a gate at x ∈ [10, 13], y = 15 (split wall + gate posts at x=10 and x=13). So sheep that get penned end up between the fence (N side at y=8) and the south stone wall (with the wooden gate at y=15 currently slightly ajar). The pen is effectively an L-shape inside the field, not external.
  • Spawns: dog at origin (0, 0), 3 sheep around (3, ±2) and (4, 0). Two more sheep are commented out.

Robots — protos

  • Sheep (protos/Sheep.proto): differential drive, wheel radius 0.031 m, axle half-width 0.10 m → wheel base 0.20 m. maxVelocity = 25 rad/s → max linear ≈ 0.78 m/s. Sensors: GPS, Compass, Emitter+Receiver on channel 1. supervisor = TRUE (used to repaint wool pink on pen entry).
  • ShepherdDog (protos/ShepherdDog.proto): differential drive, wheel radius 0.038 m, axle half-width 0.14 m → wheel base 0.28 m. maxVelocity = 70 rad/s → max linear ≈ 2.66 m/s. Sensors: GPS, Compass, Gyro, Accelerometer, Lidar (front-only, FOV 2.44 rad ≈ 140°, 180 rays, range 0.1012 m, noise 0.005), Emitter+Receiver on channel 1, cosmetic ear/tail motors.

Sheep controller — controllers/sheep/{sheep.py,flocking.py}

  • Reynolds-style boid stack: flee (quadratic ramp inside FLEE_DIST=7 m), cohesion (within 8 m), separation (within 2.5 m), wall soft repulsion (margin 5 m), wall hard escape (margin 1 m, gain 50), wander.
  • Pen-aware: sheep below the gate line but outside the gate corridor get a northward "deadzone" assist; on first entry into the pen rectangle, sheep latches penned=True, repaints pink, and switches to in-pen containment + jitter.
  • Driver: heading-error PD on diff-drive (k=4), forward velocity scaled by cos(err), MAX_SPEED=22 (motor units, capped by proto's 25 rad/s).
  • Stuck detector: if displacement < 0.05 m for 20 steps, drives toward field origin to escape wall-pin (a known differential-drive failure mode).

Dog controller — controllers/shepherd_dog/{shepherd_dog.py,strombom.py}

  • Strömbom collect/drive heuristic. CoM-radius gating radius > F·√n with F=2 selects collect (push furthest sheep inward) vs drive (push CoM toward the pen entry point at (11.5, 8.0)).
  • Deadzone rescue: when a sheep is below the gate line and outside the pen's x-corridor, the dog repositions to a "behind the sheep, opposite the pen" stand-off so the sheep's flee vector points back through the gate. Variants 0/1 alternate lateral offset to break corner cycles.
  • Stuck-rescue, EMA action smoothing, target-deadband, RESCUE_SPEED_CAP, cooldown — all empirical fixes for diff-drive oscillation.
  • Logs full per-step debug to dog_behavior_log.csv (currently 7 MB — add to .gitignore).

Deleted training scaffolding (per git status)

  • controllers/shepherd_dog_rl/{shepherd_dog_rl.py, final_model.zip, vecnorm.pkl, plot_debug.py}
  • training/{config.json, herding_env.py, parity_test.py, requirements.txt, train.py, train_at.py, viz.py, runs/.gitkeep}

A previous attempt existed; we'll redesign rather than resurrect, keeping only the lessons (parity-tested env, VecNormalize wrapper, eval cadence).


2. Design decisions

2.1 Pen location — keep inside-field with N gate

The user offered moving the pen external (through a wall hole). Tradeoffs:

Option Pros Cons
(A) Keep inside-field (current) World already built; Strömbom logic already tuned; gate corridor is short Dog must navigate around three pen walls; adds geometric clutter
(B) External pen via wall hole Cleaner field — dog only sees sheep + outer walls; pen as goal region beyond a 3 m hole at y=15 Requires editing field.wbt (split south wall, add external pen walls beyond y<15); existing rescue/deadzone logic must be retuned; outside-field flocking constants don't currently apply

Recommendation: keep (A) for parity with the working Strömbom controller, but add a simplification: widen the pen entrance from 3 m (x ∈ [10, 13]) to 4 m (x ∈ [9.5, 13.5]) and raise the entrance line from y=8 to y=7.5 to give the dog more turning room. Optional later: gate B as a curriculum extension (Section 7).

2.2 Where to train

PPO on Webots directly is too slow (real-time stepping, single env, slow reset). The previous training scaffolding used a Python 2D sim — that is the right approach. Constraints for sim-to-sim transfer:

  1. Use the exact same flocking math: import controllers/sheep/flocking.py from the env, do not reimplement.
  2. Use the same world constants: import controllers/shepherd_dog/strombom.py for pen geometry and Strömbom baseline.
  3. Model differential drive faithfully: match wheel-radius, base, and max wheel-velocity from the proto files. Heading update from (ω_R ω_L)·r / b, position from (ω_R + ω_L)·r / 2.
  4. Match Webots step: basicTimeStep = 16 ms. The sheep controller runs at every basic step; the env will use the same dt = 0.016 s.
  5. Lidar deferred: dog policy will use a symbolic observation (positions of dog + sheep, plus pen geometry) — not raw lidar — for the first iteration. Lidar-from-pixels is a much harder learning problem and isn't required for the herding task. (See Section 7 for an optional later upgrade.)

2.3 Action space for the dog

Two viable choices:

  • (a) High-level velocity vector (vx, vy) ∈ [1, 1]². The same representation Strömbom emits today; the existing drive_action(vx, vy, ...) function in shepherd_dog.py converts this to wheel speeds. Decouples the policy from low-level diff-drive oscillations and enables direct A/B against Strömbom.
  • (b) Direct wheel speeds (ω_L, ω_R) ∈ [1, 1]². More expressive but the policy must learn diff-drive control from scratch — which is exactly the source of the wall-stuck and oscillation pain we're trying to avoid.

Recommendation: (a) — high-level (vx, vy). Reuses the well-tuned drive_action controller, which already handles cos(err) clamping and turn gain. RL focuses on strategy, not actuation.

2.4 Observation space for the dog

Symbolic, fixed-size, normalized to [1, 1]:

Field Dim Notes
Dog (x, y, cos h, sin h) 4 Position normalized by 15
Sheep CoM (x, y) 2 Of active (not-penned) sheep
Sheep dispersion (radius, std-x, std-y) 3 Strömbom collect-vs-drive features
Vector dog→CoM (dx, dy, dist) 3 Helps the value function
Vector dog→pen-entry (dx, dy, dist) 3
Vector furthest-sheep→CoM (dx, dy) 2 Strömbom collect target hint
Min sheep-to-wall distance + min dog-to-wall 2 Safety signal
Active sheep count / N_max 1
8-bin polar histogram of sheep around dog 8 Order-invariant flock shape

Total: 28 features. Order-invariant by construction (histogram + summary stats), so the policy generalizes across flock sizes 1..N_max.

2.5 Reward

Sparse-only is too hard at flock scale; we shape conservatively.

r_t = w_pen     · ΔN_penned                       # +1 per newly penned sheep
    + w_progress· (d_CoM_pen[t-1]  d_CoM_pen[t]) # closer-to-pen progress
    + w_compact· (R[t-1]  R[t])                  # tighter flock progress
     w_time   · 1                                 # constant time penalty
     w_wall   · I(min_wall_dist < 1.0 m)         # dog too close to wall
     w_collide· I(dog within 0.3 m of any sheep) # avoid contact
    + w_done   · I(all sheep penned)              # terminal bonus

Initial weights: w_pen=2.0, w_progress=0.5, w_compact=0.2, w_time=0.005, w_wall=0.01, w_collide=0.05, w_done=10.0. Tune via 1-sheep curriculum first — if the dog learns 1-sheep cleanly, the weights are sane.

2.6 Episode

  • Max steps: 3000 (≈ 48 s at dt=16 ms — generous).
  • Termination: all sheep penned (success), dog/sheep stuck > 600 steps with no progress (failure), step limit (timeout).
  • Reset: domain-randomized — sheep count ∈ {1..N_max}, sheep positions uniform in field minus pen+gate corridor, dog at origin ± U(2, 2).

2.7 Curriculum

Stage N_sheep Duration (steps) Pass criterion
0 1 0.5 M success ≥ 90 %
1 2 1.0 M success ≥ 80 %
2 3 1.5 M success ≥ 70 %
3 1..3 mixed 2.0 M mean reward stable
4 (optional) 5 2.0 M success ≥ 60 %

Implemented by changing only n_sheep in the env reset.


3. Repository layout (new)

project/
├── controllers/
│   ├── sheep/                      # unchanged
│   ├── shepherd_dog/               # Strömbom controller (renamed entry)
│   │   ├── shepherd_dog.py         # mode-switch wrapper: RL | strombom
│   │   ├── strombom.py             # unchanged (canonical Strömbom)
│   │   └── policy_loader.py        # NEW: loads SB3 zip + VecNormalize
│   └── ...
├── herding/                        # NEW: Python package, importable from env + controller
│   ├── __init__.py
│   ├── geometry.py                 # field/pen constants, in_pen(), wall helpers (single source of truth)
│   ├── flocking_sim.py             # vectorised numpy port of flocking.py for fast batched sheep
│   ├── diffdrive.py                # diff-drive integrator matching the proto specs
│   └── obs.py                      # observation builder shared by env and Webots controller
├── training/                       # NEW
│   ├── herding_env.py              # gymnasium.Env, single-agent (the dog)
│   ├── parity_test.py              # asserts env trajectory ≈ Webots trajectory for fixed seeds
│   ├── train_ppo.py                # SB3 PPO entry point
│   ├── eval.py                     # rollout + metrics (success rate, time-to-pen)
│   ├── configs/
│   │   ├── ppo_default.yaml
│   │   └── curriculum.yaml
│   ├── runs/                       # tensorboard + checkpoints (.gitignored)
│   └── requirements.txt
├── docs/
│   └── project.md                  # unchanged
├── plan.md                         # this file
└── ...

herding/ becomes the single source of truth for geometry and dynamics. The Webots controllers and the training env both import from it, so when a constant changes in one place it changes everywhere — eliminating the sim/Webots-drift class of bugs.

This means the existing controllers/sheep/flocking.py and controllers/shepherd_dog/strombom.py become thin shims that re-export from herding/. Webots controllers can import herding/ because Webots adds the project root to sys.path at controller startup; we'll verify.


4. The Gymnasium environment — training/herding_env.py

class HerdingEnv(gymnasium.Env):
    metadata = {"render_modes": ["rgb_array", "human"]}

    def __init__(self, n_sheep=3, max_steps=3000, dt=0.016, seed=None):
        self.action_space      = Box(low=-1, high=1, shape=(2,), dtype=np.float32)
        self.observation_space = Box(low=-1, high=1, shape=(28,), dtype=np.float32)
        ...

    def reset(self, *, seed=None, options=None):
        # Random sheep positions in field \ pen corridor, dog near origin.
        # Optional curriculum: options["n_sheep"] overrides.
        ...

    def step(self, action):
        vx, vy = action  # high-level velocity intent
        # Convert to wheel speeds via the same drive_action inverse used in Webots
        wL, wR = self._diffdrive_inverse(vx, vy, self.dog_state)
        self.dog_state = self._integrate_diffdrive(self.dog_state, wL, wR, self.dt)
        # Step every sheep one boid step (vectorized in flocking_sim.py)
        self.sheep_state = self._step_sheep(self.sheep_state, self.dog_state)
        # Update penned set, compute reward, observation, done flags
        ...

Key points:

  • Vectorised sheep update: re-implements flocking.py in numpy so 100 parallel envs with 5 sheep each take ms, not seconds. Numerical parity with the scalar version is asserted in parity_test.py.
  • Same diff-drive integrator for the dog as Webots will see at inference. Wall + pen-fence collisions clamp position (a Webots-realistic no-pass-through approximation).
  • Domain randomization in reset: sheep count, spawn positions, sheep flock-parameter jitter (±10 % on FLEE_DIST, COHESION_DIST, etc.) for robustness.

5. Training pipeline — training/train_ppo.py

  • Algorithm: SB3 PPO with MlpPolicy, n_steps=2048, batch_size=256, n_epochs=10, gamma=0.995, gae_lambda=0.95, clip_range=0.2, ent_coef=0.005, vf_coef=0.5, learning_rate=3e-4.
  • Vec envs: SubprocVecEnv × 16 parallel envs (the env is pure numpy so subprocs are CPU-cheap).
  • Normalization: VecNormalize(norm_obs=True, norm_reward=True, clip_obs=10.0). Pickled alongside the policy zip — both required at inference.
  • Callbacks:
    • CheckpointCallback every 100 k steps.
    • EvalCallback on a separate eval env (no normalization-update) every 50 k steps; logs success rate and time-to-pen to TensorBoard.
    • Custom CurriculumCallback: bumps n_sheep when eval success rate crosses the stage threshold for 3 consecutive evals.
  • Determinism for debugging: seed-pinned eval env so regressions are catchable.

6. Webots integration — RL inference path

controllers/shepherd_dog/shepherd_dog.py becomes a thin wrapper:

MODE = os.environ.get("HERDING_MODE", "rl")  # "rl" | "strombom"

if MODE == "rl":
    policy = policy_loader.load("training/runs/best/policy.zip",
                                "training/runs/best/vecnormalize.pkl")
    obs_fn = build_obs   # from herding/obs.py
else:
    obs_fn = None        # strombom path uses sheep_positions directly

while robot.step(timestep) != -1:
    receive_messages()
    if MODE == "rl":
        obs = obs_fn(dog_xy, dog_heading, sheep_positions, ...)
        action, _ = policy.predict(obs, deterministic=True)
        vx, vy = action.tolist()
    else:
        vx, vy, mode, dbg = compute_action_debug(dog_xy, sheep_positions, PEN_ENTRY)
        # plus existing rescue/cooldown/EMA layer
    drive_action(vx, vy, ...)

A safety supervisor wraps the RL output: if obs indicates the dog is < 0.6 m from a wall, override with the existing wall-escape behavior (reverse + turn). This is a hard guarantee diff-drive needs because PPO may not discover wall-escape reliably from on-policy data.

policy_loader.py handles the SB3 import lazily so the controller still works with MODE=strombom even if SB3 is not installed in the Webots Python environment.


7. Optional extensions (post-baseline)

  • External pen (Section 2.1 option B): edit field.wbt to extend the south wall hole into an external L-shaped pen with its own walls; update herding/geometry.py; retrain stage 3 only.
  • Lidar observation: replace symbolic obs with 36-bin downsampled lidar + ego state; train end-to-end. Useful as the "extra merit" dimension in the project doc.
  • Two-dog mode: make env multi-agent, train with MAPPO-style shared critic or independent PPO. The proto already supports multiple dog instances; world only needs a second ShepherdDog node.
  • Mecanum comparison: swap the dog proto for a mecanum variant; same policy, different _integrate_diffdrive (becomes holonomic).
  • Sheep flock size scaling: 5, 10, 20 — the obs is order-invariant so the same policy generalises; just curriculum further.

8. Risks & mitigations

Risk Mitigation
Sim-to-Webots gap (sheep dynamics, wall friction) parity_test.py asserts trajectory match within tolerance for fixed seeds; if it fails, fix the env, not the policy
Dog learns to wall-pin sheep against fence Add w_collide penalty + min-sheep-to-wall term in obs; curriculum from 1 sheep first
PPO oscillation collapses into spinning Action smoothing in env step (EMA on (vx, vy), mirroring ACTION_SMOOTH=0.35 from Strömbom controller); reward small ‖a_t a_{t-1}‖ penalty
Pen approach failures (sheep refuse gate) Reuse the existing deadzone_rescue as a scripted fallback triggered when a sheep has been deadzoned > 200 steps — RL handles the common case, scripted handles the corner
Gym version mismatch (gymnasium vs gym) Lock to gymnasium>=0.29, stable-baselines3>=2.3 in requirements

9. Milestones (suggested order of implementation)

  1. M0 — Refactor (no behavior change): create herding/ package, move constants out of flocking.py/strombom.py, leave shims; verify Webots still runs Strömbom unchanged. Add dog_behavior_log.csv to .gitignore.
  2. M1 — Env & parity: herding_env.py, parity_test.py. Asserts sheep + dog trajectories match Webots within tolerance for 5 fixed seeds. Done when parity test green.
  3. M2 — PPO baseline: train Stage 0 (1 sheep) for 0.5 M steps; eval in env at ≥ 90 % success.
  4. M3 — Webots inference: load Stage 0 policy in shepherd_dog.py with HERDING_MODE=rl; verify the dog herds 1 sheep into the pen in the actual Webots world. This is the sim-to-sim transfer gate.
  5. M4 — Curriculum: stages 13, ~5 M steps total, with checkpoints and eval logs.
  6. M5 — Strömbom comparison: run both controllers on a fixed eval suite (same seeds, 1/2/3 sheep), log success rate and time-to-pen. This is a deliverable for the project's "quantitative evaluation" goal.
  7. M6 — Documentation: a short README in training/ showing how to train, evaluate, and switch modes in Webots.

Each milestone is independently demoable. M0M3 is the critical path to "RL works in Webots"; M4M6 polishes it for the project deliverable.


10. Decisions (locked in by implementation)

  • Pen layout: option B (external pen). The pen sits south of the field at x ∈ [10, 13], y ∈ [-22, -15] and is reached through the existing 3 m gap in the south stone wall. The old in-field quarantine fence is gone and the wooden gate is modeled as swung-open and parked on the west gate post so the corridor is unobstructed. This kills the deadzone class entirely.
  • Flock size: 1..10 sheep, sampled uniformly each reset. The order- invariant observation (CoM, dispersion, polar histogram) lets a single policy generalise across the whole range. A curriculum widens max_n_sheep from 1 to 10 over training to keep early exploration tractable.
  • Single-sheep mode: handled by the same policy (n_sheep=1 is the first stage of the curriculum and stays in the training distribution throughout). No separate model.
  • Hardware: GPU for training. SubprocVecEnv × 16 on CPU feeds an MlpPolicy on GPU; ~23 h for the full curriculum.

11. What was built

herding/                    # single source of truth, importable from both
  geometry.py               # field/pen constants, latch helpers, robot specs
  flocking_sim.py           # Reynolds boid step (matches Webots controller)
  diffdrive.py              # diff-drive kinematics + velocity↔wheels
  obs.py                    # 28-D order-invariant observation builder
  strombom.py               # collect/drive heuristic (baseline + fallback)

worlds/field.wbt            # external pen south of field, 10 sheep slots,
                            # gate parked open, in-field fence removed

controllers/sheep/sheep.py            # imports from herding/, latches on
                                      # is_penned_position
controllers/shepherd_dog/
  shepherd_dog.py           # mode switch (HERDING_MODE=rl|strombom),
                            # safety supervisor for DOG_SOUTH_LIMIT
  policy_loader.py          # lazy SB3 zip + VecNormalize loader
  strombom.py               # shim re-exporting herding.strombom

training/
  herding_env.py            # gymnasium.Env, action smoothing, reward shaping
  train_ppo.py              # SB3 PPO with VecNormalize, eval, checkpoints,
                            # curriculum callback
  eval.py                   # success-rate / time-to-pen across n_sheep
  parity_test.py            # shape, determinism, baseline-rollout smoke test
  configs/ppo_default.yaml
  requirements.txt
  README.md                 # how to train, evaluate, switch modes in Webots

12. To run

# 1. Install deps (CUDA-enabled torch wheel for GPU)
pip install -r training/requirements.txt

# 2. Smoke test
python -m training.parity_test

# 3. Train (5 M steps, ~23 h on a single GPU)
python -m training.train_ppo --out-dir training/runs/baseline

# 4. Evaluate vs Strömbom
python -m training.eval --policy training/runs/baseline/best
python -m training.eval --policy strombom

# 5. Run in Webots
export HERDING_MODE=rl
export HERDING_POLICY_DIR=$PWD/training/runs/baseline/best
webots worlds/field.wbt