# Autonomous Shepherd-Dog Herding (Webots + RL)

Group G25 — *Diogo Costa, Johnny Fernandes, Nelson Neto*

A differential-drive shepherd dog that herds 1–10 sheep through a 3 m
gate into an external pen. The dog has three deployable modes:

| Mode | Source | Role |
|---|---|---|
| `strombom` | Strömbom et al. (2014) collect/drive heuristic | Analytic baseline |
| `bc` | Behaviour cloning of the Strömbom teacher | Imitation learning result |
| `rl` | KL-regularised PPO fine-tune of `bc` | Reward-driven refinement |

`sequential` (single-target pin-and-push) is kept as an alternative
analytic baseline.
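
For readers unfamiliar with the `strombom` baseline, the collect/drive rule from Strömbom et al. (2014) can be sketched as below. This is a minimal illustration, not the code in `herding/control/strombom.py`: the function name `strombom_waypoint`, the default `r_a = 2.0`, and the point representation are assumptions made for the example.

```python
import math


def strombom_waypoint(sheep, goal, r_a=2.0):
    """Return (phase, (x, y)): the point the dog should head for next.

    sheep: list of (x, y) positions; goal: (x, y) of the pen entrance.
    r_a is the sheep-sheep interaction distance (assumed default).
    """
    n = len(sheep)
    com = (sum(x for x, _ in sheep) / n, sum(y for _, y in sheep) / n)
    stray = max(sheep, key=lambda s: math.dist(s, com))
    if math.dist(stray, com) > r_a * n ** (2.0 / 3.0):
        # Collect: stand r_a behind the stray sheep, on the side away from the CoM.
        phase, anchor, away_from, offset = "collect", stray, com, r_a
    else:
        # Drive: stand r_a * sqrt(n) behind the CoM, on the side away from the goal.
        phase, anchor, away_from, offset = "drive", com, goal, r_a * math.sqrt(n)
    dx, dy = anchor[0] - away_from[0], anchor[1] - away_from[1]
    norm = math.hypot(dx, dy) or 1.0  # avoid dividing by zero when the points coincide
    return phase, (anchor[0] + offset * dx / norm, anchor[1] + offset * dy / norm)
```

A compact flock yields a `"drive"` waypoint behind the centre of mass; a flock with one distant sheep yields a `"collect"` waypoint behind that sheep.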

## Perception

The dog perceives sheep **only through its front-mounted 140° LiDAR**
(180 rays, 12 m max range — see `protos/ShepherdDog.proto`). Each
control step:

1. Read `lidar.getRangeImage()`,
2. Cluster returns into world-frame `(x, y)` estimates
   (`herding/perception/lidar_perception.py`),
3. Fold them into a multi-target tracker that maintains last-seen
   positions for sheep currently outside the FOV
   (`herding/perception/sheep_tracker.py`).

**LiDAR validation** (intermediate-goal item v from `docs/project.md`):
during development, a diagnostic-dump controller captured 80 real
Webots scans together with the ground-truth (GT) sheep positions.
After the `+SHEEP_RADIUS` surface-to-centre correction, the clustered
centroids matched the GT positions to within 0.15 m, which validates
the LiDAR pipeline as a source of sheep-position estimates for the
herding task.
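
The clustering step and the surface-to-centre correction can be illustrated with a short sketch. It is not the code in `lidar_perception.py`: the function name `cluster_scan`, the `gap` threshold, the `SHEEP_RADIUS` value, and the left-to-right ray ordering are all assumptions made for the example.

```python
import math

SHEEP_RADIUS = 0.3  # metres; assumed value, the real constant lives in the repo


def cluster_scan(ranges, dog_pose, fov_deg=140.0, max_range=12.0, gap=0.4):
    """Group consecutive LiDAR hits and return world-frame centre estimates."""
    x0, y0, heading = dog_pose
    n = len(ranges)
    fov = math.radians(fov_deg)
    clusters, current = [], []
    for i, r in enumerate(ranges):
        hit = math.isfinite(r) and r < max_range
        # Close the current cluster on a miss or a range jump larger than `gap`.
        if current and (not hit or abs(r - current[-1][1]) > gap):
            clusters.append(current)
            current = []
        if hit:
            # Assumed convention: ray 0 is the leftmost ray of the field of view.
            angle = fov / 2 - i * fov / max(n - 1, 1)
            current.append((angle, r))
    if current:
        clusters.append(current)

    centres = []
    for cluster in clusters:
        angle = sum(a for a, _ in cluster) / len(cluster)
        # The nearest return lies on the sheep's surface; add the radius to reach its centre.
        rng = min(r for _, r in cluster) + SHEEP_RADIUS
        centres.append((x0 + rng * math.cos(heading + angle),
                        y0 + rng * math.sin(heading + angle)))
    return centres
```

Without the `+ SHEEP_RADIUS` term, every estimate would sit on the near surface of the sheep and be biased towards the dog by one radius.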

The tracker outputs a `{name: (x, y)}` dict shaped exactly like the
prior receiver-based one, so Strömbom, Sequential, and the BC
observation builder all run unchanged on top of it. The 2D Gymnasium
env raycasts sheep discs at training time
(`herding/perception/lidar_sim.py`), so demos collected in the env
match the perception the deployed controller sees in Webots.
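
A nearest-neighbour tracker with last-seen memory can be sketched in a few lines. This is an illustration of the idea only, not the implementation in `sheep_tracker.py`: the class name, the `gate` distance, and the `sheep_<k>` naming scheme are assumptions.

```python
import math


class NearestNeighbourTracker:
    """Keeps a {name: (x, y)} dict; unseen sheep retain their last-seen position."""

    def __init__(self, gate=1.0):
        self.gate = gate   # maximum match distance in metres (assumed value)
        self.tracks = {}   # name -> last-seen (x, y)

    def update(self, detections):
        unmatched = set(self.tracks)
        for det in detections:
            # Greedy association: nearest track not yet claimed in this step.
            best = min(unmatched,
                       key=lambda name: math.dist(self.tracks[name], det),
                       default=None)
            if best is not None and math.dist(self.tracks[best], det) <= self.gate:
                unmatched.discard(best)
                self.tracks[best] = det
            else:
                # No track close enough: start a new one.
                self.tracks[f"sheep_{len(self.tracks)}"] = det
        # Tracks left in `unmatched` are untouched, which is the FOV memory.
        return dict(self.tracks)
```

Because the return value is a plain `{name: (x, y)}` dict, downstream consumers do not need to know whether a position was observed this step or remembered.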

Privileged ground-truth perception is available for ablation —
`HerdingEnv(use_lidar=False)`.

## Quick start

```bash
# 1. Set up the Python env (any venv with PyTorch + SB3)
pip install -r training/requirements.txt

# 2. Smoke test
python -m tests.parity_test

# 3. Reproduce the BC policy (~10 min on CPU: ~5 min demos + ~3 min BC)
python -m tools.collect_demos --teacher strombom \
    --out training/demos.npz --seeds-per-n 15 --subsample 3 --frame-stack 4
python -m training.bc_pretrain --demos training/demos.npz \
    --out training/runs/bc --epochs 60 --net-arch 512,512

# 4. KL-PPO fine-tune of the BC policy (~30 min on CPU, 1 M steps)
python -m training.train_ppo \
    --bc training/runs/bc \
    --out training/runs/rl \
    --total-timesteps 1000000

# 5. Evaluate (env)
python -m training.eval --policy training/runs/rl \
    --max-flock 10 --max-steps 15000 --n-seeds 10

# 6. Run in Webots
tools/run_webots.sh 10 bc          # behaviour-cloned MLP
tools/run_webots.sh 10 rl          # KL-PPO fine-tune
tools/run_webots.sh 10 strombom    # analytic baseline
```

## Layout

```
herding/                  — perception / control / world primitives
  obs.py                  — 32-D order-invariant observation builder
  world/                  — environment-side physics & geometry
    geometry.py             field/pen constants, robot specs
    diffdrive.py            differential-drive kinematics
    flocking_sim.py         Reynolds + Strömbom 2014 sheep dynamics
  perception/             — LiDAR → tracked-sheep pipeline
    lidar_sim.py            fast 2D raycast for the env
    lidar_perception.py     scan → world-frame cluster centroids + filters
    sheep_tracker.py        multi-target NN tracker with FOV memory
  control/                — every dog mode's action source
    strombom.py             canonical CoM collect/drive heuristic
    sequential.py           single-target "pin-and-push" alternative
    active_scan.py          wraps a base teacher with opening rotation +
                            walk-to-centre fallback
    modulation.py           shared near-sheep speed-modulation helper

controllers/
  sheep/sheep.py          — Webots sheep controller (uses herding.world.flocking_sim)
  shepherd_dog/
    shepherd_dog.py       — Webots dog controller, mode-switched
    policy_loader.py      — lazy SB3 policy loader (auto-detects frame stack)

training/
  herding_env.py          — Gymnasium env (LiDAR + tracker by default)
  bc_pretrain.py          — supervised BC of (obs, action) demos into MLP
  train_ppo.py            — KL-regularised PPO fine-tune of BC
  eval.py                 — analytic + learned policy comparison harness
  runs/                   — checkpoints (whitelisted in .gitignore)
  requirements.txt

tests/
  parity_test.py          — shape / determinism / baseline smoke test

tools/
  collect_demos.py        — sim demos via the active-scan teacher
  run_webots.sh           — launch Webots with N sheep + chosen mode

worlds/
  field.wbt               — main world (3 m gate, external pen)

protos/                   — Sheep / ShepherdDog robot definitions
docs/project.md           — original project goals
```

## Shared low-level control

Every dog mode (Strömbom, Sequential, BC, RL) routes its action
through `herding/control/modulation.py:modulate_speed_near_sheep`,
which scales action magnitude down when within ~2.5 m of the nearest
tracked sheep. This stops the dog from charging in at full speed and
scattering the flock. Direction (intent) is preserved.
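
A minimal sketch of this kind of modulation follows. It is not the body of `modulate_speed_near_sheep`: the function name `scale_action_near_sheep`, the linear ramp, and the `MIN_SCALE` floor are assumptions; only the ~2.5 m radius comes from the text above.

```python
import math

SLOW_RADIUS = 2.5  # metres, from the description above
MIN_SCALE = 0.3    # assumed floor so the dog never stalls completely


def scale_action_near_sheep(action, dog_xy, tracked_sheep):
    """Scale every action component by the same factor, preserving direction.

    tracked_sheep is a {name: (x, y)} dict such as the tracker's output.
    """
    if not tracked_sheep:
        return tuple(action)
    nearest = min(math.dist(dog_xy, pos) for pos in tracked_sheep.values())
    # Linear ramp: full speed at SLOW_RADIUS and beyond, slower as the dog closes in.
    scale = max(MIN_SCALE, min(1.0, nearest / SLOW_RADIUS))
    return tuple(scale * a for a in action)
```

Applying one scalar to all components is what keeps the direction (intent) unchanged while reducing the magnitude.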

All modes also share the same exponential-moving-average (EMA) action
smoother, configured by `ACTION_SMOOTH = 0.55` in
`controllers/shepherd_dog/shepherd_dog.py`.
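
An EMA smoother of this kind can be sketched as below. The helper name `smooth_action` is hypothetical, and the example assumes that `ACTION_SMOOTH` weights the previous smoothed action; the controller may use the opposite convention.

```python
ACTION_SMOOTH = 0.55  # value quoted above


def smooth_action(previous, new, alpha=ACTION_SMOOTH):
    """Blend the new action into the previous smoothed action.

    Assumed convention: alpha is the weight of the previous action,
    so a larger alpha means heavier smoothing.
    """
    return tuple(alpha * p + (1.0 - alpha) * n for p, n in zip(previous, new))


# Example: starting from rest, a full-speed command is applied gradually.
state = (0.0, 0.0)
for _ in range(3):
    state = smooth_action(state, (1.0, 1.0))
```

Under this convention, the remaining gap to a constant command shrinks by a factor of 0.55 per control step.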

## Results — env eval, 10 seeds × n=1..10

`max_steps=15000`, full-field spawn distribution. The first table
gives the success rate per flock size; the second gives the mean
number of sheep penned per episode for selected flock sizes.

### Success rate (%)

| n  | Strömbom | `bc` | `rl` |
|---:|---:|---:|---:|
|  1 |  30 |  80 | **90** |
|  2 |  90 |  50 | **90** |
|  3 |  60 |  90 | **90** |
|  4 |  40 |  80 | **90** |
|  5 |  60 |  70 | **100** |
|  6 |  30 |  80 | **80** |
|  7 |  70 |  80 | **100** |
|  8 |  30 | 100 | **100** |
|  9 |  40 |  90 | **100** |
| 10 |  50 | 100 | **100** |

### Mean penned per episode (out of n)

| n  | Strömbom | `bc` | `rl` |
|---:|---:|---:|---:|
|  1 | 0.30 | 0.80 | **0.90** |
|  5 | 3.90 | 4.10 | **5.00** |
|  8 | 4.20 | 8.00 | **8.00** |
| 10 | 7.40 | 10.00 | **10.00** |

### Takeaways

- **BC beats Strömbom at nine of ten flock sizes** under realistic
  LiDAR conditions (full field, partial observability); the exception
  is n=2, where Strömbom reaches 90% against BC's 50%. Strömbom
  struggles most at n=1 (30%), where the single sheep can spawn beyond
  the LiDAR's 12 m range; BC learned active perception from the demos.
- **RL refines BC** without regressing on any cell: it ties or beats
  BC at every flock size. The biggest gains are at n=2 (50% → 90%) and
  n=5 (70% → 100%), where BC's imitation of Strömbom's drive heuristic
  was sub-optimal.
- **Aggressive reward shaping doesn't help** — a more aggressive
  variant (β=0.02, W_TIME=-0.1, W_IMITATE=0, 3 M steps) trained as
  an ablation was strictly worse than the conservative tune shipped
  here (β=0.05, W_IMITATE=0.5, 1 M steps).
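
To make the role of β concrete, the sketch below adds a KL penalty towards the BC policy to a PPO loss. It assumes that β is the KL coefficient, that both policies are diagonal Gaussians, and that the penalty is KL(current ‖ BC); none of this is confirmed by `train_ppo.py`, and the function names are hypothetical.

```python
import math


def gaussian_kl(mu, std, mu_ref, std_ref):
    """KL(N(mu, std^2) || N(mu_ref, std_ref^2)) summed over action dimensions."""
    return sum(
        math.log(s_r / s) + (s * s + (m - m_r) ** 2) / (2.0 * s_r * s_r) - 0.5
        for m, s, m_r, s_r in zip(mu, std, mu_ref, std_ref)
    )


def kl_regularised_loss(ppo_loss, mu, std, mu_bc, std_bc, beta=0.05):
    """PPO loss plus beta times the divergence from the frozen BC policy."""
    return ppo_loss + beta * gaussian_kl(mu, std, mu_bc, std_bc)
```

Under these assumptions, a smaller β lets the fine-tuned policy drift further from the BC policy for the same penalty, which is consistent with the β=0.02 variant being described as the more aggressive one.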

## License

Educational project for the *Topics in Intelligent Robotics* course.