Webots sim-to-real fixes, DAgger pipeline, 360° proto variant

Today's session worked across the full Webots delivery stack: found and
fixed a cluster of bugs blocking the BC/RL transfer, then explored
training-side mitigations for the residual perception gap.

Bug fixes:
- Makefile FP_RATE default 2.0 → 0.0: BC demos used fp_rate=0 but RL
  fine-tune defaulted to fp_rate=2, poisoning the BC obs distribution
  and stalling PPO at 0% success across 1.46M+ steps.
- controllers/{shepherd_dog,sheep}/runtime.ini: Webots was launching
  controllers under system python3 (no numpy) and they were crashing
  silently. Pinned to the conda tir env.
- herding/config.py HERDING_WEBOTS preset: pen_latch_depth 0.5 → 2.0,
  max_new_tracks_per_step 3 → 1, static_reject 0.8 → 1.2. Stops phantom
  FPs near the gate from latching as permanently-penned tracks.
- herding/perception/sheep_tracker.py: penned tracks now decay at
  forget_steps × 8 instead of living forever. Adds get_positions
  min_freshness filter for deploy-time use.

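The penned-track decay and freshness filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in sheep_tracker.py: only forget_steps, get_positions and min_freshness come from this commit message, while Track, max_age, prune and the definition of freshness as 1 - age / max_age are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """Hypothetical track record; field names are illustrative only."""
    x: float
    y: float
    penned: bool = False
    age_since_seen: int = 0  # steps since the track last matched a detection


def max_age(track, forget_steps, penned_multiplier=8):
    """Penned tracks live 8x longer than free tracks, but no longer forever."""
    return forget_steps * penned_multiplier if track.penned else forget_steps


def prune(tracks, forget_steps):
    """Drop every track that has gone unseen for longer than its max age."""
    return [t for t in tracks if t.age_since_seen <= max_age(t, forget_steps)]


def get_positions(tracks, forget_steps, min_freshness=0.0):
    """Return (x, y) for tracks whose freshness is at least min_freshness.

    Freshness is assumed to be 1 - age / max_age, so 1.0 means "seen this
    step" and 0.0 means "about to be forgotten".
    """
    positions = []
    for t in tracks:
        freshness = 1.0 - t.age_since_seen / max_age(t, forget_steps)
        if freshness >= min_freshness:
            positions.append((t.x, t.y))
    return positions
```

Under these assumptions, a deploy-time caller would pass a non-zero min_freshness so that stale phantom tracks stop influencing the policy before they are pruned.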
Training/eval matches deployment:
- training/bc/collect.py: adds a --dagger-policy flag for DAgger rollouts
  (policy drives, teacher labels) and --use-webots-preset for the matched
  140° tracker + DR config.
- controllers/shepherd_dog/shepherd_dog.py: scan-fallback (0, 0.6) when
  BC/RL sees empty sheep_positions; recovers from FOV gaps.

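For readers unfamiliar with DAgger, the "policy drives, teacher labels" loop can be sketched on a toy 1-D problem. Everything below is hypothetical and stands in for the real collect.py / BC training code: the teacher, the linear policy and the dynamics are invented for illustration, and only the collect + concat + retrain structure mirrors what this commit describes.

```python
import random


def teacher(obs):
    """Toy expert: always steer the state back toward zero."""
    return -obs


def dagger_rollout(policy, steps=50, seed=0):
    """Let the learner drive, but record the teacher's action as the label."""
    rng = random.Random(seed)
    obs = rng.uniform(-1.0, 1.0)
    data = []
    for _ in range(steps):
        action = policy(obs)               # learner chooses the action
        data.append((obs, teacher(obs)))   # teacher labels the visited state
        obs = obs + 0.1 * action + rng.gauss(0.0, 0.01)  # toy dynamics
    return data


def fit_linear(dataset):
    """Least-squares fit of action = k * obs; stands in for BC training."""
    num = sum(o * a for o, a in dataset)
    den = sum(o * o for o, _ in dataset) or 1.0
    k = num / den
    return lambda obs: k * obs


def dagger_round(dataset, policy, fit, seed=0):
    """One round: collect under the current policy, concat, retrain."""
    dataset = dataset + dagger_rollout(policy, seed=seed)
    return dataset, fit(dataset)
```

The point of the scheme is that labels are gathered on the states the learner actually visits, rather than only on the teacher's own trajectories.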
Tooling:
- tools/dagger_round.sh: one-shot DAgger round (collect + concat + bc).
- tools/webots_sweep_gt.sh: full sweep with HERDING_USE_GT=1 for the
  perception-gap diagnosis matrix.
- protos/ShepherdDog360.proto: 360° FOV variant for the FOV-ablation
  comparison. Canonical proto stays at 140° per project spec.

Artifacts: v1 BC/RL policies for all 4 (drive × world) combos trained
in clean gym (success: diff/field 90-100%, diff/round 58%, mec/field
60-100%, mec/round 50-100%). DAgger r1/r2 BCs for diff/field show
12%→38% progression on the gym HERDING_WEBOTS proxy but did not close
the gap to actual Webots LiDAR (0/5 throughout). Next: LSTM policy or
learned tracker per the project-state memory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -51,25 +51,57 @@ RL_FAST_POLICY = $(RL_FAST_DIR)/policy.zip

# --- Demo collection ---
TEACHER ?= universal
# Round field is fundamentally harder (narrow gate at south of a circle).
# Default to more demos there to give BC a fair shot at 60%+.
# Mecanum has more complex dynamics and a weaker teacher imitation signal
# (val_cos ≈ 0.70 vs ≥ 0.88 for differential). Give it more demos and
# longer BC training to compensate.
ifeq ($(DRIVE),mecanum)
ifeq ($(WORLD),field_round)
SEEDS_PER_N ?= 80
else
SEEDS_PER_N ?= 50
endif
else
# Round field is harder; more demos give BC a fair shot at 60%+.
ifeq ($(WORLD),field_round)
SEEDS_PER_N ?= 60
else
SEEDS_PER_N ?= 25
endif
endif
SUBSAMPLE ?= 3
FRAME_STACK ?= 4
DEMO_MAX_STEPS ?= 100000

# --- Behaviour cloning ---
ifeq ($(DRIVE),mecanum)
ifeq ($(WORLD),field_round)
BC_EPOCHS ?= 200
else
BC_EPOCHS ?= 100
endif
else
ifeq ($(WORLD),field_round)
BC_EPOCHS ?= 150
else
BC_EPOCHS ?= 60
endif
endif
BC_NET_ARCH ?= 512,512

# --- Domain randomisation (used by bc_demos and rl targets) ---
# FP_RATE: mean false-positive detections injected per step (Poisson λ).
# ACTION_SMOOTH_TRAIN: EMA on actions to match Webots controller (0.55).
# WHEEL_SLIP_STD: Gaussian wheel-speed noise for mecanum dynamics gap.
#
# FP_RATE is used consistently in BC demos *and* RL: BC collection runs
# in PRIVILEGED mode (teacher sees GT; student obs sees the FP-injected
# tracker output), so the policy learns to denoise to the GT signal.
# Mismatched FP_RATE between BC/RL was the root cause of an earlier
# regression (BC=0, RL=2 → PPO stalled at 0% success).
FP_RATE ?= 0.0
ACTION_SMOOTH_TRAIN ?= 0.55
WHEEL_SLIP_STD ?= 0.05

# --- KL-PPO fine-tune ---
# Round field: longer training, looser KL, no time penalty (success
# must be learned before speed is rewarded).
@@ -93,9 +125,16 @@ DIFFICULTY ?= 1.0
# --- Stage-2 "speed pass" (rl_fast) ---
# Continues from RL_DIR with a negative TIME_W. Tighter KL keeps the
# policy near the Stage-1 success rate while step-count drops.
# Differential and mecanum respond differently: mecanum needs a stronger
# time penalty to achieve speed gains; differential only needs a light
# touch (-0.02) — stronger penalties trade success for speed without gain.
RL_FAST_STEPS ?= 1000000
RL_FAST_KL ?= 0.05
ifeq ($(DRIVE),mecanum)
RL_FAST_TIME_W ?= -0.05
else
RL_FAST_TIME_W ?= -0.02
endif

# --- Evaluation ---
EVAL_SEEDS ?= 10
@@ -107,7 +146,7 @@ MODE ?= rl


.PHONY: all bc_demos bc rl rl_fast eval eval_fast eval_all eval_all_fast \
test webots clean clean_all help \
test webots webots_sweep clean clean_all help \
train_all train_diff_rect train_diff_round \
train_mec_rect train_mec_round \
train_all_fast train_diff_rect_fast train_diff_round_fast \
@@ -129,7 +168,10 @@ $(BC_DEMOS):
--seeds-per-n $(SEEDS_PER_N) --subsample $(SUBSAMPLE) \
--frame-stack $(FRAME_STACK) --drive-mode $(DRIVE) \
--world $(WORLD) \
--max-steps $(DEMO_MAX_STEPS)
--max-steps $(DEMO_MAX_STEPS) \
--fp-rate $(FP_RATE) \
--action-smooth $(ACTION_SMOOTH_TRAIN) \
--wheel-slip-std $(WHEEL_SLIP_STD)

bc: $(BC_POLICY)
$(BC_POLICY): $(BC_DEMOS)
@@ -144,7 +186,10 @@ $(RL_POLICY): $(BC_POLICY)
--total-timesteps $(PPO_STEPS) --kl-coef $(KL) \
--imitate-weight $(IMITATE) --time-weight $(TIME_W) \
--difficulty $(DIFFICULTY) \
--drive-mode $(DRIVE) --world $(WORLD)
--drive-mode $(DRIVE) --world $(WORLD) \
--fp-rate $(FP_RATE) \
--action-smooth $(ACTION_SMOOTH_TRAIN) \
--wheel-slip-std $(WHEEL_SLIP_STD)

eval: $(RL_POLICY)
$(PY) -m training.eval --policy $(RL_DIR) \
@@ -162,7 +207,10 @@ $(RL_FAST_POLICY): $(RL_POLICY)
--total-timesteps $(RL_FAST_STEPS) --kl-coef $(RL_FAST_KL) \
--imitate-weight $(IMITATE) --time-weight $(RL_FAST_TIME_W) \
--difficulty $(DIFFICULTY) \
--drive-mode $(DRIVE) --world $(WORLD)
--drive-mode $(DRIVE) --world $(WORLD) \
--fp-rate $(FP_RATE) \
--action-smooth $(ACTION_SMOOTH_TRAIN) \
--wheel-slip-std $(WHEEL_SLIP_STD)

eval_fast: $(RL_FAST_POLICY)
$(PY) -m training.eval --policy $(RL_FAST_DIR) \
@@ -175,6 +223,14 @@ test:
webots:
tools/run_webots.sh $(N) $(MODE) $(DRIVE) $(WORLD)

# Headless sweep across all modes × worlds × flock sizes.
# Results are written to webots_sweep.log.
# Set USE_GT=1 to bypass LiDAR tracker (isolate perception from policy).
webots_sweep:
env $(if $(USE_GT),HERDING_USE_GT=1,) \
PATH="$(CONDA_PREFIX)/bin:$(PATH)" \
bash tools/webots_sweep.sh webots_sweep.log

clean:
rm -f $(BC_DEMOS)
rm -rf $(BC_DIR) $(RL_DIR)
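
The three domain-randomisation knobs documented in the Makefile comments (FP_RATE as a Poisson λ, ACTION_SMOOTH_TRAIN as an action EMA, WHEEL_SLIP_STD as Gaussian wheel-speed noise) could be applied roughly as below. This is a sketch under stated assumptions, not the project's implementation: the function names are hypothetical, phantoms are assumed to be uniform over a square arena, the EMA is assumed to weight the previous action by alpha, and slip is assumed to be multiplicative.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's method (fine for small lam)."""
    if lam <= 0.0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def inject_false_positives(detections, fp_rate, arena_half_size, rng):
    """Append Poisson(fp_rate) phantom detections (assumed uniform in a square)."""
    phantoms = [
        (rng.uniform(-arena_half_size, arena_half_size),
         rng.uniform(-arena_half_size, arena_half_size))
        for _ in range(sample_poisson(fp_rate, rng))
    ]
    return list(detections) + phantoms


def smooth_action(prev, new, alpha):
    """EMA on actions; alpha is assumed to weight the previous action."""
    return tuple(alpha * p + (1.0 - alpha) * n for p, n in zip(prev, new))


def apply_wheel_slip(wheel_speeds, slip_std, rng):
    """Scale each wheel speed by (1 + N(0, slip_std)); multiplicative by assumption."""
    return [w * (1.0 + rng.gauss(0.0, slip_std)) for w in wheel_speeds]
```

With fp_rate=0.0 the observation passes through unchanged, while fp_rate=2.0 adds about two phantoms per step on average. That is the BC=0 / RL=2 distribution shift the FP_RATE comment identifies as the root cause of the earlier regression.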