Using cpu device
Logging to runs/ppo_fix_check/ppo_1
------------------------------
| time/ | |
| fps | 5021 |
| iterations | 1 |
| time_elapsed | 3 |
| total_timesteps | 16384 |
------------------------------
------------------------------------------
| time/ | |
| fps | 4241 |
| iterations | 2 |
| time_elapsed | 7 |
| total_timesteps | 32768 |
| train/ | |
| approx_kl | 0.0047510993 |
| clip_fraction | 0.0344 |
| clip_range | 0.2 |
| entropy_loss | -2.85 |
| explained_variance | 0.786 |
| learning_rate | 0.0003 |
| loss | -0.00995 |
| n_updates | 10 |
| policy_gradient_loss | -0.00156 |
| std | 1.01 |
| value_loss | 0.0657 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 4026 |
| iterations | 3 |
| time_elapsed | 12 |
| total_timesteps | 49152 |
| train/ | |
| approx_kl | 0.0032065492 |
| clip_fraction | 0.0328 |
| clip_range | 0.2 |
| entropy_loss | -2.88 |
| explained_variance | 0.868 |
| learning_rate | 0.0003 |
| loss | -0.0327 |
| n_updates | 20 |
| policy_gradient_loss | -0.00152 |
| std | 1.02 |
| value_loss | 0.0172 |
------------------------------------------
/home/jalf/miniconda3/envs/tir/lib/python3.12/site-packages/stable_baselines3/common/evaluation.py:71: UserWarning: Evaluation environment is not wrapped with a ``Monitor`` wrapper. This may result in reporting modified episode lengths and rewards, if other wrappers happen to modify these. Consider wrapping environment first with ``Monitor`` wrapper.
  warnings.warn(
Eval num_timesteps=50000, episode_reward=-25.33 +/- 56.30
Episode length: 1859.00 +/- 393.69
------------------------------------------
| eval/ | |
| mean_ep_length | 1.86e+03 |
| mean_reward | -25.3 |
| time/ | |
| total_timesteps | 50000 |
| train/ | |
| approx_kl | 0.0038272792 |
| clip_fraction | 0.0312 |
| clip_range | 0.2 |
| entropy_loss | -2.89 |
| explained_variance | 0.891 |
| learning_rate | 0.0003 |
| loss | -0.0224 |
| n_updates | 30 |
| policy_gradient_loss | -0.0019 |
| std | 1.02 |
| value_loss | 0.0227 |
------------------------------------------
New best mean reward!
------------------------------
| time/ | |
| fps | 2387 |
| iterations | 4 |
| time_elapsed | 27 |
| total_timesteps | 65536 |
------------------------------
------------------------------------------
| time/ | |
| fps | 2563 |
| iterations | 5 |
| time_elapsed | 31 |
| total_timesteps | 81920 |
| train/ | |
| approx_kl | 0.0040233894 |
| clip_fraction | 0.0323 |
| clip_range | 0.2 |
| entropy_loss | -2.87 |
| explained_variance | 0.878 |
| learning_rate | 0.0003 |
| loss | -0.0251 |
| n_updates | 40 |
| policy_gradient_loss | -0.00247 |
| std | 1.01 |
| value_loss | 0.0169 |
------------------------------------------
-----------------------------------------
| time/ | |
| fps | 2719 |
| iterations | 6 |
| time_elapsed | 36 |
| total_timesteps | 98304 |
| train/ | |
| approx_kl | 0.003573698 |
| clip_fraction | 0.0316 |
| clip_range | 0.2 |
| entropy_loss | -2.86 |
| explained_variance | 0.865 |
| learning_rate | 0.0003 |
| loss | -0.0219 |
| n_updates | 50 |
| policy_gradient_loss | -0.0019 |
| std | 1.01 |
| value_loss | 0.022 |
-----------------------------------------
/home/jalf/miniconda3/envs/tir/lib/python3.12/site-packages/stable_baselines3/common/evaluation.py:71: UserWarning: Evaluation environment is not wrapped with a ``Monitor`` wrapper. This may result in reporting modified episode lengths and rewards, if other wrappers happen to modify these. Consider wrapping environment first with ``Monitor`` wrapper.
  warnings.warn(
Eval num_timesteps=100000, episode_reward=-29.60 +/- 36.59
Episode length: 1939.35 +/- 264.37
------------------------------------------
| eval/ | |
| mean_ep_length | 1.94e+03 |
| mean_reward | -29.6 |
| time/ | |
| total_timesteps | 100000 |
| train/ | |
| approx_kl | 0.0046861977 |
| clip_fraction | 0.039 |
| clip_range | 0.2 |
| entropy_loss | -2.86 |
| explained_variance | 0.815 |
| learning_rate | 0.0003 |
| loss | -0.0257 |
| n_updates | 60 |
| policy_gradient_loss | -0.00203 |
| std | 1.01 |
| value_loss | 0.0201 |
------------------------------------------
-------------------------------
| time/ | |
| fps | 2191 |
| iterations | 7 |
| time_elapsed | 52 |
| total_timesteps | 114688 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 2314 |
| iterations | 8 |
| time_elapsed | 56 |
| total_timesteps | 131072 |
| train/ | |
| approx_kl | 0.005258695 |
| clip_fraction | 0.0503 |
| clip_range | 0.2 |
| entropy_loss | -2.86 |
| explained_variance | 0.807 |
| learning_rate | 0.0003 |
| loss | -0.0211 |
| n_updates | 70 |
| policy_gradient_loss | -0.00398 |
| std | 1.01 |
| value_loss | 0.0164 |
-----------------------------------------
------------------------------------------
| time/ | |
| fps | 2359 |
| iterations | 9 |
| time_elapsed | 62 |
| total_timesteps | 147456 |
| train/ | |
| approx_kl | 0.0043328116 |
| clip_fraction | 0.0332 |
| clip_range | 0.2 |
| entropy_loss | -2.86 |
| explained_variance | 0.811 |
| learning_rate | 0.0003 |
| loss | -0.0259 |
| n_updates | 80 |
| policy_gradient_loss | -0.00173 |
| std | 1.01 |
| value_loss | 0.0121 |
------------------------------------------
Eval num_timesteps=150000, episode_reward=-33.97 +/- 37.15
Episode length: 1954.85 +/- 196.80
-----------------------------------------
| eval/ | |
| mean_ep_length | 1.95e+03 |
| mean_reward | -34 |
| time/ | |
| total_timesteps | 150000 |
| train/ | |
| approx_kl | 0.005169191 |
| clip_fraction | 0.0506 |
| clip_range | 0.2 |
| entropy_loss | -2.85 |
| explained_variance | 0.649 |
| learning_rate | 0.0003 |
| loss | -0.0287 |
| n_updates | 90 |
| policy_gradient_loss | -0.00384 |
| std | 1 |
| value_loss | 0.0162 |
-----------------------------------------
[Diag @ 150,000 | n_sheep=1 | success=15%]
  COMPACT_CANT_DRIVE 16/20 SUCCESS 3/20 DROVE_NO_SHEEP 1/20
  action_mag mean=0.239 p10=0.071 p90=0.433 (0=stopped, 1=full speed)
  min_flock_radius mean=0.00m best=0.00m (target <5m to compact)
  min_dog_to_com mean=4.80m best=1.70m (FLEE_DIST=7m)
  min_com_to_pen mean=10.22m best=1.50m
  reward/step (mean): progress=+0.0013 alignment=+0.0000 pen_bonus=+0.0008 step_cost=-0.0200 complete=+0.0078
-------------------------------
| time/ | |
| fps | 1935 |
| iterations | 10 |
| time_elapsed | 84 |
| total_timesteps | 163840 |
-------------------------------
------------------------------------------
| time/ | |
| fps | 2014 |
| iterations | 11 |
| time_elapsed | 89 |
| total_timesteps | 180224 |
| train/ | |
| approx_kl | 0.0039950563 |
| clip_fraction | 0.0276 |
| clip_range | 0.2 |
| entropy_loss | -2.83 |
| explained_variance | 0.623 |
| learning_rate | 0.0003 |
| loss | -0.0128 |
| n_updates | 100 |
| policy_gradient_loss | -0.00208 |
| std | 0.995 |
| value_loss | 0.0959 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 2093 |
| iterations | 12 |
| time_elapsed | 93 |
| total_timesteps | 196608 |
| train/ | |
| approx_kl | 0.0036244316 |
| clip_fraction | 0.0299 |
| clip_range | 0.2 |
| entropy_loss | -2.83 |
| explained_variance | 0.916 |
| learning_rate | 0.0003 |
| loss | -0.0251 |
| n_updates | 110 |
| policy_gradient_loss | -0.00229 |
| std | 0.991 |
| value_loss | 0.0118 |
------------------------------------------
Eval num_timesteps=200000, episode_reward=-36.37 +/- 39.41
Episode length: 1950.95 +/- 213.80
-----------------------------------------
| eval/ | |
| mean_ep_length | 1.95e+03 |
| mean_reward | -36.4 |
| time/ | |
| total_timesteps | 200000 |
| train/ | |
| approx_kl | 0.003325508 |
| clip_fraction | 0.0223 |
| clip_range | 0.2 |
| entropy_loss | -2.83 |
| explained_variance | 0.858 |
| learning_rate | 0.0003 |
| loss | -0.0279 |
| n_updates | 120 |
| policy_gradient_loss | -0.0007 |
| std | 0.999 |
| value_loss | 0.0493 |
-----------------------------------------
-------------------------------
| time/ | |
| fps | 1964 |
| iterations | 13 |
| time_elapsed | 108 |
| total_timesteps | 212992 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 2034 |
| iterations | 14 |
| time_elapsed | 112 |
| total_timesteps | 229376 |
| train/ | |
| approx_kl | 0.004660043 |
| clip_fraction | 0.0403 |
| clip_range | 0.2 |
| entropy_loss | -2.85 |
| explained_variance | 0.719 |
| learning_rate | 0.0003 |
| loss | 0.128 |
| n_updates | 130 |
| policy_gradient_loss | -0.00265 |
| std | 1.01 |
| value_loss | 0.073 |
-----------------------------------------
----------------------------------------
| time/ | |
| fps | 2103 |
| iterations | 15 |
| time_elapsed | 116 |
| total_timesteps | 245760 |
| train/ | |
| approx_kl | 0.00501227 |
| clip_fraction | 0.0499 |
| clip_range | 0.2 |
| entropy_loss | -2.88 |
| explained_variance | 0.847 |
| learning_rate | 0.0003 |
| loss | -0.0237 |
| n_updates | 140 |
| policy_gradient_loss | -0.00264 |
| std | 1.02 |
| value_loss | 0.0415 |
----------------------------------------
Eval num_timesteps=250000, episode_reward=-44.92 +/- 15.63
Episode length: 2000.00 +/- 0.00
------------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -44.9 |
| time/ | |
| total_timesteps | 250000 |
| train/ | |
| approx_kl | 0.0055294414 |
| clip_fraction | 0.06 |
| clip_range | 0.2 |
| entropy_loss | -2.89 |
| explained_variance | 0.951 |
| learning_rate | 0.0003 |
| loss | -0.0274 |
| n_updates | 150 |
| policy_gradient_loss | -0.00491 |
| std | 1.03 |
| value_loss | 0.014 |
------------------------------------------
-------------------------------
| time/ | |
| fps | 1999 |
| iterations | 16 |
| time_elapsed | 131 |
| total_timesteps | 262144 |
-------------------------------
------------------------------------------
| time/ | |
| fps | 2051 |
| iterations | 17 |
| time_elapsed | 135 |
| total_timesteps | 278528 |
| train/ | |
| approx_kl | 0.0051201656 |
| clip_fraction | 0.0301 |
| clip_range | 0.2 |
| entropy_loss | -2.88 |
| explained_variance | 0.941 |
| learning_rate | 0.0003 |
| loss | 0.148 |
| n_updates | 160 |
| policy_gradient_loss | -0.00199 |
| std | 1.02 |
| value_loss | 0.099 |
------------------------------------------
-----------------------------------------
| time/ | |
| fps | 2096 |
| iterations | 18 |
| time_elapsed | 140 |
| total_timesteps | 294912 |
| train/ | |
| approx_kl | 0.004261789 |
| clip_fraction | 0.0328 |
| clip_range | 0.2 |
| entropy_loss | -2.88 |
| explained_variance | 0.942 |
| learning_rate | 0.0003 |
| loss | -0.0314 |
| n_updates | 170 |
| policy_gradient_loss | -0.00243 |
| std | 1.02 |
| value_loss | 0.0117 |
-----------------------------------------
Eval num_timesteps=300000, episode_reward=-44.79 +/- 17.68
Episode length: 2000.00 +/- 0.00
-----------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -44.8 |
| time/ | |
| total_timesteps | 300000 |
| train/ | |
| approx_kl | 0.004783842 |
| clip_fraction | 0.0296 |
| clip_range | 0.2 |
| entropy_loss | -2.87 |
| explained_variance | 0.892 |
| learning_rate | 0.0003 |
| loss | -0.0219 |
| n_updates | 180 |
| policy_gradient_loss | -0.00159 |
| std | 1.01 |
| value_loss | 0.0497 |
-----------------------------------------
[Diag @ 300,000 | n_sheep=1 | success=0%]
  COMPACT_CANT_DRIVE 17/20 DROVE_NO_SHEEP 3/20
  action_mag mean=0.241 p10=0.109 p90=0.389 (0=stopped, 1=full speed)
  min_flock_radius mean=0.00m best=0.00m (target <5m to compact)
  min_dog_to_com mean=4.77m best=2.12m (FLEE_DIST=7m)
  min_com_to_pen mean=9.31m best=1.50m
  reward/step (mean): progress=+0.0016 alignment=+0.0000 pen_bonus=+0.0000 step_cost=-0.0200 complete=+0.0000
-------------------------------
| time/ | |
| fps | 1905 |
| iterations | 19 |
| time_elapsed | 163 |
| total_timesteps | 311296 |
-------------------------------
------------------------------------------
| time/ | |
| fps | 1949 |
| iterations | 20 |
| time_elapsed | 168 |
| total_timesteps | 327680 |
| train/ | |
| approx_kl | 0.0033368056 |
| clip_fraction | 0.0258 |
| clip_range | 0.2 |
| entropy_loss | -2.87 |
| explained_variance | 0.794 |
| learning_rate | 0.0003 |
| loss | -0.0211 |
| n_updates | 190 |
| policy_gradient_loss | -0.00105 |
| std | 1.02 |
| value_loss | 0.0769 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 1992 |
| iterations | 21 |
| time_elapsed | 172 |
| total_timesteps | 344064 |
| train/ | |
| approx_kl | 0.0046488494 |
| clip_fraction | 0.0352 |
| clip_range | 0.2 |
| entropy_loss | -2.87 |
| explained_variance | 0.927 |
| learning_rate | 0.0003 |
| loss | -0.0274 |
| n_updates | 200 |
| policy_gradient_loss | -0.00331 |
| std | 1.02 |
| value_loss | 0.0165 |
------------------------------------------
Eval num_timesteps=350000, episode_reward=-24.90 +/- 50.25
Episode length: 1976.75 +/- 82.03
------------------------------------------
| eval/ | |
| mean_ep_length | 1.98e+03 |
| mean_reward | -24.9 |
| time/ | |
| total_timesteps | 350000 |
| train/ | |
| approx_kl | 0.0041725934 |
| clip_fraction | 0.0299 |
| clip_range | 0.2 |
| entropy_loss | -2.88 |
| explained_variance | 0.944 |
| learning_rate | 0.0003 |
| loss | -0.026 |
| n_updates | 210 |
| policy_gradient_loss | -0.0026 |
| std | 1.02 |
| value_loss | 0.00665 |
------------------------------------------
New best mean reward!
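The `reward/step (mean)` lines in the [Diag] blocks above split the mean per-step reward into named components. A minimal sketch for summing them follows. It assumes the components are simply additive, which this log does not confirm because the reward function itself is not shown. `net_reward_per_step` is a hypothetical helper name, and the `name=+0.0000` format is inferred from the log lines.

```python
import re

# Hypothetical helper; the "name=+0.0000" format is inferred from the Diag lines in this log.
COMPONENT = re.compile(r"(\w+)=([+-]\d+\.\d+)")


def net_reward_per_step(line: str) -> float:
    """Sum the signed per-step components of a 'reward/step (mean):' line.

    Assumes the components are additive, which the log does not confirm.
    """
    # Only look at the text after the label so other key=value pairs are ignored.
    _, _, tail = line.partition("reward/step (mean):")
    return sum(float(value) for _name, value in COMPONENT.findall(tail))


# Values copied from the [Diag @ 300,000] block.
diag_300k = (
    "reward/step (mean): progress=+0.0016 alignment=+0.0000 "
    "pen_bonus=+0.0000 step_cost=-0.0200 complete=+0.0000"
)
print(f"{net_reward_per_step(diag_300k):+.4f}")  # -0.0184
```

Under that additivity assumption, the 300,000-step diagnostic nets about -0.0184 per step, with `step_cost` as the largest term.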
-------------------------------
| time/ | |
| fps | 1921 |
| iterations | 22 |
| time_elapsed | 187 |
| total_timesteps | 360448 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 1963 |
| iterations | 23 |
| time_elapsed | 191 |
| total_timesteps | 376832 |
| train/ | |
| approx_kl | 0.005180447 |
| clip_fraction | 0.0532 |
| clip_range | 0.2 |
| entropy_loss | -2.87 |
| explained_variance | 0.956 |
| learning_rate | 0.0003 |
| loss | -0.0255 |
| n_updates | 220 |
| policy_gradient_loss | -0.00352 |
| std | 1.02 |
| value_loss | 0.0142 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 1990 |
| iterations | 24 |
| time_elapsed | 197 |
| total_timesteps | 393216 |
| train/ | |
| approx_kl | 0.004661506 |
| clip_fraction | 0.0443 |
| clip_range | 0.2 |
| entropy_loss | -2.87 |
| explained_variance | 0.967 |
| learning_rate | 0.0003 |
| loss | -0.0331 |
| n_updates | 230 |
| policy_gradient_loss | -0.00441 |
| std | 1.02 |
| value_loss | 0.0112 |
-----------------------------------------
Eval num_timesteps=400000, episode_reward=-26.04 +/- 47.69
Episode length: 1890.85 +/- 367.20
-----------------------------------------
| eval/ | |
| mean_ep_length | 1.89e+03 |
| mean_reward | -26 |
| time/ | |
| total_timesteps | 400000 |
| train/ | |
| approx_kl | 0.005491742 |
| clip_fraction | 0.0538 |
| clip_range | 0.2 |
| entropy_loss | -2.89 |
| explained_variance | 0.941 |
| learning_rate | 0.0003 |
| loss | -0.042 |
| n_updates | 240 |
| policy_gradient_loss | -0.00297 |
| std | 1.03 |
| value_loss | 0.00877 |
-----------------------------------------
-------------------------------
| time/ | |
| fps | 1927 |
| iterations | 25 |
| time_elapsed | 212 |
| total_timesteps | 409600 |
-------------------------------
------------------------------------------
| time/ | |
| fps | 1966 |
| iterations | 26 |
| time_elapsed | 216 |
| total_timesteps | 425984 |
| train/ | |
| approx_kl | 0.0045445506 |
| clip_fraction | 0.0385 |
| clip_range | 0.2 |
| entropy_loss | -2.91 |
| explained_variance | 0.941 |
| learning_rate | 0.0003 |
| loss | -0.0343 |
| n_updates | 250 |
| policy_gradient_loss | -0.00307 |
| std | 1.04 |
| value_loss | 0.00818 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 2004 |
| iterations | 27 |
| time_elapsed | 220 |
| total_timesteps | 442368 |
| train/ | |
| approx_kl | 0.0045271795 |
| clip_fraction | 0.0373 |
| clip_range | 0.2 |
| entropy_loss | -2.94 |
| explained_variance | 0.97 |
| learning_rate | 0.0003 |
| loss | -0.0361 |
| n_updates | 260 |
| policy_gradient_loss | -0.00236 |
| std | 1.05 |
| value_loss | 0.0091 |
------------------------------------------
Eval num_timesteps=450000, episode_reward=-24.58 +/- 48.73
Episode length: 1907.85 +/- 276.46
------------------------------------------
| eval/ | |
| mean_ep_length | 1.91e+03 |
| mean_reward | -24.6 |
| time/ | |
| total_timesteps | 450000 |
| train/ | |
| approx_kl | 0.0052676853 |
| clip_fraction | 0.0498 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.948 |
| learning_rate | 0.0003 |
| loss | -0.0261 |
| n_updates | 270 |
| policy_gradient_loss | -0.00236 |
| std | 1.07 |
| value_loss | 0.0286 |
------------------------------------------
New best mean reward!
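The `[Diag @ ...]` blocks in this log carry a step count, the flock size, a success percentage, and per-outcome episode counts. A minimal sketch for pulling those fields out of a block of log text follows. `parse_diag` and both regular expressions are hypothetical and inferred only from the blocks shown here, so they may not cover every variant the training script can emit.

```python
import re
from typing import Optional

# Hypothetical parser; both patterns are inferred from the "[Diag @ ...]" blocks in this log.
HEADER = re.compile(r"\[Diag @ ([\d,]+) \| n_sheep=(\d+) \| success=(\d+)%\]")
OUTCOME = re.compile(r"\b([A-Z][A-Z_]+)\s+(\d+)/(\d+)")


def parse_diag(block: str) -> Optional[dict]:
    """Return step, flock size, success percentage and outcome counts, or None."""
    header = HEADER.search(block)
    if header is None:
        return None
    # Outcome labels such as "SUCCESS 3/20" follow the header.
    outcomes = {
        name: int(count)
        for name, count, _episodes in OUTCOME.findall(block[header.end():])
    }
    return {
        "step": int(header.group(1).replace(",", "")),
        "n_sheep": int(header.group(2)),
        "success_pct": int(header.group(3)),
        "outcomes": outcomes,
    }


# Text copied from the [Diag @ 150,000] block.
sample = (
    "[Diag @ 150,000 | n_sheep=1 | success=15%] "
    "COMPACT_CANT_DRIVE 16/20 SUCCESS 3/20 DROVE_NO_SHEEP 1/20"
)
result = parse_diag(sample)
print(result["step"], result["outcomes"])
```

For the 150,000-step block the outcome counts add up to the 20 diagnostic episodes, and 3 successes out of 20 matches the reported `success=15%`.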
[Diag @ 450,000 | n_sheep=1 | success=5%]
  COMPACT_CANT_DRIVE 18/20 DROVE_NO_SHEEP 1/20 SUCCESS 1/20
  action_mag mean=0.272 p10=0.139 p90=0.407 (0=stopped, 1=full speed)
  min_flock_radius mean=0.00m best=0.00m (target <5m to compact)
  min_dog_to_com mean=4.81m best=1.54m (FLEE_DIST=7m)
  min_com_to_pen mean=12.36m best=1.96m
  reward/step (mean): progress=+0.0012 alignment=+0.0000 pen_bonus=+0.0003 step_cost=-0.0200 complete=+0.0025
-------------------------------
| time/ | |
| fps | 1893 |
| iterations | 28 |
| time_elapsed | 242 |
| total_timesteps | 458752 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 1928 |
| iterations | 29 |
| time_elapsed | 246 |
| total_timesteps | 475136 |
| train/ | |
| approx_kl | 0.004465497 |
| clip_fraction | 0.0376 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.948 |
| learning_rate | 0.0003 |
| loss | -0.0307 |
| n_updates | 280 |
| policy_gradient_loss | -0.00259 |
| std | 1.07 |
| value_loss | 0.0213 |
-----------------------------------------
------------------------------------------
| time/ | |
| fps | 1961 |
| iterations | 30 |
| time_elapsed | 250 |
| total_timesteps | 491520 |
| train/ | |
| approx_kl | 0.0054338034 |
| clip_fraction | 0.0512 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.967 |
| learning_rate | 0.0003 |
| loss | -0.021 |
| n_updates | 290 |
| policy_gradient_loss | -0.00296 |
| std | 1.07 |
| value_loss | 0.0138 |
------------------------------------------
Eval num_timesteps=500000, episode_reward=-44.13 +/- 20.75
Episode length: 2000.00 +/- 0.00
-----------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -44.1 |
| time/ | |
| total_timesteps | 500000 |
| train/ | |
| approx_kl | 0.006292434 |
| clip_fraction | 0.0572 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.937 |
| learning_rate | 0.0003 |
| loss | -0.0398 |
| n_updates | 300 |
| policy_gradient_loss | -0.00516 |
| std | 1.07 |
| value_loss | 0.00832 |
-----------------------------------------
-------------------------------
| time/ | |
| fps | 1913 |
| iterations | 31 |
| time_elapsed | 265 |
| total_timesteps | 507904 |
-------------------------------
------------------------------------------
| time/ | |
| fps | 1940 |
| iterations | 32 |
| time_elapsed | 270 |
| total_timesteps | 524288 |
| train/ | |
| approx_kl | 0.0063960385 |
| clip_fraction | 0.0702 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.942 |
| learning_rate | 0.0003 |
| loss | -0.0341 |
| n_updates | 310 |
| policy_gradient_loss | -0.00436 |
| std | 1.06 |
| value_loss | 0.0189 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 1968 |
| iterations | 33 |
| time_elapsed | 274 |
| total_timesteps | 540672 |
| train/ | |
| approx_kl | 0.0070166546 |
| clip_fraction | 0.0888 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.955 |
| learning_rate | 0.0003 |
| loss | -0.0376 |
| n_updates | 320 |
| policy_gradient_loss | -0.00631 |
| std | 1.06 |
| value_loss | 0.00861 |
------------------------------------------
Eval num_timesteps=550000, episode_reward=-38.60 +/- 14.53
Episode length: 2000.00 +/- 0.00
------------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -38.6 |
| time/ | |
| total_timesteps | 550000 |
| train/ | |
| approx_kl | 0.0068266992 |
| clip_fraction | 0.075 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.959 |
| learning_rate | 0.0003 |
| loss | -0.0252 |
| n_updates | 330 |
| policy_gradient_loss | -0.00593 |
| std | 1.07 |
| value_loss | 0.0131 |
------------------------------------------
-------------------------------
| time/ | |
| fps | 1922 |
| iterations | 34 |
| time_elapsed | 289 |
| total_timesteps | 557056 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 1950 |
| iterations | 35 |
| time_elapsed | 294 |
| total_timesteps | 573440 |
| train/ | |
| approx_kl | 0.006152669 |
| clip_fraction | 0.0626 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.954 |
| learning_rate | 0.0003 |
| loss | -0.0376 |
| n_updates | 340 |
| policy_gradient_loss | -0.00514 |
| std | 1.07 |
| value_loss | 0.0187 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 1977 |
| iterations | 36 |
| time_elapsed | 298 |
| total_timesteps | 589824 |
| train/ | |
| approx_kl | 0.006685758 |
| clip_fraction | 0.0729 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.958 |
| learning_rate | 0.0003 |
| loss | -0.0387 |
| n_updates | 350 |
| policy_gradient_loss | -0.00632 |
| std | 1.07 |
| value_loss | 0.0118 |
-----------------------------------------
Eval num_timesteps=600000, episode_reward=-31.39 +/- 8.94
Episode length: 2000.00 +/- 0.00
-----------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -31.4 |
| time/ | |
| total_timesteps | 600000 |
| train/ | |
| approx_kl | 0.008094068 |
| clip_fraction | 0.0985 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.937 |
| learning_rate | 0.0003 |
| loss | -0.0439 |
| n_updates | 360 |
| policy_gradient_loss | -0.00782 |
| std | 1.07 |
| value_loss | 0.0116 |
-----------------------------------------
[Diag @ 600,000 | n_sheep=1 | success=5%]
  COMPACT_CANT_DRIVE 16/20 DROVE_NO_SHEEP 3/20 SUCCESS 1/20
  action_mag mean=0.150 p10=0.000 p90=0.392 (0=stopped, 1=full speed)
  min_flock_radius mean=0.00m best=0.00m (target <5m to compact)
  min_dog_to_com mean=3.64m best=0.68m (FLEE_DIST=7m)
  min_com_to_pen mean=10.60m best=1.50m
  reward/step (mean): progress=+0.0025 alignment=+0.0000 pen_bonus=+0.0003 step_cost=-0.0200 complete=+0.0026
[Curriculum] leaving stage n_sheep=1 after 600,000 steps | training success rate (last 100 eps) = 9%
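The curriculum line above reports a training success rate over the last 100 episodes. The log does not show how the training script computes that figure. One common way to keep such a rolling statistic is a fixed-size window of boolean outcomes, sketched here with a hypothetical `RollingSuccessRate` class and made-up episode outcomes.

```python
from collections import deque


class RollingSuccessRate:
    """Success rate over the most recent `window` episodes (hypothetical helper)."""

    def __init__(self, window: int = 100) -> None:
        # A deque with maxlen drops the oldest result once the window is full.
        self._results = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self._results.append(bool(success))

    def rate(self) -> float:
        # An empty window is reported as 0.0 rather than raising.
        if not self._results:
            return 0.0
        return sum(self._results) / len(self._results)


tracker = RollingSuccessRate(window=100)
for episode in range(150):
    # Made-up outcomes for illustration only: the last 9 episodes succeed.
    tracker.record(episode >= 141)
print(f"{tracker.rate():.0%}")  # 9%
```

Only the most recent 100 results count, so the first 50 made-up episodes no longer influence the rate.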
[Curriculum] → 2 sheep at step 600,000
-------------------------------
| time/ | |
| fps | 1894 |
| iterations | 37 |
| time_elapsed | 319 |
| total_timesteps | 606208 |
-------------------------------
------------------------------------------
| time/ | |
| fps | 1917 |
| iterations | 38 |
| time_elapsed | 324 |
| total_timesteps | 622592 |
| train/ | |
| approx_kl | 0.0067913756 |
| clip_fraction | 0.0689 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.861 |
| learning_rate | 0.0003 |
| loss | 0.0772 |
| n_updates | 370 |
| policy_gradient_loss | -0.00184 |
| std | 1.07 |
| value_loss | 0.101 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 1938 |
| iterations | 39 |
| time_elapsed | 329 |
| total_timesteps | 638976 |
| train/ | |
| approx_kl | 0.0061344057 |
| clip_fraction | 0.0666 |
| clip_range | 0.2 |
| entropy_loss | -2.98 |
| explained_variance | 0.928 |
| learning_rate | 0.0003 |
| loss | -0.0147 |
| n_updates | 380 |
| policy_gradient_loss | -0.00148 |
| std | 1.08 |
| value_loss | 0.0386 |
------------------------------------------
Eval num_timesteps=650000, episode_reward=-42.39 +/- 31.99
Episode length: 2000.00 +/- 0.00
------------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -42.4 |
| time/ | |
| total_timesteps | 650000 |
| train/ | |
| approx_kl | 0.0061708866 |
| clip_fraction | 0.06 |
| clip_range | 0.2 |
| entropy_loss | -2.98 |
| explained_variance | 0.918 |
| learning_rate | 0.0003 |
| loss | -0.0203 |
| n_updates | 390 |
| policy_gradient_loss | -0.00313 |
| std | 1.07 |
| value_loss | 0.0242 |
------------------------------------------
-------------------------------
| time/ | |
| fps | 1896 |
| iterations | 40 |
| time_elapsed | 345 |
| total_timesteps | 655360 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 1918 |
| iterations | 41 |
| time_elapsed | 350 |
| total_timesteps | 671744 |
| train/ | |
| approx_kl | 0.007122565 |
| clip_fraction | 0.0765 |
| clip_range | 0.2 |
| entropy_loss | -2.98 |
| explained_variance | 0.855 |
| learning_rate | 0.0003 |
| loss | -0.00749 |
| n_updates | 400 |
| policy_gradient_loss | -0.00529 |
| std | 1.07 |
| value_loss | 0.0596 |
-----------------------------------------
------------------------------------------
| time/ | |
| fps | 1941 |
| iterations | 42 |
| time_elapsed | 354 |
| total_timesteps | 688128 |
| train/ | |
| approx_kl | 0.0078532845 |
| clip_fraction | 0.0975 |
| clip_range | 0.2 |
| entropy_loss | -2.98 |
| explained_variance | 0.89 |
| learning_rate | 0.0003 |
| loss | -0.0188 |
| n_updates | 410 |
| policy_gradient_loss | -0.00699 |
| std | 1.07 |
| value_loss | 0.0207 |
------------------------------------------
Eval num_timesteps=700000, episode_reward=-39.79 +/- 29.60
Episode length: 2000.00 +/- 0.00
------------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -39.8 |
| time/ | |
| total_timesteps | 700000 |
| train/ | |
| approx_kl | 0.0073551387 |
| clip_fraction | 0.084 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.824 |
| learning_rate | 0.0003 |
| loss | 0.0126 |
| n_updates | 420 |
| policy_gradient_loss | -0.0064 |
| std | 1.06 |
| value_loss | 0.0438 |
------------------------------------------
-------------------------------
| time/ | |
| fps | 1904 |
| iterations | 43 |
| time_elapsed | 370 |
| total_timesteps | 704512 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 1922 |
| iterations | 44 |
| time_elapsed | 375 |
| total_timesteps | 720896 |
| train/ | |
| approx_kl | 0.006614036 |
| clip_fraction | 0.0611 |
| clip_range | 0.2 |
| entropy_loss | -2.95 |
| explained_variance | 0.881 |
| learning_rate | 0.0003 |
| loss | -0.0207 |
| n_updates | 430 |
| policy_gradient_loss | -0.00371 |
| std | 1.06 |
| value_loss | 0.0244 |
-----------------------------------------
------------------------------------------
| time/ | |
| fps | 1940 |
| iterations | 45 |
| time_elapsed | 380 |
| total_timesteps | 737280 |
| train/ | |
| approx_kl | 0.0060790265 |
| clip_fraction | 0.0591 |
| clip_range | 0.2 |
| entropy_loss | -2.95 |
| explained_variance | 0.885 |
| learning_rate | 0.0003 |
| loss | -0.0284 |
| n_updates | 440 |
| policy_gradient_loss | -0.00447 |
| std | 1.06 |
| value_loss | 0.0206 |
------------------------------------------
Eval num_timesteps=750000, episode_reward=-40.21 +/- 27.55
Episode length: 2000.00 +/- 0.00
------------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -40.2 |
| time/ | |
| total_timesteps | 750000 |
| train/ | |
| approx_kl | 0.0066163363 |
| clip_fraction | 0.0691 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.924 |
| learning_rate | 0.0003 |
| loss | -0.032 |
| n_updates | 450 |
| policy_gradient_loss | -0.0043 |
| std | 1.06 |
| value_loss | 0.0127 |
------------------------------------------
[Diag @ 750,000 | n_sheep=2 | success=0%]
  COMPACT_CANT_DRIVE 14/20 NEVER_COMPACT 5/20 DROVE_NO_SHEEP 1/20
  action_mag mean=0.313 p10=0.081 p90=0.638 (0=stopped, 1=full speed)
  min_flock_radius mean=2.72m best=0.00m (target <5m to compact)
  min_dog_to_com mean=3.96m best=0.02m (FLEE_DIST=7m)
  min_com_to_pen mean=12.68m best=2.17m
  reward/step (mean): progress=-0.0005 alignment=+0.0000 pen_bonus=+0.0008 step_cost=-0.0200 complete=+0.0000
-------------------------------
| time/ | |
| fps | 1866 |
| iterations | 46 |
| time_elapsed | 403 |
| total_timesteps | 753664 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 1887 |
| iterations | 47 |
| time_elapsed | 407 |
| total_timesteps | 770048 |
| train/ | |
| approx_kl | 0.005094421 |
| clip_fraction | 0.0496 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.917 |
| learning_rate | 0.0003 |
| loss | -0.0237 |
| n_updates | 460 |
| policy_gradient_loss | -0.00332 |
| std | 1.06 |
| value_loss | 0.0275 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 1906 |
| iterations | 48 |
| time_elapsed | 412 |
| total_timesteps | 786432 |
| train/ | |
| approx_kl | 0.006302662 |
| clip_fraction | 0.0571 |
| clip_range | 0.2 |
| entropy_loss | -2.94 |
| explained_variance | 0.944 |
| learning_rate | 0.0003 |
| loss | -0.0353 |
| n_updates | 470 |
| policy_gradient_loss | -0.00424 |
| std | 1.05 |
| value_loss | 0.0201 |
-----------------------------------------
Eval num_timesteps=800000, episode_reward=-31.43 +/- 45.97
Episode length: 1953.35 +/- 203.34
------------------------------------------
| eval/ | |
| mean_ep_length | 1.95e+03 |
| mean_reward | -31.4 |
| time/ | |
| total_timesteps | 800000 |
| train/ | |
| approx_kl | 0.0055750986 |
| clip_fraction | 0.0494 |
| clip_range | 0.2 |
| entropy_loss | -2.95 |
| explained_variance | 0.959 |
| learning_rate | 0.0003 |
| loss | -0.0262 |
| n_updates | 480 |
| policy_gradient_loss | -0.00386 |
| std | 1.06 |
| value_loss | 0.0218 |
------------------------------------------
-------------------------------
| time/ | |
| fps | 1878 |
| iterations | 49 |
| time_elapsed | 427 |
| total_timesteps | 802816 |
-------------------------------
------------------------------------------
| time/ | |
| fps | 1897 |
| iterations | 50 |
| time_elapsed | 431 |
| total_timesteps | 819200 |
| train/ | |
| approx_kl | 0.0057711033 |
| clip_fraction | 0.0568 |
| clip_range | 0.2 |
| entropy_loss | -2.95 |
| explained_variance | 0.838 |
| learning_rate | 0.0003 |
| loss | -0.0362 |
| n_updates | 490 |
| policy_gradient_loss | -0.00438 |
| std | 1.06 |
| value_loss | 0.00952 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 1914 |
| iterations | 51 |
| time_elapsed | 436 |
| total_timesteps | 835584 |
| train/ | |
| approx_kl | 0.0073408587 |
| clip_fraction | 0.077 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.931 |
| learning_rate | 0.0003 |
| loss | -0.0283 |
| n_updates | 500 |
| policy_gradient_loss | -0.00553 |
| std | 1.07 |
| value_loss | 0.0142 |
------------------------------------------
Eval num_timesteps=850000, episode_reward=-37.98 +/- 27.04
Episode length: 2000.00 +/- 0.00
------------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -38 |
| time/ | |
| total_timesteps | 850000 |
| train/ | |
| approx_kl | 0.0055803536 |
| clip_fraction | 0.0536 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.931 |
| learning_rate | 0.0003 |
| loss | -0.0338 |
| n_updates | 510 |
| policy_gradient_loss | -0.00469 |
| std | 1.06 |
| value_loss | 0.0156 |
------------------------------------------
-------------------------------
| time/ | |
| fps | 1884 |
| iterations | 52 |
| time_elapsed | 452 |
| total_timesteps | 851968 |
-------------------------------
----------------------------------------
| time/ | |
| fps | 1899 |
| iterations | 53 |
| time_elapsed | 457 |
| total_timesteps | 868352 |
| train/ | |
| approx_kl | 0.00585186 |
| clip_fraction | 0.0638 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.83 |
| learning_rate | 0.0003 |
| loss | -0.0333 |
| n_updates | 520 |
| policy_gradient_loss | -0.00395 |
| std | 1.07 |
| value_loss | 0.0322 |
----------------------------------------
------------------------------------------
| time/ | |
| fps | 1915 |
| iterations | 54 |
| time_elapsed | 461 |
| total_timesteps | 884736 |
| train/ | |
| approx_kl | 0.0055105407 |
| clip_fraction | 0.045 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.845 |
| learning_rate | 0.0003 |
| loss | -0.0283 |
| n_updates | 530 |
| policy_gradient_loss | -0.00367 |
| std | 1.06 |
| value_loss | 0.0109 |
------------------------------------------
Eval num_timesteps=900000, episode_reward=-41.53 +/- 35.40
Episode length: 2000.00 +/- 0.00
------------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -41.5 |
| time/ | |
| total_timesteps | 900000 |
| train/ | |
| approx_kl | 0.0064837057 |
| clip_fraction | 0.0625 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.909 |
| learning_rate | 0.0003 |
| loss | -0.0394 |
| n_updates | 540 |
| policy_gradient_loss | -0.00409 |
| std | 1.06 |
| value_loss | 0.0147 |
------------------------------------------
[Diag @ 900,000 | n_sheep=2 | success=0%]
  COMPACT_CANT_DRIVE 12/20 NEVER_COMPACT 8/20
  action_mag mean=0.276 p10=0.038 p90=0.580 (0=stopped, 1=full speed)
  min_flock_radius mean=4.30m best=0.98m (target <5m to compact)
  min_dog_to_com mean=3.24m best=0.24m (FLEE_DIST=7m)
  min_com_to_pen mean=12.15m best=5.60m
  reward/step (mean): progress=-0.0048 alignment=+0.0000 pen_bonus=+0.0000 step_cost=-0.0200 complete=+0.0000
-------------------------------
| time/ | |
| fps | 1857 |
| iterations | 55 |
| time_elapsed | 485 |
| total_timesteps | 901120 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 1874 |
| iterations | 56 |
| time_elapsed | 489 |
| total_timesteps | 917504 |
| train/ | |
| approx_kl | 0.006582682 |
| clip_fraction | 0.0662 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.961 |
| learning_rate | 0.0003 |
| loss | -0.039 |
| n_updates | 550 |
| policy_gradient_loss | -0.00462 |
| std | 1.07 |
| value_loss | 0.0103 |
-----------------------------------------
------------------------------------------
| time/ | |
| fps | 1888 |
| iterations | 57 |
| time_elapsed | 494 |
| total_timesteps | 933888 |
| train/ | |
| approx_kl | 0.0059698187 |
| clip_fraction | 0.0573 |
| clip_range | 0.2 |
| entropy_loss | -2.97 |
| explained_variance | 0.907 |
| learning_rate | 0.0003 |
| loss | -0.0291 |
| n_updates | 560 |
| policy_gradient_loss | -0.00446 |
| std | 1.07 |
| value_loss | 0.0113 |
------------------------------------------
Eval num_timesteps=950000, episode_reward=-26.73 +/- 22.82
Episode length: 2000.00 +/- 0.00
-----------------------------------------
| eval/ | |
| mean_ep_length | 2e+03 |
| mean_reward | -26.7 |
| time/ | |
| total_timesteps | 950000 |
| train/ | |
| approx_kl | 0.006601461 |
| clip_fraction | 0.0594 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.872 |
| learning_rate | 0.0003 |
| loss | -0.034 |
| n_updates | 570 |
| policy_gradient_loss | -0.00455 |
| std | 1.06 |
| value_loss | 0.00901 |
-----------------------------------------
-------------------------------
| time/ | |
| fps | 1856 |
| iterations | 58 |
| time_elapsed | 511 |
| total_timesteps | 950272 |
-------------------------------
-----------------------------------------
| time/ | |
| fps | 1869 |
| iterations | 59 |
| time_elapsed | 517 |
| total_timesteps | 966656 |
| train/ | |
| approx_kl | 0.005824944 |
| clip_fraction | 0.0624 |
| clip_range | 0.2 |
| entropy_loss | -2.96 |
| explained_variance | 0.789 |
| learning_rate | 0.0003 |
| loss | -0.0214 |
| n_updates | 580 |
| policy_gradient_loss | -0.00363 |
| std | 1.07 |
| value_loss | 0.0359 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 1882 |
| iterations | 60 |
| time_elapsed | 522 |
| total_timesteps | 983040 |
| train/ | |
| approx_kl | 0.005888001 |
| clip_fraction | 0.0573 |
| clip_range | 0.2 |
| entropy_loss | -2.98 |
| explained_variance | 0.887 |
| learning_rate | 0.0003 |
| loss | -0.0391 |
| n_updates | 590 |
| policy_gradient_loss | -0.00371 |
| std | 1.07 |
| value_loss | 0.00935 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 1895 |
| iterations | 61 |
| time_elapsed | 527 |
| total_timesteps | 999424 |
| train/ | |
| approx_kl | 0.005874036 |
| clip_fraction | 0.0611 |
| clip_range | 0.2 |
| entropy_loss | -2.98 |
|
explained_variance | 0.871 | | learning_rate | 0.0003 | | loss | -0.0246 | | n_updates | 600 | | policy_gradient_loss | -0.00492 | | std | 1.07 | | value_loss | 0.00877 | ----------------------------------------- Eval num_timesteps=1000000, episode_reward=-22.72 +/- 33.15 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -22.7 | | time/ | | | total_timesteps | 1000000 | | train/ | | | approx_kl | 0.0060388125 | | clip_fraction | 0.0637 | | clip_range | 0.2 | | entropy_loss | -2.97 | | explained_variance | 0.737 | | learning_rate | 0.0003 | | loss | -0.0511 | | n_updates | 610 | | policy_gradient_loss | -0.00387 | | std | 1.07 | | value_loss | 0.0538 | ------------------------------------------ New best mean reward! -------------------------------- | time/ | | | fps | 1869 | | iterations | 62 | | time_elapsed | 543 | | total_timesteps | 1015808 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1882 | | iterations | 63 | | time_elapsed | 548 | | total_timesteps | 1032192 | | train/ | | | approx_kl | 0.007320485 | | clip_fraction | 0.0723 | | clip_range | 0.2 | | entropy_loss | -2.99 | | explained_variance | 0.946 | | learning_rate | 0.0003 | | loss | -0.0342 | | n_updates | 620 | | policy_gradient_loss | -0.0052 | | std | 1.08 | | value_loss | 0.0174 | ----------------------------------------- ------------------------------------------ | time/ | | | fps | 1894 | | iterations | 64 | | time_elapsed | 553 | | total_timesteps | 1048576 | | train/ | | | approx_kl | 0.0066477214 | | clip_fraction | 0.0621 | | clip_range | 0.2 | | entropy_loss | -3 | | explained_variance | 0.919 | | learning_rate | 0.0003 | | loss | -0.0301 | | n_updates | 630 | | policy_gradient_loss | -0.00449 | | std | 1.08 | | value_loss | 0.0109 | ------------------------------------------ Eval num_timesteps=1050000, episode_reward=-39.86 +/- 28.77 Episode length: 2000.00 +/- 
0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -39.9 | | time/ | | | total_timesteps | 1050000 | | train/ | | | approx_kl | 0.0066243596 | | clip_fraction | 0.0772 | | clip_range | 0.2 | | entropy_loss | -2.99 | | explained_variance | 0.861 | | learning_rate | 0.0003 | | loss | -0.0313 | | n_updates | 640 | | policy_gradient_loss | -0.00462 | | std | 1.07 | | value_loss | 0.0324 | ------------------------------------------ [Diag @ 1,050,000 | n_sheep=2 | success=0%] COMPACT_CANT_DRIVE 18/20 NEVER_COMPACT 2/20 action_mag mean=0.200 p10=0.022 p90=0.478 (0=stopped, 1=full speed) min_flock_radius mean=2.29m best=0.00m (target <5m to compact) min_dog_to_com mean=3.23m best=0.05m (FLEE_DIST=7m) min_com_to_pen mean=12.84m best=3.77m reward/step (mean): progress=+0.0016 alignment=+0.0000 pen_bonus=+0.0003 step_cost=-0.0200 complete=+0.0000 -------------------------------- | time/ | | | fps | 1843 | | iterations | 65 | | time_elapsed | 577 | | total_timesteps | 1064960 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1855 | | iterations | 66 | | time_elapsed | 582 | | total_timesteps | 1081344 | | train/ | | | approx_kl | 0.0066154073 | | clip_fraction | 0.0657 | | clip_range | 0.2 | | entropy_loss | -2.99 | | explained_variance | 0.836 | | learning_rate | 0.0003 | | loss | -0.029 | | n_updates | 650 | | policy_gradient_loss | -0.0049 | | std | 1.08 | | value_loss | 0.0135 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1865 | | iterations | 67 | | time_elapsed | 588 | | total_timesteps | 1097728 | | train/ | | | approx_kl | 0.0059733046 | | clip_fraction | 0.0634 | | clip_range | 0.2 | | entropy_loss | -3.01 | | explained_variance | 0.852 | | learning_rate | 0.0003 | | loss | -0.0254 | | n_updates | 660 | | policy_gradient_loss | -0.00452 | | std | 1.09 | | value_loss | 0.0395 | 
------------------------------------------ Eval num_timesteps=1100000, episode_reward=-33.30 +/- 26.65 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -33.3 | | time/ | | | total_timesteps | 1100000 | | train/ | | | approx_kl | 0.0054050894 | | clip_fraction | 0.048 | | clip_range | 0.2 | | entropy_loss | -3.02 | | explained_variance | 0.851 | | learning_rate | 0.0003 | | loss | -0.0348 | | n_updates | 670 | | policy_gradient_loss | -0.00385 | | std | 1.1 | | value_loss | 0.0247 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1843 | | iterations | 68 | | time_elapsed | 604 | | total_timesteps | 1114112 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1856 | | iterations | 69 | | time_elapsed | 608 | | total_timesteps | 1130496 | | train/ | | | approx_kl | 0.0073612374 | | clip_fraction | 0.076 | | clip_range | 0.2 | | entropy_loss | -3.01 | | explained_variance | 0.885 | | learning_rate | 0.0003 | | loss | -0.0424 | | n_updates | 680 | | policy_gradient_loss | -0.00512 | | std | 1.09 | | value_loss | 0.0278 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1869 | | iterations | 70 | | time_elapsed | 613 | | total_timesteps | 1146880 | | train/ | | | approx_kl | 0.0063554104 | | clip_fraction | 0.067 | | clip_range | 0.2 | | entropy_loss | -3.01 | | explained_variance | 0.915 | | learning_rate | 0.0003 | | loss | -0.0302 | | n_updates | 690 | | policy_gradient_loss | -0.00577 | | std | 1.09 | | value_loss | 0.0116 | ------------------------------------------ Eval num_timesteps=1150000, episode_reward=-26.91 +/- 26.08 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -26.9 | | time/ | | | total_timesteps | 1150000 | | train/ | | | approx_kl | 
0.006060633 | | clip_fraction | 0.0603 | | clip_range | 0.2 | | entropy_loss | -3.02 | | explained_variance | 0.905 | | learning_rate | 0.0003 | | loss | -0.0374 | | n_updates | 700 | | policy_gradient_loss | -0.00442 | | std | 1.1 | | value_loss | 0.0101 | ----------------------------------------- -------------------------------- | time/ | | | fps | 1847 | | iterations | 71 | | time_elapsed | 629 | | total_timesteps | 1163264 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1859 | | iterations | 72 | | time_elapsed | 634 | | total_timesteps | 1179648 | | train/ | | | approx_kl | 0.0070389216 | | clip_fraction | 0.0728 | | clip_range | 0.2 | | entropy_loss | -3.03 | | explained_variance | 0.854 | | learning_rate | 0.0003 | | loss | -0.0409 | | n_updates | 710 | | policy_gradient_loss | -0.00505 | | std | 1.1 | | value_loss | 0.0196 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1871 | | iterations | 73 | | time_elapsed | 638 | | total_timesteps | 1196032 | | train/ | | | approx_kl | 0.0055403598 | | clip_fraction | 0.0567 | | clip_range | 0.2 | | entropy_loss | -3.03 | | explained_variance | 0.906 | | learning_rate | 0.0003 | | loss | -0.0324 | | n_updates | 720 | | policy_gradient_loss | -0.00494 | | std | 1.1 | | value_loss | 0.0109 | ------------------------------------------ Eval num_timesteps=1200000, episode_reward=-23.57 +/- 26.30 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -23.6 | | time/ | | | total_timesteps | 1200000 | | train/ | | | approx_kl | 0.0055604624 | | clip_fraction | 0.0522 | | clip_range | 0.2 | | entropy_loss | -3.02 | | explained_variance | 0.819 | | learning_rate | 0.0003 | | loss | -0.00379 | | n_updates | 730 | | policy_gradient_loss | -0.00374 | | std | 1.1 | | value_loss | 0.0453 | ------------------------------------------ [Diag @ 1,200,000 
| n_sheep=2 | success=0%] COMPACT_CANT_DRIVE 15/20 NEVER_COMPACT 4/20 DROVE_NO_SHEEP 1/20 action_mag mean=0.399 p10=0.067 p90=0.794 (0=stopped, 1=full speed) min_flock_radius mean=2.96m best=0.00m (target <5m to compact) min_dog_to_com mean=2.17m best=0.14m (FLEE_DIST=7m) min_com_to_pen mean=11.07m best=2.66m reward/step (mean): progress=+0.0064 alignment=+0.0000 pen_bonus=+0.0003 step_cost=-0.0200 complete=+0.0000 [Curriculum] leaving stage n_sheep=2 after 600,000 steps | training success rate (last 100 eps) = 0% [Curriculum] → 3 sheep at step 1,200,000 -------------------------------- | time/ | | | fps | 1828 | | iterations | 74 | | time_elapsed | 663 | | total_timesteps | 1212416 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1839 | | iterations | 75 | | time_elapsed | 668 | | total_timesteps | 1228800 | | train/ | | | approx_kl | 0.007044647 | | clip_fraction | 0.0819 | | clip_range | 0.2 | | entropy_loss | -3.02 | | explained_variance | 0.902 | | learning_rate | 0.0003 | | loss | -0.00823 | | n_updates | 740 | | policy_gradient_loss | -0.00327 | | std | 1.1 | | value_loss | 0.042 | ----------------------------------------- ------------------------------------------ | time/ | | | fps | 1849 | | iterations | 76 | | time_elapsed | 673 | | total_timesteps | 1245184 | | train/ | | | approx_kl | 0.0064169513 | | clip_fraction | 0.0699 | | clip_range | 0.2 | | entropy_loss | -3.03 | | explained_variance | 0.928 | | learning_rate | 0.0003 | | loss | -0.0323 | | n_updates | 750 | | policy_gradient_loss | -0.00459 | | std | 1.1 | | value_loss | 0.0102 | ------------------------------------------ Eval num_timesteps=1250000, episode_reward=-27.97 +/- 37.55 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -28 | | time/ | | | total_timesteps | 1250000 | | train/ | | | approx_kl | 0.006859841 | | clip_fraction | 0.0783 | | clip_range | 0.2 | | 
entropy_loss | -3.04 | | explained_variance | 0.94 | | learning_rate | 0.0003 | | loss | -0.0368 | | n_updates | 760 | | policy_gradient_loss | -0.00472 | | std | 1.11 | | value_loss | 0.00931 | ----------------------------------------- -------------------------------- | time/ | | | fps | 1825 | | iterations | 77 | | time_elapsed | 691 | | total_timesteps | 1261568 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1836 | | iterations | 78 | | time_elapsed | 696 | | total_timesteps | 1277952 | | train/ | | | approx_kl | 0.0066901552 | | clip_fraction | 0.0704 | | clip_range | 0.2 | | entropy_loss | -3.04 | | explained_variance | 0.942 | | learning_rate | 0.0003 | | loss | -0.0329 | | n_updates | 770 | | policy_gradient_loss | -0.00458 | | std | 1.11 | | value_loss | 0.00938 | ------------------------------------------ ----------------------------------------- | time/ | | | fps | 1845 | | iterations | 79 | | time_elapsed | 701 | | total_timesteps | 1294336 | | train/ | | | approx_kl | 0.007008245 | | clip_fraction | 0.082 | | clip_range | 0.2 | | entropy_loss | -3.03 | | explained_variance | 0.899 | | learning_rate | 0.0003 | | loss | -0.0194 | | n_updates | 780 | | policy_gradient_loss | -0.00426 | | std | 1.1 | | value_loss | 0.052 | ----------------------------------------- Eval num_timesteps=1300000, episode_reward=-41.12 +/- 37.68 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -41.1 | | time/ | | | total_timesteps | 1300000 | | train/ | | | approx_kl | 0.0070775724 | | clip_fraction | 0.0742 | | clip_range | 0.2 | | entropy_loss | -3.03 | | explained_variance | 0.942 | | learning_rate | 0.0003 | | loss | -0.0238 | | n_updates | 790 | | policy_gradient_loss | -0.0052 | | std | 1.11 | | value_loss | 0.00657 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1823 | | iterations | 80 | | 
time_elapsed | 718 | | total_timesteps | 1310720 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1832 | | iterations | 81 | | time_elapsed | 724 | | total_timesteps | 1327104 | | train/ | | | approx_kl | 0.008046751 | | clip_fraction | 0.0851 | | clip_range | 0.2 | | entropy_loss | -3.04 | | explained_variance | 0.897 | | learning_rate | 0.0003 | | loss | -0.0384 | | n_updates | 800 | | policy_gradient_loss | -0.0057 | | std | 1.11 | | value_loss | 0.009 | ----------------------------------------- ----------------------------------------- | time/ | | | fps | 1840 | | iterations | 82 | | time_elapsed | 730 | | total_timesteps | 1343488 | | train/ | | | approx_kl | 0.006007643 | | clip_fraction | 0.0548 | | clip_range | 0.2 | | entropy_loss | -3.06 | | explained_variance | 0.871 | | learning_rate | 0.0003 | | loss | -0.0251 | | n_updates | 810 | | policy_gradient_loss | -0.00416 | | std | 1.12 | | value_loss | 0.0179 | ----------------------------------------- Eval num_timesteps=1350000, episode_reward=-24.46 +/- 41.24 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -24.5 | | time/ | | | total_timesteps | 1350000 | | train/ | | | approx_kl | 0.0065572546 | | clip_fraction | 0.0698 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.877 | | learning_rate | 0.0003 | | loss | -0.0219 | | n_updates | 820 | | policy_gradient_loss | -0.00456 | | std | 1.13 | | value_loss | 0.0242 | ------------------------------------------ [Diag @ 1,350,000 | n_sheep=3 | success=0%] NEVER_COMPACT 14/20 COMPACT_CANT_DRIVE 6/20 action_mag mean=0.195 p10=0.018 p90=0.576 (0=stopped, 1=full speed) min_flock_radius mean=6.32m best=1.36m (target <5m to compact) min_dog_to_com mean=4.15m best=0.61m (FLEE_DIST=7m) min_com_to_pen mean=11.37m best=4.88m reward/step (mean): progress=+0.0029 alignment=+0.0000 pen_bonus=+0.0000 step_cost=-0.0200 
complete=+0.0000 -------------------------------- | time/ | | | fps | 1798 | | iterations | 83 | | time_elapsed | 756 | | total_timesteps | 1359872 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1809 | | iterations | 84 | | time_elapsed | 760 | | total_timesteps | 1376256 | | train/ | | | approx_kl | 0.0072198315 | | clip_fraction | 0.0764 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.909 | | learning_rate | 0.0003 | | loss | -0.0208 | | n_updates | 830 | | policy_gradient_loss | -0.00626 | | std | 1.13 | | value_loss | 0.0106 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1817 | | iterations | 85 | | time_elapsed | 766 | | total_timesteps | 1392640 | | train/ | | | approx_kl | 0.0070813587 | | clip_fraction | 0.0733 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.907 | | learning_rate | 0.0003 | | loss | -0.0324 | | n_updates | 840 | | policy_gradient_loss | -0.00505 | | std | 1.13 | | value_loss | 0.0166 | ------------------------------------------ Eval num_timesteps=1400000, episode_reward=-36.32 +/- 33.15 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -36.3 | | time/ | | | total_timesteps | 1400000 | | train/ | | | approx_kl | 0.0067584305 | | clip_fraction | 0.08 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.906 | | learning_rate | 0.0003 | | loss | -0.0308 | | n_updates | 850 | | policy_gradient_loss | -0.0054 | | std | 1.13 | | value_loss | 0.0112 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1798 | | iterations | 86 | | time_elapsed | 783 | | total_timesteps | 1409024 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1807 | | iterations | 87 | | time_elapsed | 788 | | total_timesteps | 
1425408 | | train/ | | | approx_kl | 0.007411341 | | clip_fraction | 0.0716 | | clip_range | 0.2 | | entropy_loss | -3.09 | | explained_variance | 0.904 | | learning_rate | 0.0003 | | loss | -0.0322 | | n_updates | 860 | | policy_gradient_loss | -0.00641 | | std | 1.14 | | value_loss | 0.0191 | ----------------------------------------- ------------------------------------------ | time/ | | | fps | 1815 | | iterations | 88 | | time_elapsed | 794 | | total_timesteps | 1441792 | | train/ | | | approx_kl | 0.0077011855 | | clip_fraction | 0.0774 | | clip_range | 0.2 | | entropy_loss | -3.09 | | explained_variance | 0.914 | | learning_rate | 0.0003 | | loss | -0.0316 | | n_updates | 870 | | policy_gradient_loss | -0.00545 | | std | 1.13 | | value_loss | 0.0148 | ------------------------------------------ Eval num_timesteps=1450000, episode_reward=-40.58 +/- 38.17 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -40.6 | | time/ | | | total_timesteps | 1450000 | | train/ | | | approx_kl | 0.007694071 | | clip_fraction | 0.0816 | | clip_range | 0.2 | | entropy_loss | -3.07 | | explained_variance | 0.937 | | learning_rate | 0.0003 | | loss | -0.036 | | n_updates | 880 | | policy_gradient_loss | -0.0054 | | std | 1.12 | | value_loss | 0.0111 | ----------------------------------------- -------------------------------- | time/ | | | fps | 1796 | | iterations | 89 | | time_elapsed | 811 | | total_timesteps | 1458176 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1805 | | iterations | 90 | | time_elapsed | 816 | | total_timesteps | 1474560 | | train/ | | | approx_kl | 0.007034345 | | clip_fraction | 0.0693 | | clip_range | 0.2 | | entropy_loss | -3.07 | | explained_variance | 0.924 | | learning_rate | 0.0003 | | loss | 0.0472 | | n_updates | 890 | | policy_gradient_loss | -0.00472 | | std | 1.13 | | value_loss | 0.0352 | 
----------------------------------------- ------------------------------------------ | time/ | | | fps | 1815 | | iterations | 91 | | time_elapsed | 821 | | total_timesteps | 1490944 | | train/ | | | approx_kl | 0.0078114523 | | clip_fraction | 0.0917 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.942 | | learning_rate | 0.0003 | | loss | -0.0461 | | n_updates | 900 | | policy_gradient_loss | -0.00668 | | std | 1.13 | | value_loss | 0.00844 | ------------------------------------------ Eval num_timesteps=1500000, episode_reward=-19.66 +/- 25.98 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -19.7 | | time/ | | | total_timesteps | 1500000 | | train/ | | | approx_kl | 0.0067999987 | | clip_fraction | 0.0606 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.893 | | learning_rate | 0.0003 | | loss | -0.0283 | | n_updates | 910 | | policy_gradient_loss | -0.00385 | | std | 1.12 | | value_loss | 0.0409 | ------------------------------------------ New best mean reward! 
[Diag @ 1,500,000 | n_sheep=3 | success=0%] COMPACT_CANT_DRIVE 11/20 NEVER_COMPACT 7/20 DROVE_NO_SHEEP 2/20 action_mag mean=0.185 p10=0.015 p90=0.426 (0=stopped, 1=full speed) min_flock_radius mean=4.43m best=1.38m (target <5m to compact) min_dog_to_com mean=2.89m best=0.07m (FLEE_DIST=7m) min_com_to_pen mean=11.88m best=2.23m reward/step (mean): progress=+0.0008 alignment=+0.0000 pen_bonus=+0.0003 step_cost=-0.0200 complete=+0.0000 -------------------------------- | time/ | | | fps | 1781 | | iterations | 92 | | time_elapsed | 846 | | total_timesteps | 1507328 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1789 | | iterations | 93 | | time_elapsed | 851 | | total_timesteps | 1523712 | | train/ | | | approx_kl | 0.0069550863 | | clip_fraction | 0.0787 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.897 | | learning_rate | 0.0003 | | loss | -0.0204 | | n_updates | 920 | | policy_gradient_loss | -0.00394 | | std | 1.13 | | value_loss | 0.0324 | ------------------------------------------ ----------------------------------------- | time/ | | | fps | 1798 | | iterations | 94 | | time_elapsed | 856 | | total_timesteps | 1540096 | | train/ | | | approx_kl | 0.006749108 | | clip_fraction | 0.0787 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.929 | | learning_rate | 0.0003 | | loss | -0.0338 | | n_updates | 930 | | policy_gradient_loss | -0.00534 | | std | 1.13 | | value_loss | 0.00967 | ----------------------------------------- Eval num_timesteps=1550000, episode_reward=-26.47 +/- 25.94 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -26.5 | | time/ | | | total_timesteps | 1550000 | | train/ | | | approx_kl | 0.0073381998 | | clip_fraction | 0.0679 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.919 | | learning_rate | 0.0003 | | loss | -0.0259 | | n_updates | 940 | 
| policy_gradient_loss | -0.00554 | | std | 1.13 | | value_loss | 0.00999 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1782 | | iterations | 95 | | time_elapsed | 873 | | total_timesteps | 1556480 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1790 | | iterations | 96 | | time_elapsed | 878 | | total_timesteps | 1572864 | | train/ | | | approx_kl | 0.0071112993 | | clip_fraction | 0.0781 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.929 | | learning_rate | 0.0003 | | loss | -0.0324 | | n_updates | 950 | | policy_gradient_loss | -0.00428 | | std | 1.13 | | value_loss | 0.0246 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1798 | | iterations | 97 | | time_elapsed | 883 | | total_timesteps | 1589248 | | train/ | | | approx_kl | 0.0077134473 | | clip_fraction | 0.0784 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.917 | | learning_rate | 0.0003 | | loss | -0.0365 | | n_updates | 960 | | policy_gradient_loss | -0.00445 | | std | 1.13 | | value_loss | 0.0122 | ------------------------------------------ Eval num_timesteps=1600000, episode_reward=-35.13 +/- 31.01 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -35.1 | | time/ | | | total_timesteps | 1600000 | | train/ | | | approx_kl | 0.0070123896 | | clip_fraction | 0.0712 | | clip_range | 0.2 | | entropy_loss | -3.07 | | explained_variance | 0.919 | | learning_rate | 0.0003 | | loss | -0.026 | | n_updates | 970 | | policy_gradient_loss | -0.00519 | | std | 1.13 | | value_loss | 0.0171 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1781 | | iterations | 98 | | time_elapsed | 901 | | total_timesteps | 1605632 | -------------------------------- 
----------------------------------------- | time/ | | | fps | 1789 | | iterations | 99 | | time_elapsed | 906 | | total_timesteps | 1622016 | | train/ | | | approx_kl | 0.007990176 | | clip_fraction | 0.0845 | | clip_range | 0.2 | | entropy_loss | -3.07 | | explained_variance | 0.873 | | learning_rate | 0.0003 | | loss | -0.04 | | n_updates | 980 | | policy_gradient_loss | -0.0045 | | std | 1.13 | | value_loss | 0.0153 | ----------------------------------------- ----------------------------------------- | time/ | | | fps | 1798 | | iterations | 100 | | time_elapsed | 911 | | total_timesteps | 1638400 | | train/ | | | approx_kl | 0.006477687 | | clip_fraction | 0.0593 | | clip_range | 0.2 | | entropy_loss | -3.07 | | explained_variance | 0.946 | | learning_rate | 0.0003 | | loss | -0.0396 | | n_updates | 990 | | policy_gradient_loss | -0.00442 | | std | 1.13 | | value_loss | 0.0107 | ----------------------------------------- Eval num_timesteps=1650000, episode_reward=-31.86 +/- 47.05 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -31.9 | | time/ | | | total_timesteps | 1650000 | | train/ | | | approx_kl | 0.006796476 | | clip_fraction | 0.0672 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.929 | | learning_rate | 0.0003 | | loss | -0.0264 | | n_updates | 1000 | | policy_gradient_loss | -0.00375 | | std | 1.13 | | value_loss | 0.0385 | ----------------------------------------- [Diag @ 1,650,000 | n_sheep=3 | success=0%] NEVER_COMPACT 11/20 COMPACT_CANT_DRIVE 9/20 action_mag mean=0.154 p10=0.005 p90=0.398 (0=stopped, 1=full speed) min_flock_radius mean=5.81m best=0.00m (target <5m to compact) min_dog_to_com mean=3.22m best=0.52m (FLEE_DIST=7m) min_com_to_pen mean=13.42m best=7.08m reward/step (mean): progress=+0.0061 alignment=+0.0000 pen_bonus=+0.0010 step_cost=-0.0200 complete=+0.0000 -------------------------------- | time/ | | | fps | 1768 | | iterations | 
101 | | time_elapsed | 935 | | total_timesteps | 1654784 | -------------------------------- ---------------------------------------- | time/ | | | fps | 1774 | | iterations | 102 | | time_elapsed | 941 | | total_timesteps | 1671168 | | train/ | | | approx_kl | 0.00682881 | | clip_fraction | 0.0694 | | clip_range | 0.2 | | entropy_loss | -3.08 | | explained_variance | 0.939 | | learning_rate | 0.0003 | | loss | -0.0233 | | n_updates | 1010 | | policy_gradient_loss | -0.00461 | | std | 1.13 | | value_loss | 0.0183 | ---------------------------------------- ------------------------------------------ | time/ | | | fps | 1779 | | iterations | 103 | | time_elapsed | 948 | | total_timesteps | 1687552 | | train/ | | | approx_kl | 0.0071003223 | | clip_fraction | 0.0782 | | clip_range | 0.2 | | entropy_loss | -3.1 | | explained_variance | 0.923 | | learning_rate | 0.0003 | | loss | -0.0398 | | n_updates | 1020 | | policy_gradient_loss | -0.00491 | | std | 1.15 | | value_loss | 0.0101 | ------------------------------------------ Eval num_timesteps=1700000, episode_reward=-32.11 +/- 36.59 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -32.1 | | time/ | | | total_timesteps | 1700000 | | train/ | | | approx_kl | 0.0064870613 | | clip_fraction | 0.0624 | | clip_range | 0.2 | | entropy_loss | -3.13 | | explained_variance | 0.909 | | learning_rate | 0.0003 | | loss | -0.0365 | | n_updates | 1030 | | policy_gradient_loss | -0.00404 | | std | 1.17 | | value_loss | 0.00855 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1762 | | iterations | 104 | | time_elapsed | 966 | | total_timesteps | 1703936 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1769 | | iterations | 105 | | time_elapsed | 972 | | total_timesteps | 1720320 | | train/ | | | approx_kl | 0.007349294 | | clip_fraction | 0.0833 | | clip_range 
| 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.926 | | learning_rate | 0.0003 | | loss | -0.0358 | | n_updates | 1040 | | policy_gradient_loss | -0.00514 | | std | 1.17 | | value_loss | 0.00848 | ----------------------------------------- ------------------------------------------ | time/ | | | fps | 1777 | | iterations | 106 | | time_elapsed | 976 | | total_timesteps | 1736704 | | train/ | | | approx_kl | 0.0070306472 | | clip_fraction | 0.0814 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.887 | | learning_rate | 0.0003 | | loss | -0.0359 | | n_updates | 1050 | | policy_gradient_loss | -0.00489 | | std | 1.17 | | value_loss | 0.0134 | ------------------------------------------ Eval num_timesteps=1750000, episode_reward=-34.24 +/- 43.23 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -34.2 | | time/ | | | total_timesteps | 1750000 | | train/ | | | approx_kl | 0.008487761 | | clip_fraction | 0.102 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.962 | | learning_rate | 0.0003 | | loss | -0.0369 | | n_updates | 1060 | | policy_gradient_loss | -0.0077 | | std | 1.17 | | value_loss | 0.00786 | ----------------------------------------- -------------------------------- | time/ | | | fps | 1762 | | iterations | 107 | | time_elapsed | 994 | | total_timesteps | 1753088 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1766 | | iterations | 108 | | time_elapsed | 1001 | | total_timesteps | 1769472 | | train/ | | | approx_kl | 0.0074267983 | | clip_fraction | 0.0742 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.939 | | learning_rate | 0.0003 | | loss | -0.0404 | | n_updates | 1070 | | policy_gradient_loss | -0.00575 | | std | 1.18 | | value_loss | 0.0158 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 
1772 | | iterations | 109 | | time_elapsed | 1007 | | total_timesteps | 1785856 | | train/ | | | approx_kl | 0.0075380025 | | clip_fraction | 0.074 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.961 | | learning_rate | 0.0003 | | loss | -0.034 | | n_updates | 1080 | | policy_gradient_loss | -0.00553 | | std | 1.17 | | value_loss | 0.00651 | ------------------------------------------ Eval num_timesteps=1800000, episode_reward=-31.16 +/- 37.32 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -31.2 | | time/ | | | total_timesteps | 1800000 | | train/ | | | approx_kl | 0.007386248 | | clip_fraction | 0.0843 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.922 | | learning_rate | 0.0003 | | loss | -0.0419 | | n_updates | 1090 | | policy_gradient_loss | -0.00596 | | std | 1.17 | | value_loss | 0.00858 | ----------------------------------------- [Diag @ 1,800,000 | n_sheep=3 | success=0%] NEVER_COMPACT 17/20 COMPACT_CANT_DRIVE 3/20 action_mag mean=0.164 p10=0.007 p90=0.418 (0=stopped, 1=full speed) min_flock_radius mean=7.52m best=2.00m (target <5m to compact) min_dog_to_com mean=2.24m best=0.21m (FLEE_DIST=7m) min_com_to_pen mean=12.87m best=3.90m reward/step (mean): progress=-0.0007 alignment=+0.0000 pen_bonus=+0.0005 step_cost=-0.0200 complete=+0.0000 [Curriculum] leaving stage n_sheep=3 after 600,000 steps | training success rate (last 100 eps) = 0% [Curriculum] → 4 sheep at step 1,800,000 -------------------------------- | time/ | | | fps | 1743 | | iterations | 110 | | time_elapsed | 1033 | | total_timesteps | 1802240 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1749 | | iterations | 111 | | time_elapsed | 1039 | | total_timesteps | 1818624 | | train/ | | | approx_kl | 0.009158293 | | clip_fraction | 0.0991 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.893 | | 
learning_rate | 0.0003 | | loss | -0.0414 | | n_updates | 1100 | | policy_gradient_loss | -0.00701 | | std | 1.17 | | value_loss | 0.0237 | ----------------------------------------- ----------------------------------------- | time/ | | | fps | 1755 | | iterations | 112 | | time_elapsed | 1045 | | total_timesteps | 1835008 | | train/ | | | approx_kl | 0.007241189 | | clip_fraction | 0.0831 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.874 | | learning_rate | 0.0003 | | loss | -0.0241 | | n_updates | 1110 | | policy_gradient_loss | -0.00634 | | std | 1.17 | | value_loss | 0.0226 | ----------------------------------------- Eval num_timesteps=1850000, episode_reward=-29.45 +/- 31.10 Episode length: 2000.00 +/- 0.00 --------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -29.5 | | time/ | | | total_timesteps | 1850000 | | train/ | | | approx_kl | 0.0078688 | | clip_fraction | 0.0777 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.895 | | learning_rate | 0.0003 | | loss | -0.036 | | n_updates | 1120 | | policy_gradient_loss | -0.00602 | | std | 1.17 | | value_loss | 0.0128 | --------------------------------------- -------------------------------- | time/ | | | fps | 1742 | | iterations | 113 | | time_elapsed | 1062 | | total_timesteps | 1851392 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1749 | | iterations | 114 | | time_elapsed | 1067 | | total_timesteps | 1867776 | | train/ | | | approx_kl | 0.008158936 | | clip_fraction | 0.0963 | | clip_range | 0.2 | | entropy_loss | -3.14 | | explained_variance | 0.897 | | learning_rate | 0.0003 | | loss | -0.0324 | | n_updates | 1130 | | policy_gradient_loss | -0.00854 | | std | 1.17 | | value_loss | 0.0144 | ----------------------------------------- ------------------------------------------ | time/ | | | fps | 1754 | | iterations | 115 | | time_elapsed | 1073 | | total_timesteps | 
1884160 | | train/ | | | approx_kl | 0.0074978825 | | clip_fraction | 0.0844 | | clip_range | 0.2 | | entropy_loss | -3.14 | | explained_variance | 0.92 | | learning_rate | 0.0003 | | loss | -0.0246 | | n_updates | 1140 | | policy_gradient_loss | -0.00578 | | std | 1.16 | | value_loss | 0.0134 | ------------------------------------------ Eval num_timesteps=1900000, episode_reward=-38.21 +/- 31.08 Episode length: 2000.00 +/- 0.00 ---------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -38.2 | | time/ | | | total_timesteps | 1900000 | | train/ | | | approx_kl | 0.00678163 | | clip_fraction | 0.0711 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.892 | | learning_rate | 0.0003 | | loss | -0.0345 | | n_updates | 1150 | | policy_gradient_loss | -0.00409 | | std | 1.18 | | value_loss | 0.0221 | ---------------------------------------- -------------------------------- | time/ | | | fps | 1740 | | iterations | 116 | | time_elapsed | 1091 | | total_timesteps | 1900544 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1746 | | iterations | 117 | | time_elapsed | 1097 | | total_timesteps | 1916928 | | train/ | | | approx_kl | 0.006992462 | | clip_fraction | 0.0731 | | clip_range | 0.2 | | entropy_loss | -3.16 | | explained_variance | 0.895 | | learning_rate | 0.0003 | | loss | -0.0243 | | n_updates | 1160 | | policy_gradient_loss | -0.00588 | | std | 1.18 | | value_loss | 0.0145 | ----------------------------------------- ------------------------------------------ | time/ | | | fps | 1750 | | iterations | 118 | | time_elapsed | 1104 | | total_timesteps | 1933312 | | train/ | | | approx_kl | 0.0069225584 | | clip_fraction | 0.068 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.905 | | learning_rate | 0.0003 | | loss | -0.0297 | | n_updates | 1170 | | policy_gradient_loss | -0.00516 | | std | 1.17 | | value_loss | 0.0153 | 
------------------------------------------ ----------------------------------------- | time/ | | | fps | 1756 | | iterations | 119 | | time_elapsed | 1109 | | total_timesteps | 1949696 | | train/ | | | approx_kl | 0.005966103 | | clip_fraction | 0.059 | | clip_range | 0.2 | | entropy_loss | -3.15 | | explained_variance | 0.896 | | learning_rate | 0.0003 | | loss | -0.0337 | | n_updates | 1180 | | policy_gradient_loss | -0.00413 | | std | 1.17 | | value_loss | 0.0091 | ----------------------------------------- Eval num_timesteps=1950000, episode_reward=-59.72 +/- 38.15 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -59.7 | | time/ | | | total_timesteps | 1950000 | | train/ | | | approx_kl | 0.0067311125 | | clip_fraction | 0.0733 | | clip_range | 0.2 | | entropy_loss | -3.16 | | explained_variance | 0.861 | | learning_rate | 0.0003 | | loss | -0.0147 | | n_updates | 1190 | | policy_gradient_loss | -0.00459 | | std | 1.18 | | value_loss | 0.0083 | ------------------------------------------ [Diag @ 1,950,000 | n_sheep=4 | success=0%] NEVER_COMPACT 14/20 COMPACT_CANT_DRIVE 6/20 action_mag mean=0.325 p10=0.025 p90=0.778 (0=stopped, 1=full speed) min_flock_radius mean=7.27m best=2.17m (target <5m to compact) min_dog_to_com mean=3.74m best=0.07m (FLEE_DIST=7m) min_com_to_pen mean=13.01m best=6.24m reward/step (mean): progress=+0.0026 alignment=+0.0000 pen_bonus=+0.0005 step_cost=-0.0200 complete=+0.0000 -------------------------------- | time/ | | | fps | 1728 | | iterations | 120 | | time_elapsed | 1137 | | total_timesteps | 1966080 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1734 | | iterations | 121 | | time_elapsed | 1143 | | total_timesteps | 1982464 | | train/ | | | approx_kl | 0.0061555626 | | clip_fraction | 0.0631 | | clip_range | 0.2 | | entropy_loss | -3.17 | | explained_variance | 0.932 | | learning_rate | 0.0003 | | loss | 
-0.0328 | | n_updates | 1200 | | policy_gradient_loss | -0.00446 | | std | 1.19 | | value_loss | 0.0133 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1739 | | iterations | 122 | | time_elapsed | 1149 | | total_timesteps | 1998848 | | train/ | | | approx_kl | 0.0060347347 | | clip_fraction | 0.057 | | clip_range | 0.2 | | entropy_loss | -3.18 | | explained_variance | 0.841 | | learning_rate | 0.0003 | | loss | -0.0352 | | n_updates | 1210 | | policy_gradient_loss | -0.00322 | | std | 1.19 | | value_loss | 0.0104 | ------------------------------------------ Eval num_timesteps=2000000, episode_reward=-37.97 +/- 46.26 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -38 | | time/ | | | total_timesteps | 2000000 | | train/ | | | approx_kl | 0.0063244104 | | clip_fraction | 0.0675 | | clip_range | 0.2 | | entropy_loss | -3.18 | | explained_variance | 0.865 | | learning_rate | 0.0003 | | loss | -0.0217 | | n_updates | 1220 | | policy_gradient_loss | -0.00489 | | std | 1.2 | | value_loss | 0.0219 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1725 | | iterations | 123 | | time_elapsed | 1167 | | total_timesteps | 2015232 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1730 | | iterations | 124 | | time_elapsed | 1173 | | total_timesteps | 2031616 | | train/ | | | approx_kl | 0.007022621 | | clip_fraction | 0.0816 | | clip_range | 0.2 | | entropy_loss | -3.19 | | explained_variance | 0.949 | | learning_rate | 0.0003 | | loss | -0.0248 | | n_updates | 1230 | | policy_gradient_loss | -0.0053 | | std | 1.19 | | value_loss | 0.00677 | ----------------------------------------- ----------------------------------------- | time/ | | | fps | 1735 | | iterations | 125 | | time_elapsed | 1179 | | total_timesteps | 2048000 | | train/ | | | 
approx_kl | 0.006686856 | | clip_fraction | 0.0653 | | clip_range | 0.2 | | entropy_loss | -3.18 | | explained_variance | 0.928 | | learning_rate | 0.0003 | | loss | -0.0333 | | n_updates | 1240 | | policy_gradient_loss | -0.00445 | | std | 1.19 | | value_loss | 0.00651 | ----------------------------------------- Eval num_timesteps=2050000, episode_reward=-27.67 +/- 36.42 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -27.7 | | time/ | | | total_timesteps | 2050000 | | train/ | | | approx_kl | 0.006721792 | | clip_fraction | 0.0675 | | clip_range | 0.2 | | entropy_loss | -3.2 | | explained_variance | 0.921 | | learning_rate | 0.0003 | | loss | -0.0278 | | n_updates | 1250 | | policy_gradient_loss | -0.00408 | | std | 1.21 | | value_loss | 0.00793 | ----------------------------------------- -------------------------------- | time/ | | | fps | 1721 | | iterations | 126 | | time_elapsed | 1198 | | total_timesteps | 2064384 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1726 | | iterations | 127 | | time_elapsed | 1205 | | total_timesteps | 2080768 | | train/ | | | approx_kl | 0.006730888 | | clip_fraction | 0.0617 | | clip_range | 0.2 | | entropy_loss | -3.23 | | explained_variance | 0.911 | | learning_rate | 0.0003 | | loss | -0.0276 | | n_updates | 1260 | | policy_gradient_loss | -0.00378 | | std | 1.22 | | value_loss | 0.00964 | ----------------------------------------- ----------------------------------------- | time/ | | | fps | 1732 | | iterations | 128 | | time_elapsed | 1210 | | total_timesteps | 2097152 | | train/ | | | approx_kl | 0.007725292 | | clip_fraction | 0.0775 | | clip_range | 0.2 | | entropy_loss | -3.23 | | explained_variance | 0.913 | | learning_rate | 0.0003 | | loss | -0.0371 | | n_updates | 1270 | | policy_gradient_loss | -0.006 | | std | 1.22 | | value_loss | 0.0109 | ----------------------------------------- 
Eval num_timesteps=2100000, episode_reward=-40.56 +/- 44.37 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -40.6 | | time/ | | | total_timesteps | 2100000 | | train/ | | | approx_kl | 0.0067186276 | | clip_fraction | 0.0644 | | clip_range | 0.2 | | entropy_loss | -3.24 | | explained_variance | 0.845 | | learning_rate | 0.0003 | | loss | -0.0357 | | n_updates | 1280 | | policy_gradient_loss | -0.00433 | | std | 1.23 | | value_loss | 0.0263 | ------------------------------------------ [Diag @ 2,100,000 | n_sheep=4 | success=0%] NEVER_COMPACT 12/20 COMPACT_CANT_DRIVE 8/20 action_mag mean=0.384 p10=0.018 p90=0.884 (0=stopped, 1=full speed) min_flock_radius mean=6.36m best=2.11m (target <5m to compact) min_dog_to_com mean=2.94m best=0.40m (FLEE_DIST=7m) min_com_to_pen mean=12.34m best=5.56m reward/step (mean): progress=-0.0084 alignment=+0.0000 pen_bonus=+0.0003 step_cost=-0.0200 complete=+0.0000 -------------------------------- | time/ | | | fps | 1706 | | iterations | 129 | | time_elapsed | 1238 | | total_timesteps | 2113536 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1712 | | iterations | 130 | | time_elapsed | 1243 | | total_timesteps | 2129920 | | train/ | | | approx_kl | 0.006317258 | | clip_fraction | 0.0623 | | clip_range | 0.2 | | entropy_loss | -3.26 | | explained_variance | 0.912 | | learning_rate | 0.0003 | | loss | -0.0419 | | n_updates | 1290 | | policy_gradient_loss | -0.00427 | | std | 1.24 | | value_loss | 0.00859 | ----------------------------------------- ---------------------------------------- | time/ | | | fps | 1716 | | iterations | 131 | | time_elapsed | 1250 | | total_timesteps | 2146304 | | train/ | | | approx_kl | 0.00636432 | | clip_fraction | 0.0698 | | clip_range | 0.2 | | entropy_loss | -3.28 | | explained_variance | 0.851 | | learning_rate | 0.0003 | | loss | -0.0266 | | n_updates | 1300 | | 
policy_gradient_loss | -0.00374 | | std | 1.25 | | value_loss | 0.0299 | ---------------------------------------- Eval num_timesteps=2150000, episode_reward=-63.32 +/- 33.74 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -63.3 | | time/ | | | total_timesteps | 2150000 | | train/ | | | approx_kl | 0.0060345423 | | clip_fraction | 0.0563 | | clip_range | 0.2 | | entropy_loss | -3.27 | | explained_variance | 0.898 | | learning_rate | 0.0003 | | loss | -0.0404 | | n_updates | 1310 | | policy_gradient_loss | -0.00356 | | std | 1.24 | | value_loss | 0.0205 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1704 | | iterations | 132 | | time_elapsed | 1268 | | total_timesteps | 2162688 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1709 | | iterations | 133 | | time_elapsed | 1274 | | total_timesteps | 2179072 | | train/ | | | approx_kl | 0.007027424 | | clip_fraction | 0.0693 | | clip_range | 0.2 | | entropy_loss | -3.25 | | explained_variance | 0.9 | | learning_rate | 0.0003 | | loss | -0.0315 | | n_updates | 1320 | | policy_gradient_loss | -0.00521 | | std | 1.23 | | value_loss | 0.0194 | ----------------------------------------- ----------------------------------------- | time/ | | | fps | 1715 | | iterations | 134 | | time_elapsed | 1279 | | total_timesteps | 2195456 | | train/ | | | approx_kl | 0.006112649 | | clip_fraction | 0.0635 | | clip_range | 0.2 | | entropy_loss | -3.24 | | explained_variance | 0.957 | | learning_rate | 0.0003 | | loss | -0.0339 | | n_updates | 1330 | | policy_gradient_loss | -0.00383 | | std | 1.23 | | value_loss | 0.00861 | ----------------------------------------- Eval num_timesteps=2200000, episode_reward=-31.28 +/- 44.80 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -31.3 | | 
time/ | | | total_timesteps | 2200000 | | train/ | | | approx_kl | 0.0070182728 | | clip_fraction | 0.076 | | clip_range | 0.2 | | entropy_loss | -3.26 | | explained_variance | 0.883 | | learning_rate | 0.0003 | | loss | -0.0412 | | n_updates | 1340 | | policy_gradient_loss | -0.00534 | | std | 1.25 | | value_loss | 0.013 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1704 | | iterations | 135 | | time_elapsed | 1297 | | total_timesteps | 2211840 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1708 | | iterations | 136 | | time_elapsed | 1304 | | total_timesteps | 2228224 | | train/ | | | approx_kl | 0.0062820893 | | clip_fraction | 0.062 | | clip_range | 0.2 | | entropy_loss | -3.26 | | explained_variance | 0.924 | | learning_rate | 0.0003 | | loss | -0.0377 | | n_updates | 1350 | | policy_gradient_loss | -0.00497 | | std | 1.24 | | value_loss | 0.00797 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1713 | | iterations | 137 | | time_elapsed | 1310 | | total_timesteps | 2244608 | | train/ | | | approx_kl | 0.0072454046 | | clip_fraction | 0.0747 | | clip_range | 0.2 | | entropy_loss | -3.25 | | explained_variance | 0.94 | | learning_rate | 0.0003 | | loss | -0.0366 | | n_updates | 1360 | | policy_gradient_loss | -0.00572 | | std | 1.23 | | value_loss | 0.00852 | ------------------------------------------ Eval num_timesteps=2250000, episode_reward=-36.00 +/- 38.67 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -36 | | time/ | | | total_timesteps | 2250000 | | train/ | | | approx_kl | 0.005690419 | | clip_fraction | 0.0546 | | clip_range | 0.2 | | entropy_loss | -3.25 | | explained_variance | 0.957 | | learning_rate | 0.0003 | | loss | -0.0376 | | n_updates | 1370 | | policy_gradient_loss | -0.00425 | | std | 1.23 | | 
value_loss | 0.00524 | ----------------------------------------- [Diag @ 2,250,000 | n_sheep=4 | success=0%] NEVER_COMPACT 13/20 COMPACT_CANT_DRIVE 7/20 action_mag mean=0.416 p10=0.038 p90=0.887 (0=stopped, 1=full speed) min_flock_radius mean=6.62m best=2.03m (target <5m to compact) min_dog_to_com mean=3.54m best=0.40m (FLEE_DIST=7m) min_com_to_pen mean=14.24m best=9.65m reward/step (mean): progress=-0.0070 alignment=+0.0000 pen_bonus=+0.0005 step_cost=-0.0200 complete=+0.0000 -------------------------------- | time/ | | | fps | 1690 | | iterations | 138 | | time_elapsed | 1337 | | total_timesteps | 2260992 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1696 | | iterations | 139 | | time_elapsed | 1342 | | total_timesteps | 2277376 | | train/ | | | approx_kl | 0.0072061084 | | clip_fraction | 0.0728 | | clip_range | 0.2 | | entropy_loss | -3.25 | | explained_variance | 0.954 | | learning_rate | 0.0003 | | loss | -0.0312 | | n_updates | 1380 | | policy_gradient_loss | -0.00512 | | std | 1.23 | | value_loss | 0.006 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1702 | | iterations | 140 | | time_elapsed | 1347 | | total_timesteps | 2293760 | | train/ | | | approx_kl | 0.0066916933 | | clip_fraction | 0.0626 | | clip_range | 0.2 | | entropy_loss | -3.24 | | explained_variance | 0.939 | | learning_rate | 0.0003 | | loss | -0.0408 | | n_updates | 1390 | | policy_gradient_loss | -0.00463 | | std | 1.23 | | value_loss | 0.00827 | ------------------------------------------ Eval num_timesteps=2300000, episode_reward=-43.65 +/- 42.86 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -43.7 | | time/ | | | total_timesteps | 2300000 | | train/ | | | approx_kl | 0.0062987795 | | clip_fraction | 0.0609 | | clip_range | 0.2 | | entropy_loss | -3.26 | | explained_variance | 0.898 | | 
learning_rate | 0.0003 | | loss | -0.0316 | | n_updates | 1400 | | policy_gradient_loss | -0.00442 | | std | 1.25 | | value_loss | 0.00955 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1691 | | iterations | 141 | | time_elapsed | 1365 | | total_timesteps | 2310144 | -------------------------------- ----------------------------------------- | time/ | | | fps | 1696 | | iterations | 142 | | time_elapsed | 1371 | | total_timesteps | 2326528 | | train/ | | | approx_kl | 0.005443076 | | clip_fraction | 0.054 | | clip_range | 0.2 | | entropy_loss | -3.27 | | explained_variance | 0.877 | | learning_rate | 0.0003 | | loss | -0.0296 | | n_updates | 1410 | | policy_gradient_loss | -0.00375 | | std | 1.24 | | value_loss | 0.00928 | ----------------------------------------- ----------------------------------------- | time/ | | | fps | 1701 | | iterations | 143 | | time_elapsed | 1376 | | total_timesteps | 2342912 | | train/ | | | approx_kl | 0.004740049 | | clip_fraction | 0.0456 | | clip_range | 0.2 | | entropy_loss | -3.26 | | explained_variance | 0.922 | | learning_rate | 0.0003 | | loss | -0.0318 | | n_updates | 1420 | | policy_gradient_loss | -0.00351 | | std | 1.24 | | value_loss | 0.0156 | ----------------------------------------- Eval num_timesteps=2350000, episode_reward=-37.57 +/- 37.78 Episode length: 2000.00 +/- 0.00 ------------------------------------------ | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -37.6 | | time/ | | | total_timesteps | 2350000 | | train/ | | | approx_kl | 0.0056120222 | | clip_fraction | 0.0542 | | clip_range | 0.2 | | entropy_loss | -3.27 | | explained_variance | 0.911 | | learning_rate | 0.0003 | | loss | -0.0272 | | n_updates | 1430 | | policy_gradient_loss | -0.0035 | | std | 1.25 | | value_loss | 0.00811 | ------------------------------------------ -------------------------------- | time/ | | | fps | 1690 | | iterations | 144 | | time_elapsed | 1395 | | total_timesteps | 
2359296 | -------------------------------- ------------------------------------------ | time/ | | | fps | 1695 | | iterations | 145 | | time_elapsed | 1401 | | total_timesteps | 2375680 | | train/ | | | approx_kl | 0.0064737825 | | clip_fraction | 0.0697 | | clip_range | 0.2 | | entropy_loss | -3.28 | | explained_variance | 0.93 | | learning_rate | 0.0003 | | loss | -0.036 | | n_updates | 1440 | | policy_gradient_loss | -0.00403 | | std | 1.25 | | value_loss | 0.00488 | ------------------------------------------ ------------------------------------------ | time/ | | | fps | 1699 | | iterations | 146 | | time_elapsed | 1407 | | total_timesteps | 2392064 | | train/ | | | approx_kl | 0.0050720195 | | clip_fraction | 0.0466 | | clip_range | 0.2 | | entropy_loss | -3.29 | | explained_variance | 0.902 | | learning_rate | 0.0003 | | loss | -0.0374 | | n_updates | 1450 | | policy_gradient_loss | -0.00283 | | std | 1.26 | | value_loss | 0.00958 | ------------------------------------------ Eval num_timesteps=2400000, episode_reward=-42.55 +/- 37.89 Episode length: 2000.00 +/- 0.00 ----------------------------------------- | eval/ | | | mean_ep_length | 2e+03 | | mean_reward | -42.6 | | time/ | | | total_timesteps | 2400000 | | train/ | | | approx_kl | 0.005990128 | | clip_fraction | 0.0565 | | clip_range | 0.2 | | entropy_loss | -3.31 | | explained_variance | 0.869 | | learning_rate | 0.0003 | | loss | -0.0448 | | n_updates | 1460 | | policy_gradient_loss | -0.0051 | | std | 1.27 | | value_loss | 0.00854 | ----------------------------------------- [Diag @ 2,400,000 | n_sheep=4 | success=0%] NEVER_COMPACT 15/20 COMPACT_CANT_DRIVE 5/20 action_mag mean=0.424 p10=0.025 p90=0.948 (0=stopped, 1=full speed) min_flock_radius mean=7.66m best=1.63m (target <5m to compact) min_dog_to_com mean=4.77m best=0.32m (FLEE_DIST=7m) min_com_to_pen mean=14.47m best=8.96m reward/step (mean): progress=-0.0008 alignment=+0.0000 pen_bonus=+0.0003 step_cost=-0.0200 complete=+0.0000 
--------------------------------
| time/              |         |
|    fps             | 1677    |
|    iterations      | 147     |
|    time_elapsed    | 1435    |
|    total_timesteps | 2408448 |
--------------------------------
Training complete. Artefacts saved to runs/ppo_fix_check/