Posted July 21, 2025 by Shivani
Parameter | Default Model | SearchAgent (Custom) | Description |
--- | --- | --- | --- |
trainer_type | ppo | ppo | Same PPO algorithm as the default |
max_steps | 500,000 | 3,000,000 | Extended training budget to give the policy more time to learn and converge |
summary_freq | 50,000 | 10,000 | More frequent summaries for closer monitoring |
**HYPERPARAMETERS** | | | |
learning_rate | 3e-4 | 3e-4 | No change |
batch_size | 1024 | 1024 | No change |
buffer_size | 10,240 | 10,240 | No change |
beta | not set | 2.5e-4 | Entropy regularisation strength; set well below the default so the entropy bonus is weak and policy entropy falls faster |
epsilon | not set | 0.2 | PPO clipping parameter to control policy updates |
lambd | not set | 0.95 | GAE lambda for bias-variance trade-off in advantage estimation |
num_epoch | not set | 3 | Number of passes over data per policy update |
learning_rate_schedule | linear | linear | Gradual learning rate decay over training |
**NETWORK SETTINGS** | | | |
hidden_units | 128 | 256 | Larger network capacity to learn more complex features |
num_layers | 2 | 2 | Same depth, balancing expressiveness against training speed |
normalize | false | true | Normalise vector observations to stabilise and speed up training |
**REWARD SIGNALS** | | | |
**EXTRINSIC** | | | |
gamma | 0.99 | 0.99 | No change |
strength | 1.0 | 1.0 | No change |
**CURIOSITY** | | | |
strength | not set | 0.1 | Adds an intrinsic curiosity reward to drive exploration |
gamma | not set | 0.99 | Discount factor for curiosity rewards |
learning_rate | not set | 3e-4 | Learning rate for the curiosity module, matching the policy learning rate |
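
Put together, the SearchAgent column corresponds to an ML-Agents trainer configuration along the lines of the sketch below. This is an illustrative reconstruction rather than the exact file: the behaviour name `SearchAgent` and the file layout are assumptions, and the parameters listed as "not set" above simply fall back to ML-Agents defaults when omitted from the YAML.

```yaml
behaviors:
  SearchAgent:                    # assumed behaviour name, matching the agent above
    trainer_type: ppo
    max_steps: 3000000            # extended from the 500,000 default
    summary_freq: 10000           # more frequent summaries for closer monitoring
    hyperparameters:
      learning_rate: 3.0e-4
      batch_size: 1024
      buffer_size: 10240
      beta: 2.5e-4                # low entropy regularisation strength
      epsilon: 0.2                # PPO clipping parameter
      lambd: 0.95                 # GAE lambda
      num_epoch: 3                # passes over the buffer per policy update
      learning_rate_schedule: linear
    network_settings:
      hidden_units: 256           # doubled from the 128 default
      num_layers: 2
      normalize: true             # normalise vector observations
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:                  # intrinsic reward for exploration
        strength: 0.1
        gamma: 0.99
        learning_rate: 3.0e-4
```

A run using this file would then be launched with something like `mlagents-learn search_agent.yaml --run-id=SearchAgent_01`, where the file name and run ID are placeholders.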