Proximal Policy Optimization (PPO) is an advanced reinforcement learning algorithm for stable and efficient policy optimization.
- Prevents overly large policy updates with probability ratio clipping, ensuring stable training.
- Provides TRPO-level stability without complex second-order optimization.
- Combines policy (actor) and value function (critic) learning for efficient training.
- Uses Generalized Advantage Estimation (GAE) to reduce variance in policy updates.
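The GAE computation mentioned above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name and list-based inputs are assumptions, and a real agent would operate on tensors from an autograd framework.

```python
# Sketch of Generalized Advantage Estimation (GAE) for one finished
# trajectory. Names and list-based inputs are illustrative assumptions.
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute advantage estimates A_t via GAE(gamma, lambda).

    rewards: rewards r_t for t = 0..T-1
    values:  value estimates V(s_t) for t = 0..T (last entry bootstraps)
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Work backwards through time: A_t = delta_t + gamma*lam*A_{t+1},
    # where delta_t = r_t + gamma*V(s_{t+1}) - V(s_t) is the TD error.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting λ = 1 recovers plain Monte Carlo advantages (low bias, high variance); λ = 0 recovers one-step TD errors (high bias, low variance). The typical 0.9-0.99 range trades between the two.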
The core innovation of PPO is its clipped objective function:
L^CLIP(θ) = E_t[ min(r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t) ]
Where:
- r_t(θ) is the probability ratio between new and old policies
- A_t is the advantage estimate
- ε is a hyperparameter (typically 0.1-0.3)

| Hyperparameter | Typical Value | Effect |
|---|---|---|
| Learning Rate | 3e-4 to 3e-3 | Higher = faster but less stable learning |
| Clip ε | 0.1 to 0.3 | Smaller = more conservative updates |
| γ (Discount) | 0.99 to 0.999 | Higher = more future-focused |
| GAE λ | 0.9 to 0.99 | Higher = lower variance but more bias |
| Entropy Coeff | 0.01 to 0.05 | Higher = more exploration |
Start with these defaults and adjust based on your specific environment.
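The clipped objective defined earlier can be sketched in plain Python. This is an illustrative assumption-laden sketch (the function name and list-based inputs are invented for clarity); a real implementation would operate on autograd tensors, e.g. in PyTorch, so gradients flow through the new log-probabilities.

```python
import math

# Sketch of PPO's clipped surrogate objective over a batch of samples.
# Inputs are plain lists of per-sample log-probabilities and advantages;
# the name clipped_surrogate_loss is an illustrative assumption.
def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    total = 0.0
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        ratio = math.exp(lp_new - lp_old)               # r_t(θ)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r_t, 1-ε, 1+ε)
        # The min() makes the objective pessimistic: moving the ratio
        # far from 1 can never increase the objective past the clipped value.
        total += min(ratio * adv, clipped * adv)
    # Negated because optimizers minimize while PPO maximizes the surrogate.
    return -total / len(advantages)
```

When the new and old policies agree (ratio = 1), clipping has no effect; once the ratio leaves [1-ε, 1+ε] in the direction the advantage rewards, the gradient through that sample vanishes, which is what prevents overly large policy updates.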
| Algorithm | Type | Strengths | Weaknesses |
|---|---|---|---|
| PPO | On-policy | Stable, easy to tune, good for parallelization | Less sample efficient than off-policy methods |
| SAC | Off-policy | Very sample efficient, automatic entropy tuning | More sensitive to hyperparameters |
| DDPG | Off-policy | Good for high-dim continuous actions | Can be unstable, sensitive to hyperparams |
PPO is often the best choice when you value stability and ease of use over maximum sample efficiency.