REINFORCE
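∇J(θ) = E[∇log π_θ(a|s) · R], where R is the return of the sampled trajectory. The estimate is unbiased but has high variance. Subtracting a baseline b(s), typically a learned value function V(s), reduces the variance without biasing the gradient.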
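A minimal sketch of the estimator above, assuming a softmax policy over the logits of a hypothetical 2-armed bandit (the bandit, its reward distribution, and all function names are illustrative assumptions, not part of these notes). The batch mean return stands in for a learned V(s) as the baseline.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def reinforce_gradient(logits, samples, baseline=0.0):
    """Monte Carlo estimate of E[grad log pi(a) * (R - baseline)]."""
    grad = [0.0] * len(logits)
    for action, ret in samples:
        g = grad_log_pi(logits, action)
        for k in range(len(logits)):
            grad[k] += g[k] * (ret - baseline) / len(samples)
    return grad

# Toy data (assumption): arm 1 pays about 2.0, arm 0 about 1.0.
rng = random.Random(0)
logits = [0.0, 0.0]
probs = softmax(logits)
samples = []
for _ in range(1000):
    a = 0 if rng.random() < probs[0] else 1
    r = rng.gauss(1.0 if a == 0 else 2.0, 0.1)
    samples.append((a, r))

mean_return = sum(r for _, r in samples) / len(samples)
g_plain = reinforce_gradient(logits, samples)
g_base = reinforce_gradient(logits, samples, baseline=mean_return)
print("no baseline:  ", g_plain)
print("with baseline:", g_base)
```

Both estimates should push probability toward the better-paying arm. The baseline only changes how noisy the per-sample terms are, not the direction they point in expectation.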
Actor-critic
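Actor: policy π_θ(a|s). Critic: value estimate V(s). Weighting the policy gradient by the advantage A(s,a) = Q(s,a) - V(s) instead of the raw return reduces variance. In practice A is often estimated by the one-step TD error r + γV(s') - V(s).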
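A minimal sketch of the critic side, assuming a tabular V stored in a dict and the one-step TD error as the advantage estimate (the state names, learning rate, and function names are hypothetical). The actor would then scale ∇log π_θ(a|s) by the returned advantage.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

def critic_update(values, state, next_state, reward,
                  lr=0.1, gamma=0.99, done=False):
    """Move V(state) toward the TD target and return the advantage used."""
    adv = td_advantage(reward, values[state], values[next_state], gamma, done)
    values[state] += lr * adv
    return adv

# Toy transition (assumption): s0 -> s1 with reward 0.5.
values = {"s0": 0.0, "s1": 1.0}
adv = critic_update(values, "s0", "s1", reward=0.5, lr=0.1, gamma=0.9)
print(adv, values["s0"])
```

A positive advantage means the action did better than the critic expected, so the actor raises its probability; a negative one lowers it.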
PPO
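Proximal Policy Optimization. Clips the probability ratio r(θ) = π_θ(a|s) / π_θ_old(a|s) to [1-ε, 1+ε] in the surrogate objective, so a single update cannot move the policy far from the one that collected the data. This prevents destructive updates. A common default among policy-gradient methods.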
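A minimal sketch of the clipped surrogate objective for a single sample, min(r·A, clip(r, 1-ε, 1+ε)·A), assuming ε = 0.2 (a commonly used value, chosen here only for illustration; the function names are hypothetical).

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def ppo_clip_batch(ratios, advantages, eps=0.2):
    """Mean clipped objective over a batch (to be maximized)."""
    terms = [ppo_clip_objective(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# ratio > 1 + eps with positive advantage: the gain is capped at (1 + eps) * A.
print(ppo_clip_objective(1.5, 1.0))
# ratio < 1 - eps with negative advantage: the term is capped at (1 - eps) * A.
print(ppo_clip_objective(0.5, -1.0))
# Clipping never hides a loss: a move in the wrong direction is fully penalized.
print(ppo_clip_objective(1.5, -1.0))
```

Taking the minimum makes the objective a pessimistic bound: the policy gets no extra credit for pushing the ratio outside the clip range, but is still penalized when the unclipped term is worse.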