REINFORCE

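∇_θ J(θ) = E[∇_θ log π_θ(a|s) · R], where R is the return of the sampled trajectory. The estimate is unbiased but has high variance; subtracting a baseline b(s) from R (e.g. a learned value function V(s)) reduces the variance without biasing the gradient.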
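
A minimal sketch of the estimator above, assuming a stateless softmax policy over a few discrete actions (a bandit-style setting, not something the notes specify). The names `softmax` and `reinforce_gradient` are hypothetical; for a softmax policy the score ∇ log π(a) with respect to logit i is 1[i = a] - π(i).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, actions, returns, baseline=0.0):
    """Monte Carlo estimate of grad J w.r.t. the logits.

    Assumption: the policy is a stateless softmax, so theta is the logits.
    actions[k] is the sampled action index, returns[k] its return R.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, ret in zip(actions, returns):
        for i in range(len(logits)):
            # Score function of a softmax policy: 1[i == a] - pi(i)
            score = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += score * (ret - baseline)
    return [g / len(actions) for g in grad]


# Two samples: action 0 earned return 1, action 1 earned return 0.
grad = reinforce_gradient([0.0, 0.0], actions=[0, 1], returns=[1.0, 0.0])
```

A gradient-ascent step on the logits with this estimate raises the probability of action 0, the action that earned the higher return. Passing `baseline=0.5` (the mean return here) leaves the expected gradient unchanged while shrinking the per-sample terms.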

Actor-critic

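Actor: policy π_θ(a|s). Critic: value estimate V(s). The critic supplies the baseline: weighting ∇ log π_θ(a|s) by the advantage A(s,a) = Q(s,a) - V(s) instead of the raw return reduces variance.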
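
In practice Q(s,a) is rarely available, so the advantage is estimated. One common choice, assumed here rather than taken from the notes, is the one-step TD estimate A(s_t, a_t) ≈ r_t + γ V(s_{t+1}) - V(s_t). The function name `td_advantages` and the default `gamma` are hypothetical.

```python
def td_advantages(rewards, values, gamma=0.99):
    """One-step TD advantage estimates for a single trajectory.

    rewards[t] is the reward after step t.
    values has len(rewards) + 1 entries: the critic's V(s_t) for each
    visited state plus a bootstrap value for the final state
    (use 0.0 if the episode terminated there).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    return [
        r + gamma * values[t + 1] - values[t]
        for t, r in enumerate(rewards)
    ]


# Two-step episode that ends in a terminal state (bootstrap value 0.0).
advantages = td_advantages([1.0, 0.0], [0.5, 1.0, 0.0], gamma=0.9)
```

A positive entry means the step turned out better than the critic expected, so the actor update raises the probability of that action; a negative entry lowers it.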

PPO

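Proximal Policy Optimization. Clips the probability ratio r(θ) = π_θ(a|s) / π_θ_old(a|s) to [1-ε, 1+ε] inside the surrogate objective, which removes the incentive for large, destructive policy updates. A widely used default among policy-gradient methods.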
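
A minimal sketch of the clipped surrogate objective, L = E[min(r · A, clip(r, 1-ε, 1+ε) · A)], computed on precomputed ratios and advantages. The function name `ppo_clip_objective` is hypothetical, and ε = 0.2 is a commonly used value assumed here, not one given in the notes.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Average clipped surrogate objective (to be maximized).

    ratios[k] is pi_new(a|s) / pi_old(a|s) for sample k,
    advantages[k] is the advantage estimate for the same sample.
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Restrict the ratio to [1 - epsilon, 1 + epsilon]
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum makes the objective a pessimistic bound
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


# The ratio moved far above 1 on a good action: the gain is capped at 1.2 * A.
capped = ppo_clip_objective([1.5], [1.0])
```

Because of the outer `min`, clipping only removes gains from moving the ratio further in the direction the advantage favors; a move that makes the objective worse (for example a ratio of 0.5 on a positive advantage) is still penalized in full.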