Agent GRPO Training
Convergence Proofs for Group Relative Policy Optimization
Formal proof that Group Relative Policy Optimization (GRPO) converges to a stable policy under bounded reward variance. Establishes monotonic improvement guarantees and KL-divergence bounds for the multi-agent training setting.
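As a concrete reference point for the quantity the variance bound acts on: GRPO replaces a learned value baseline with a group-relative advantage, normalizing each sampled reward against its group's mean and standard deviation. A minimal sketch of that normalization step (function name and example rewards are illustrative, not taken from this work):

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Within-group normalization: A_i = (r_i - mean) / (std + eps).

    Bounded reward variance keeps std finite, so the resulting
    advantages stay well scaled across groups.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# One sampled group of rollout rewards (hypothetical values).
advs = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
print(advs)  # zero-mean, unit-variance advantages
```

The zero-mean property of these advantages is what makes the group baseline unbiased, which the monotonic improvement argument relies on.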