Safety eval for agents is iterative red-teaming, not a single benchmark run. The threat model evolves as agents gain new tools and users get more sophisticated, so the practice is closer to security testing than to standard ML evaluation.
Build a red-team test set
Categories: prompt injection (try to override system prompt), capability boundary (try to call unauthorized tools), social engineering (impersonate authority), data exfiltration (try to surface other users' data), harmful content generation. 100-500 examples per category to start.
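One way to keep the set organized is a small typed record per example plus a check that every category has enough coverage. This is a minimal sketch: the names `Category`, `RedTeamCase`, and `undersized_categories`, and the `expected` field, are illustrative assumptions rather than part of any particular framework.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The five categories listed above; names are illustrative.
    PROMPT_INJECTION = "prompt_injection"
    CAPABILITY_BOUNDARY = "capability_boundary"
    SOCIAL_ENGINEERING = "social_engineering"
    DATA_EXFILTRATION = "data_exfiltration"
    HARMFUL_CONTENT = "harmful_content"


@dataclass(frozen=True)
class RedTeamCase:
    case_id: str
    category: Category
    prompt: str
    expected: str  # assumed convention: "refuse" or "allow"


def undersized_categories(cases, minimum=100):
    """Return {category: count} for every category with fewer than `minimum` cases."""
    counts = Counter(case.category for case in cases)
    return {
        category: counts.get(category, 0)
        for category in Category
        if counts.get(category, 0) < minimum
    }
```

Running `undersized_categories(cases)` before each release gives a quick view of which categories still fall short of the 100-example starting point.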
Automated re-running
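On every agent version: re-run the full red-team set. Track pass rate per category. Block deploy on regressions. With deterministic decoding (temperature 0, fixed seed where supported), the same model and prompt should give near-identical results; run-to-run variation on an unchanged version flags non-determinism, which is worth tracking separately from real regressions.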
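The per-category tracking and deploy gate can be sketched roughly as follows. The function names, the `(category, passed)` result shape, and the `tolerance` parameter are assumptions for illustration; a real pipeline would plug in its own result format.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs, one per red-team case."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Return {category: (baseline_rate, candidate_rate)} for every category
    whose pass rate dropped by more than `tolerance`.

    A category missing from the candidate run is treated as a 0.0 pass rate,
    so silently dropping a category also counts as a regression.
    """
    dropped = {}
    for category, base_rate in baseline.items():
        new_rate = candidate.get(category, 0.0)
        if new_rate < base_rate - tolerance:
            dropped[category] = (base_rate, new_rate)
    return dropped
```

A CI step could fail the build whenever `regressions(...)` returns a non-empty dict. A small non-zero `tolerance` is one option if sampling noise makes exact comparison too strict.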
Human red-team rounds
Quarterly: bring in a red team (internal or external) to find new attacks the automated set doesn't cover. Add every successful attack to the automated set. The set should only grow, never shrink.
Production monitoring
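Sample real conversations and run them through safety classifiers. A spike in classifier fires can signal a new attack class in the wild (or classifier drift, so confirm before concluding). Investigate and add confirmed attacks to the test set within a week; every day the gap stays open in production is exposure.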
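What counts as a "spike" needs a definition. One simple option, sketched here as an assumption rather than a recommended detector, is to compare today's fire rate against the trailing mean plus a multiple of the standard deviation. The function name `is_spike` and the defaults `k=3.0` and `min_std=0.001` are illustrative.

```python
from statistics import mean, pstdev


def is_spike(history, today, k=3.0, min_std=0.001):
    """Flag `today` if it exceeds the trailing baseline by more than k deviations.

    history: past daily fire rates (fraction of sampled conversations flagged).
    min_std: floor on the deviation so a perfectly flat history does not
             turn every tiny uptick into an alert.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = max(pstdev(history), min_std)
    return today > baseline + k * spread
```

For example, with a history hovering around a 1% fire rate, a day at 3% would be flagged while a day at 1.1% would not. A flag is a prompt to investigate, not proof of a new attack.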
Don't conflate safety and quality
Quality eval: 'did it answer well?' Safety eval: 'did it refuse what it should refuse and allow what it should allow?' Different signals. An agent that refuses everything scores 100% on the must-refuse cases and 0% on the must-allow cases, so it is safe in the narrow sense and useless in practice. Track both.
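The two halves of the safety question can be reported as separate numbers so that over-refusal is visible. A minimal sketch, assuming each test outcome is recorded as an `(expected, actual)` pair with values `"refuse"` or `"allow"` (a hypothetical convention, not a standard one):

```python
def _rate(actuals, wanted):
    """Fraction of `actuals` equal to `wanted`, or None if there are no cases."""
    if not actuals:
        return None
    return sum(actual == wanted for actual in actuals) / len(actuals)


def safety_scores(outcomes):
    """outcomes: iterable of (expected, actual) pairs, each "refuse" or "allow".

    refusal_rate: share of must-refuse cases the agent actually refused.
    allow_rate:   share of must-allow cases the agent actually answered.
    """
    outcomes = list(outcomes)  # allow generators; we iterate twice
    refuse_cases = [actual for expected, actual in outcomes if expected == "refuse"]
    allow_cases = [actual for expected, actual in outcomes if expected == "allow"]
    return {
        "refusal_rate": _rate(refuse_cases, "refuse"),
        "allow_rate": _rate(allow_cases, "allow"),
    }
```

An agent that refuses every request gets `refusal_rate` 1.0 and `allow_rate` 0.0 under this scoring, which is the failure mode a single combined number would hide.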