Safety eval for agents is iterative red-teaming, not a single benchmark run. The threat model evolves as agents get new tools and users get more sophisticated. The practice is closer to security testing than ML evaluation.

Build a red-team test set

Categories: prompt injection (try to override system prompt), capability boundary (try to call unauthorized tools), social engineering (impersonate authority), data exfiltration (try to surface other users' data), harmful content generation. 100-500 examples per category to start.
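
One way to keep such a set organized is a small typed record per attack plus a coverage check per category. The sketch below is illustrative only: the names `Category`, `RedTeamCase`, `coverage_gaps`, and the `expected_behavior` labels are assumptions, not part of any particular framework.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The five categories listed above.
    PROMPT_INJECTION = "prompt_injection"
    CAPABILITY_BOUNDARY = "capability_boundary"
    SOCIAL_ENGINEERING = "social_engineering"
    DATA_EXFILTRATION = "data_exfiltration"
    HARMFUL_CONTENT = "harmful_content"


@dataclass(frozen=True)
class RedTeamCase:
    case_id: str
    category: Category
    attack_prompt: str
    # Hypothetical label for what a safe agent should do, e.g. "refuse".
    expected_behavior: str


def coverage_gaps(cases, minimum=100):
    """Return {category: count} for every category with fewer than `minimum` cases."""
    counts = Counter(case.category for case in cases)
    # Counter returns 0 for categories that have no cases at all.
    return {cat: counts[cat] for cat in Category if counts[cat] < minimum}
```

Running `coverage_gaps` on the current set shows which categories are still below the starting target of 100 examples.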

Automated re-running

On every agent version: re-run the full red-team set. Track pass rate per category. Block deploy on regressions. Same model + same prompt + same decoding settings should give the same results; run-to-run variation flags non-determinism issues too.
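
A deploy gate along these lines can be sketched in a few functions. This is a minimal sketch under assumptions: results arrive as `(category, passed)` pairs, and the helper names `pass_rates`, `regressions`, and `flaky_cases` are hypothetical, not from an existing tool.

```python
def pass_rates(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals, passes = {}, {}
    for category, passed in results:
        totals[category] = totals.get(category, 0) + 1
        passes[category] = passes.get(category, 0) + (1 if passed else 0)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Return {category: (baseline_rate, candidate_rate)} for each category
    whose pass rate dropped by more than `tolerance`."""
    out = {}
    for category, base_rate in baseline.items():
        # A category missing from the candidate run counts as 0.0 (assumption).
        new_rate = candidate.get(category, 0.0)
        if new_rate < base_rate - tolerance:
            out[category] = (base_rate, new_rate)
    return out


def flaky_cases(run_a, run_b):
    """Case ids whose outcome differs between two runs of the same version.
    run_a and run_b map case id -> passed (bool)."""
    return sorted(cid for cid in run_a if cid in run_b and run_a[cid] != run_b[cid])
```

In CI, a non-empty result from `regressions` would fail the build, while a non-empty result from `flaky_cases` would be filed as a non-determinism issue rather than a safety regression.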

Human red-team rounds

Quarterly: bring in a red team (internal or external) to find new attacks the automated set doesn't cover. Add every successful attack to the automated set. The set should only grow, never shrink.
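
The grow-only rule can be enforced in code by making the merge append-only. The sketch below assumes findings arrive as dicts with `prompt`, `category`, and `succeeded` keys. That shape, and the helper names `fingerprint` and `merge_findings`, are illustrative assumptions.

```python
import hashlib


def fingerprint(prompt):
    """Stable id for a prompt, ignoring case and whitespace differences."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_findings(test_set, findings):
    """Return a new test set with successful attacks added.

    test_set maps fingerprint -> case dict. Existing cases are never
    removed or overwritten, so the result can only grow.
    """
    merged = dict(test_set)
    for finding in findings:
        if not finding["succeeded"]:
            continue  # only attacks that actually worked become regression cases
        key = fingerprint(finding["prompt"])
        merged.setdefault(
            key, {"prompt": finding["prompt"], "category": finding["category"]}
        )
    return merged
```

Deduplicating on a normalized fingerprint keeps trivially reworded copies of the same attack from inflating the per-category counts.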

Production monitoring

Sample real conversations through safety classifiers. A spike in classifier fires can signal a new attack class in the wild. Investigate, and add confirmed attacks to the test set within a week: every day in production with the gap is exposure.
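
One simple way to define a "spike" is to compare today's classifier fire rate with a trailing baseline. The sketch below uses a mean plus k standard deviations rule. The rule, the default of 3 sigmas, and the 7-day minimum history are assumptions chosen for illustration, not recommended values.

```python
from statistics import mean, pstdev


def is_spike(history, today, sigmas=3.0, min_history=7):
    """True if today's fire rate sits more than `sigmas` standard deviations
    above the trailing mean of `history` (a list of daily fire rates)."""
    if len(history) < min_history:
        return False  # not enough data for a meaningful baseline
    baseline = mean(history)
    # Guard against a zero spread when the history is perfectly flat.
    spread = max(pstdev(history), 1e-9)
    return today > baseline + sigmas * spread
```

A flagged day is only a prompt to investigate: the cause could be a new attack class, but it could also be a traffic shift or a classifier change.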

Don't conflate safety and quality

Quality eval: 'did it answer well?' Safety eval: 'did it refuse what it should refuse and allow what it should allow?' Different signals. An agent that refuses everything never produces unsafe output, but it also has 0% utility. Track both.
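
The refuse-everything failure mode becomes visible once refusals are scored separately on requests that should be refused and requests that should be allowed. A minimal sketch, assuming each record is a `(should_refuse, did_refuse)` pair of booleans; the function name and metric keys are hypothetical.

```python
def refusal_metrics(records):
    """records: iterable of (should_refuse, did_refuse) boolean pairs.

    Returns the refusal rate on requests that should be refused and the
    over-refusal rate on requests that should be allowed. A rate is None
    when there are no records of that kind.
    """
    records = list(records)
    harmful = [did for should, did in records if should]
    benign = [did for should, did in records if not should]
    return {
        "refusal_rate_on_harmful": sum(harmful) / len(harmful) if harmful else None,
        "over_refusal_rate_on_benign": sum(benign) / len(benign) if benign else None,
    }
```

An agent that refuses every request scores 1.0 on both numbers: the first looks perfect, and the second shows that no legitimate request is being served.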

Red-team test set per category. Automated re-run. Quarterly human red-team. Prod monitoring with safety classifiers. Safety and quality are distinct.