Spot-checking 10 agent traces tells you almost nothing about quality at production scale. Production-scale evaluation is its own discipline: trace replay, LLM-judge calibration, regression catching, and drift monitoring. A mature stack has 5-10 layers.

Trace capture and replay

Capture every prompt, response, and tool call. Replay the captured traces against new model or prompt versions to find regressions before deploy. Tools: Langfuse, Arize Phoenix, OpenLLMetry.
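A minimal sketch of the replay step, assuming traces are already captured in memory. The `Trace` fields, the `replay` function, and the `is_equivalent` comparison are hypothetical names for illustration; they are not the API of Langfuse, Arize Phoenix, or OpenLLMetry. A changed output is only a candidate regression, so the function returns diffs for review rather than a verdict.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """One captured interaction (hypothetical schema, not a vendor format)."""
    trace_id: str
    prompt: str
    response: str
    tool_calls: list = field(default_factory=list)


def replay(
    traces: list,
    candidate: Callable[[str], str],
    is_equivalent: Callable[[str, str], bool],
) -> list:
    """Re-run each captured prompt through `candidate`.

    Returns (trace_id, old_response, new_response) for every trace whose
    new output is not considered equivalent to the captured one.
    """
    diffs = []
    for trace in traces:
        new_response = candidate(trace.prompt)
        if not is_equivalent(trace.response, new_response):
            diffs.append((trace.trace_id, trace.response, new_response))
    return diffs


if __name__ == "__main__":
    captured = [
        Trace("t1", "2+2?", "4"),
        Trace("t2", "capital of France?", "Paris"),
    ]

    # Stand-in for a new model or prompt version (assumption for the demo).
    def new_version(prompt: str) -> str:
        return "4" if prompt == "2+2?" else "Lyon"

    for diff in replay(captured, new_version, lambda a, b: a.strip() == b.strip()):
        print(diff)
```

Exact string match is the strictest possible comparison. In practice `is_equivalent` would more likely be a rubric score or a semantic check, since agent outputs rarely repeat word for word.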

LLM-judge with calibration

Use a strong model to grade traces against a rubric. Calibrate the judge against human grades on a sample (target >0.7 Pearson correlation). Re-calibrate whenever the judge model changes.
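The calibration check can be sketched as a Pearson correlation between judge scores and human scores on the same sample of traces. The function names and the example scores below are made up for illustration; the 0.7 threshold is the target stated above.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A judge (or human) that gives every trace the same score
        # has no defined correlation.
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def is_calibrated(judge_scores: list, human_scores: list, threshold: float = 0.7) -> bool:
    """True when the judge agrees with humans above the target correlation."""
    return pearson(judge_scores, human_scores) > threshold


if __name__ == "__main__":
    # Illustrative rubric scores (1-5) for the same five traces.
    judge = [1, 2, 3, 4, 5]
    human = [1, 2, 3, 5, 4]
    print(pearson(judge, human), is_calibrated(judge, human))
```

Correlation only measures whether the judge ranks traces the way humans do. A judge that is consistently one point too generous would still pass, so it may be worth also comparing mean scores.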

Drift monitoring

Track production quality scores over time. A sudden drop often means the model provider changed something. A slow drop often means the input distribution shifted. Both need investigation; both are easy to miss without tracking.
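One rough way to separate the two patterns is to compare windowed averages of a daily quality score. This is a sketch under assumptions: `classify_drift`, the window size, and both thresholds are illustrative choices, not values from any monitoring tool, and they would need tuning against real score noise.

```python
from statistics import mean


def classify_drift(
    scores: list,
    window: int = 7,
    sudden_drop: float = 0.10,
    slow_drop: float = 0.05,
) -> str:
    """Classify a chronological series of daily mean quality scores.

    sudden_drop: the latest window fell sharply versus the window just before it.
    slow_drop:   the latest window is below the earliest window (the baseline),
                 without any sharp step between adjacent windows.
    """
    if len(scores) < 3 * window:
        return "insufficient_data"

    baseline = mean(scores[:window])                  # earliest window
    previous = mean(scores[-2 * window:-window])      # window before the latest
    recent = mean(scores[-window:])                   # latest window

    if previous - recent >= sudden_drop:
        return "sudden_drop"
    if baseline - recent >= slow_drop:
        return "slow_drop"
    return "stable"


if __name__ == "__main__":
    step_change = [0.9] * 6 + [0.7] * 3
    gradual = [0.90, 0.90, 0.90, 0.88, 0.87, 0.86, 0.84, 0.83, 0.82]
    print(classify_drift(step_change, window=3))
    print(classify_drift(gradual, window=3))
```

The label only says which pattern the numbers match. Confirming the cause still means checking provider changelogs for a step change, or comparing recent inputs against older ones for a gradual decline.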

Trace capture + replay + calibrated judge + drift tracking. Spot-checking is a starting point, not a finish line.