Agent observability is qualitatively different from service observability. You're tracking not just 'is it up?' but 'did it reason well?' Per-turn traces, cost, and quality signals together give the picture; any one alone misleads.
Trace structure
Per conversation: one span per turn. Per turn: child spans for the model call, each tool call, and response generation. Attributes: tokens in/out, model, tools used, latency, cost. Keep it OpenTelemetry-compatible so it integrates with existing infra.
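A minimal sketch of that hierarchy, using only the standard library so the shape is visible without any tracing SDK. The `Span` class, the attribute keys (`tokens_in`, `cost_usd`, ...), and the values are illustrative assumptions, not a real schema; in practice an OpenTelemetry SDK would create the spans.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span: a name, attributes, children."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, **attributes: Any) -> "Span":
        """Create a child span, attach it to this span, and return it."""
        span = Span(name, dict(attributes))
        self.children.append(span)
        return span


def walk(span: Span, depth: int = 0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for c in span.children:
        yield from walk(c, depth + 1)


# One conversation, one turn, three child spans (all values are made up).
conversation = Span("conversation", {"conversation_id": "conv-123"})
turn = conversation.child("turn", turn_index=0)
turn.child("model_call", model="example-model",
           tokens_in=1200, tokens_out=85, latency_ms=640, cost_usd=0.0049)
turn.child("tool_call", tool="search_orders", latency_ms=210, cost_usd=0.001)
turn.child("response_generation", model="example-model",
           tokens_in=1500, tokens_out=230, latency_ms=900, cost_usd=0.008)

for depth, span in walk(conversation):
    print("  " * depth + span.name, span.attributes)
```

Keeping cost and token counts as span attributes means the per-turn totals can be derived by summing over a turn's children.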
Per-turn cost
Sum: input tokens × input price + output tokens × output price + tool call costs (API calls cost money too). Aggregate across users and conversations to find expensive patterns. Common finding: a heavy tail, e.g. 1% of conversations accounting for 30% of cost.
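The formula above can be sketched as follows. The prices are placeholders (quoted per million tokens, which is an assumption about how your provider bills), and `Pricing`, `turn_cost`, and `top_share` are hypothetical helper names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price sheet, in USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def turn_cost(tokens_in: int, tokens_out: int, pricing: Pricing,
              tool_costs: tuple[float, ...] = ()) -> float:
    """input tokens x input price + output tokens x output price + tool costs."""
    model_cost = (tokens_in * pricing.input_per_mtok
                  + tokens_out * pricing.output_per_mtok) / 1_000_000
    return model_cost + sum(tool_costs)


def top_share(cost_by_conversation: dict[str, float],
              fraction: float = 0.01) -> float:
    """Share of total cost spent by the most expensive `fraction` of conversations."""
    costs = sorted(cost_by_conversation.values(), reverse=True)
    total = sum(costs)
    if not costs or total == 0:
        return 0.0
    n = max(1, math.ceil(len(costs) * fraction))
    return sum(costs[:n]) / total


pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)  # placeholder prices
print(turn_cost(1000, 500, pricing, tool_costs=(0.002,)))
```

Running `top_share` over per-conversation totals is a quick way to check how heavy the tail is in your own traffic.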
Quality signals
User feedback (thumbs up/down). LLM-as-judge scores on sampled conversations. Task completion (did the user reach their goal?). Tool-call accuracy (did the right tool get called with the right args?). Each is partial; combine them.
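One way to combine partial signals is a weighted average over whichever signals are present for a conversation. This is only a sketch under stated assumptions: every signal is normalized to the range 0 to 1, the equal weights are arbitrary, and `QualitySignals` and `combined_score` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualitySignals:
    """Each signal is in [0, 1], or None when it was not collected."""
    user_feedback: Optional[float] = None       # thumbs down = 0.0, thumbs up = 1.0
    judge_score: Optional[float] = None         # LLM-as-judge score, rescaled
    task_completed: Optional[float] = None      # 1.0 if the user reached their goal
    tool_call_accuracy: Optional[float] = None  # fraction of correct tool calls


# Assumed weights; tune these against human-reviewed conversations.
WEIGHTS = {
    "user_feedback": 1.0,
    "judge_score": 1.0,
    "task_completed": 1.0,
    "tool_call_accuracy": 1.0,
}


def combined_score(signals: QualitySignals,
                   weights: dict[str, float] = WEIGHTS) -> Optional[float]:
    """Weighted mean of the signals that are present; None if there are none."""
    present = {name: getattr(signals, name) for name in weights
               if getattr(signals, name) is not None}
    if not present:
        return None
    total_weight = sum(weights[name] for name in present)
    return sum(weights[name] * value for name, value in present.items()) / total_weight


print(combined_score(QualitySignals(user_feedback=1.0, judge_score=0.5)))
```

Renormalizing over the present signals avoids penalizing conversations that simply were not sampled for judging or never received feedback.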
Sampling strategy
100% trace capture is expensive at scale. Sample by: keep all errors, all conversations flagged by users, all high-cost conversations, plus 1% of normal traffic. Eval-set-worthy traces get tagged automatically and surfaced for human review.
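The rules above can be sketched as a decision made once a conversation has finished, since errors and cost are only known at the end. The cost threshold, the `TraceSummary` fields, and the function names are assumptions for illustration. The 1% baseline hashes the conversation ID rather than calling a random number generator, so every span of a conversation gets the same decision.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical end-of-conversation summary used to decide retention."""
    conversation_id: str
    had_error: bool = False
    user_flagged: bool = False
    cost_usd: float = 0.0


HIGH_COST_USD = 0.50   # assumed threshold; pick one from your own cost distribution
BASELINE_RATE = 0.01   # keep 1% of otherwise unremarkable conversations


def in_baseline_sample(conversation_id: str, rate: float = BASELINE_RATE) -> bool:
    """Deterministically map the ID to [0, 1) and keep it if it falls below `rate`."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def sampling_decision(trace: TraceSummary) -> tuple[bool, str]:
    """Return (keep, reason); the reason can double as a tag for later review."""
    if trace.had_error:
        return True, "error"
    if trace.user_flagged:
        return True, "user_flagged"
    if trace.cost_usd >= HIGH_COST_USD:
        return True, "high_cost"
    if in_baseline_sample(trace.conversation_id):
        return True, "baseline"
    return False, "dropped"


print(sampling_decision(TraceSummary("conv-1", had_error=True)))
```

Storing the reason alongside the trace makes it easy to route the error, flagged, and high-cost buckets to human review separately from the baseline sample.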
Tools
Langfuse, Arize Phoenix, OpenLLMetry, LangSmith, Helicone. All do roughly the same thing at the metrics layer; they differ mainly in how well they integrate with eval workflows. Pick by what your team will actually use.