In single-agent systems, failures are rare and obvious. In multi-agent systems, they're frequent and subtle: called agent returns plausible-looking garbage, takes 30 seconds when 2 is expected, or returns success but did nothing. Recovery is a first-class design concern.

Advertisement

Timeouts everywhere

Every A2A call has a deadline propagated from the original request. No infinite waits. If a called agent doesn't respond, fall back: retry once, then degrade gracefully or surface the failure.

Verification calls

Critical actions (payment, account changes) call back to verify completion. 'Did you actually transfer $100?' rather than trusting the response. Adds latency but catches lies and silent failures.

Advertisement

Idempotency keys

Same key as a retry safety net. Called agent sees same idempotency key, returns same response, doesn't duplicate side effects. The agent that calls must include the key in its request envelope.

Circuit breakers per agent

If agent X has failed 5 times in 30 seconds, stop calling for 60 seconds. Surface 'X is unavailable' to upstream rather than hanging. Standard SRE pattern applied to agent calls.

Quality monitoring

Track success rates, latency p99, and output-quality signals (downstream rejection rate, user-reported issues) per agent. Agents degrade silently — observability catches it.

Timeouts + verification + idempotency + circuit breakers + quality monitoring. Treat A2A calls like any flaky network dependency.