When one agent in your multi-agent system fails, the failure shouldn't cascade. Circuit breakers, graceful degradation, and bounded retries — patterns from microservice resilience — apply directly to A2A and are often missing in early implementations.
Circuit breaker per agent
Track success rate over a sliding window. Below threshold (e.g., 50% success): open circuit, stop calling for a cool-down period. Probe occasionally to detect recovery. Standard pattern; underused in early A2A code.
Bulkhead pattern
Limit concurrent calls per remote agent. One slow agent should consume at most N threads/connections, not all of them. Otherwise one slow downstream takes down your service.
Graceful degradation
When a non-critical agent fails: continue with reduced functionality. 'Search results are basic right now' beats 'search is down'. The user might not notice; the critical path stays alive.
Bounded retries
Retry on transient failures (5xx, timeouts) — but bounded: 1-3 attempts max, exponential backoff. Don't retry indefinitely. Don't retry on 4xx (your bug, not theirs). Surface remaining failures fast.
Health probing
Active health checks: periodically call the agent's health endpoint. Faster detection of failures than waiting for real call to fail. Trade-off: extra cost vs faster failure detection. Worth it for critical dependencies.