A2A Failure Isolation — Belgavi.AI Lab

When one agent in your multi-agent system fails, the failure shouldn't cascade. Circuit breakers, graceful degradation, and bounded retries — patterns from microservice resilience — apply directly to A2A and are often missing in early implementations.

Advertisement

Circuit breaker per agent

Track success rate over a sliding window. Below threshold (e.g., 50% success): open circuit, stop calling for a cool-down period. Probe occasionally to detect recovery. Standard pattern; underused in early A2A code.

Bulkhead pattern

Limit concurrent calls per remote agent. One slow agent should consume at most N threads/connections, not all of them. Otherwise one slow downstream takes down your service.

Advertisement

Graceful degradation

When a non-critical agent fails: continue with reduced functionality. 'Search results are basic right now' beats 'search is down'. The user might not notice; the critical path stays alive.

Bounded retries

Retry on transient failures (5xx, timeouts) — but bounded: 1-3 attempts max, exponential backoff. Don't retry indefinitely. Don't retry on 4xx (your bug, not theirs). Surface remaining failures fast.

Health probing

Active health checks: periodically call the agent's health endpoint. Faster detection of failures than waiting for real call to fail. Trade-off: extra cost vs faster failure detection. Worth it for critical dependencies.

Circuit breakers, bulkheads, graceful degradation, bounded retries, health probing. Standard resilience patterns; apply directly to A2A.