Failure Detectors and Phi Accrual

Distributed systems constantly have to answer one question: is that node dead? Fixed timeouts have two failure modes: too short → false positives (flapping), too long → slow failure detection. A phi-accrual failure detector adapts to the observed heartbeat inter-arrival statistics and outputs a continuous suspicion score instead of a binary alive/dead verdict.

Fixed timeout pitfalls

Set timeout = 5s. Network blip > 5s → node falsely marked dead, leader election, churn. Set timeout = 60s → real failures take a minute to detect. No single fixed value is both fast and accurate, because the right value depends on how the network is behaving at the moment. A fixed time threshold is the wrong abstraction.

Phi accrual model

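Track inter-arrival times of heartbeats and fit a distribution to them. Compute phi = -log10(P_later(now - last_arrival)), where P_later(t) is the probability, under that distribution, that the next heartbeat arrives more than t after the previous one. High phi = the current silence would be very unlikely if the node were alive. Choose an action threshold (e.g., phi > 8 = treat as dead; under the model, a live node stays silent that long with probability about 10^-8). Adapts: on a stable network the distribution is narrow, so phi climbs quickly once a heartbeat is late; on a jittery network the distribution is wider, so the same silence yields a lower phi and late heartbeats are tolerated for longer.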
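
The computation can be sketched in a few lines. This is a minimal illustration, not a production detector: the class name PhiAccrualDetector, the normal model of inter-arrival times, and the min_std floor (which keeps a perfectly regular heartbeat from producing a zero standard deviation) are all assumptions made for the example.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Hypothetical minimal detector; assumes normally distributed intervals."""

    def __init__(self, window_size=1000, min_std=0.1):
        self.intervals = deque(maxlen=window_size)  # sliding window of samples
        self.last_arrival = None
        self.min_std = min_std  # assumed floor to avoid dividing by zero

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; 0.0 until there is enough history."""
        if len(self.intervals) < 2:
            return 0.0
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # P(interval > elapsed) for a normal distribution N(mu, sigma)
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # guard against log10(0)
        return -math.log10(p_later)


if __name__ == "__main__":
    detector = PhiAccrualDetector()
    for t in range(11):  # heartbeats at t = 0, 1, ..., 10 seconds
        detector.heartbeat(float(t))
    for now in (10.5, 11.0, 11.5, 12.0):
        print(f"t={now}: phi={detector.phi(now):.2f}")
```

With heartbeats exactly one second apart, phi stays near zero until the next heartbeat is due and then rises steeply as the silence extends past it.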

Tuning

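Keep a sliding window of the N most recent inter-arrival samples (1000 is a common choice). Threshold typically 8-12. Higher threshold = slower detection, fewer false positives. Lower = faster detection, more flapping. Pick by your acceptable false-positive rate: under the model, a threshold of phi corresponds to wrongly suspecting a live node with probability about 10^-phi.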
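
To get a feel for what a threshold costs in detection time, the phi formula can be inverted. The sketch below assumes a normal model of inter-arrival times; the function name detection_delay and the example mean and standard deviation are illustrative, not taken from any particular system.

```python
from statistics import NormalDist


def detection_delay(threshold, mean_interval, std_interval):
    """Silence (seconds since the last heartbeat) at which phi reaches
    `threshold`, assuming intervals ~ N(mean_interval, std_interval)."""
    p_later = 10.0 ** (-threshold)
    # P(X > t) = p_later  <=>  t = mu - sigma * z, where z = inv_cdf(p_later)
    # is the (negative) standard-normal quantile of the tail probability.
    z = NormalDist().inv_cdf(p_later)
    return mean_interval - std_interval * z


if __name__ == "__main__":
    for std in (0.2, 0.5):
        for threshold in (8, 10, 12):
            delay = detection_delay(threshold, mean_interval=1.0, std_interval=std)
            print(f"std={std}s threshold={threshold}: dead after {delay:.2f}s of silence")
```

Raising either the threshold or the observed jitter lengthens the silence required before a node is treated as dead, which is the trade-off described above.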

Integration

Phi-accrual outputs a suspicion score, not a binary verdict, so each consumer can apply its own policy. Leader election can use 'phi > X for K seconds'. Gossip protocols use it to mark nodes as 'down' for membership. In Cassandra, the gossip layer's phi-based detector marks a replica down, and hinted handoff then starts storing hints for it.
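
A 'phi > X for K seconds' rule is a small piece of state on top of the score. The following is one possible sketch; the class name SustainedSuspicion and its default values are hypothetical.

```python
class SustainedSuspicion:
    """Hypothetical gate: report 'down' only after phi has stayed above
    `threshold` for at least `hold_seconds` without dipping back."""

    def __init__(self, threshold=8.0, hold_seconds=3.0):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.suspect_since = None  # time at which phi first exceeded threshold

    def update(self, phi, now):
        """Feed the latest phi reading; return True if the node counts as down."""
        if phi <= self.threshold:
            self.suspect_since = None  # suspicion cleared, restart the clock
            return False
        if self.suspect_since is None:
            self.suspect_since = now
        return now - self.suspect_since >= self.hold_seconds


if __name__ == "__main__":
    gate = SustainedSuspicion(threshold=8.0, hold_seconds=3.0)
    readings = [(0.0, 9.0), (2.0, 9.5), (2.5, 5.0), (3.0, 9.0), (6.0, 11.0)]
    for now, phi in readings:
        print(f"t={now}s phi={phi}: down={gate.update(phi, now)}")
```

The hold period trades a little detection latency for protection against a single brief spike in phi triggering an election.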

Real-world gotchas

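Application-level GC pauses look like network failures to the detector. Long pauses → no heartbeats → high phi → false failure detection. Mitigations: send heartbeats from a dedicated thread, which helps when application work delays them but not during a stop-the-world pause that halts every thread, or correlate suspicion with GC events before acting on it. This is a commonly reported problem in Cassandra deployments.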
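
One way to correlate with pauses on the monitoring side is to notice when the detector's own check loop was stalled and to skip convictions for that round, since every peer will look silent after the monitor itself has been frozen. The sketch below is an assumption-laden illustration: LocalPauseGuard and max_local_pause are hypothetical names, and it only covers pauses on the monitor, not on the node being monitored.

```python
class LocalPauseGuard:
    """Hypothetical guard: ignore high phi right after the monitor itself
    was paused (for example by GC), since the silence may be local."""

    def __init__(self, max_local_pause=5.0):
        self.max_local_pause = max_local_pause  # assumed tolerance, in seconds
        self.last_check = None  # time of the previous call to should_convict

    def should_convict(self, phi, threshold, now):
        """Return True only if phi exceeds the threshold and the gap since
        the previous check shows no sign of a local stall."""
        stalled = (
            self.last_check is not None
            and now - self.last_check > self.max_local_pause
        )
        self.last_check = now
        if stalled:
            return False  # our own loop was late; do not trust this reading
        return phi > threshold


if __name__ == "__main__":
    guard = LocalPauseGuard(max_local_pause=5.0)
    # The check loop normally runs every second; the jump to t=10 mimics a pause.
    for now in (0.0, 1.0, 10.0, 11.0):
        print(f"t={now}s convict={guard.should_convict(20.0, 8.0, now)}")
```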

Fixed timeouts are wrong: they force a choice between fast and accurate. Phi accrual adapts to network behavior, gives a continuous suspicion score, and integrates cleanly into leader election and membership.