Distributed systems constantly have to answer one question: is that node dead? Fixed timeouts have two failure modes: too short → false positives (flapping), too long → slow failure detection. Phi-accrual failure detectors adapt to the observed heartbeat inter-arrival statistics and output a continuous suspicion score instead of a binary alive/dead verdict.
Fixed timeout pitfalls
Set timeout = 5s. A network blip longer than 5s → node falsely marked dead, leader election, churn. Set timeout = 60s → real failures take a minute to detect. No single fixed value is both fast and accurate, because the right value depends on how the network is currently behaving. A fixed time threshold is the wrong abstraction.
Phi accrual model
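Track the inter-arrival times of heartbeats and fit a distribution to them (the original formulation assumes a normal distribution). Compute phi = -log10(P(interval > now - last_arrival)), where the probability is that of a heartbeat arriving this late or later given the history. High phi = the current silence would be very unlikely if the node were alive. Choose an action threshold (e.g., phi > 8 = treat as dead); if the model holds, phi = 8 corresponds to roughly a 10^-8 chance that the verdict is wrong. Adapts: on a stable network phi climbs steeply once a heartbeat is late; on a jittery network the wider distribution makes phi climb more slowly for the same silence, so the detector tolerates more delay.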
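A minimal sketch of the computation described above may help make it concrete. It assumes a normal model of inter-arrival times; the class name `PhiAccrualDetector`, the `min_std` floor, and the timestamps are illustrative assumptions, not taken from any particular library.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Illustrative phi-accrual sketch using a normal model of inter-arrival times."""

    def __init__(self, window_size=1000, min_std=0.1):
        self.intervals = deque(maxlen=window_size)  # sliding window of samples
        self.last_arrival = None
        self.min_std = min_std  # assumed floor so a perfectly regular history has nonzero spread

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; 0.0 until there is any history."""
        if not self.intervals:
            return 0.0
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_arrival
        # P(interval > elapsed) under N(mean, std^2), via the complementary error function
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) after very long silences
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for t in range(11):            # heartbeats at t = 0, 1, ..., 10 seconds
    detector.heartbeat(float(t))
print(detector.phi(10.5))      # shortly after the last heartbeat: close to 0
print(detector.phi(12.0))      # a full interval overdue: well above a threshold of 8
```

Production implementations differ in the distribution they fit and in how they bound the variance, so treat the numbers here as illustrative only.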
Tuning
Keep a sliding window of the N most recent inter-arrival samples (typically 1000). Threshold typically 8-12. Higher threshold = slower detection, fewer false positives. Lower = faster detection, more flapping. Pick it from your acceptable false-positive rate: each +1 in phi cuts the modeled false-positive probability by a factor of 10.
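To get a feel for the trade-off, the sketch below converts a threshold into the length of silence needed to trip it. It assumes inter-arrival times are normally distributed; the function name `silence_until_threshold` and the mean and standard deviation (1.0s and 0.2s) are made-up values for illustration.

```python
from statistics import NormalDist


def silence_until_threshold(threshold, mean, std):
    """Seconds of silence after which phi reaches `threshold` under a normal model."""
    p_later = 10.0 ** -threshold          # phi = -log10(p_later), solved for p_later
    z = -NormalDist().inv_cdf(p_later)    # upper-tail z-score for that probability
    return mean + z * std


for threshold in (8, 10, 12):
    t = silence_until_threshold(threshold, mean=1.0, std=0.2)
    print(f"phi > {threshold}: modeled p = 1e-{threshold}, trips after ~{t:.2f}s of silence")
```

The modeled probability is only as good as the fit: real heartbeat delays often have heavier tails than a normal distribution, so the observed false-positive rate may be higher than the threshold suggests.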
Integration
Phi-accrual outputs a suspicion score, not a binary verdict, so each consumer can apply its own policy. Leader election can use 'phi > X for K seconds'. In Cassandra, the failure detector's phi-based up/down verdict determines when coordinators start storing hints for hinted handoff. Gossip protocols use it to mark nodes as 'down' for membership.
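A 'phi > X for K seconds' policy can be sketched as a small stateful filter over phi samples. The class name `SustainedSuspicion` and the default values below are hypothetical, chosen only to illustrate the idea.

```python
class SustainedSuspicion:
    """Report 'down' only after phi has stayed above a threshold for hold_seconds."""

    def __init__(self, threshold=8.0, hold_seconds=3.0):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.exceeded_since = None  # time phi first crossed the threshold, while still above

    def update(self, phi, now):
        """Feed the latest phi sample; return True once the node should be treated as down."""
        if phi <= self.threshold:
            self.exceeded_since = None  # suspicion cleared, restart the clock
            return False
        if self.exceeded_since is None:
            self.exceeded_since = now
        return now - self.exceeded_since >= self.hold_seconds


policy = SustainedSuspicion(threshold=8.0, hold_seconds=3.0)
for now, phi in [(0.0, 2.0), (1.0, 9.0), (3.0, 9.5), (4.0, 10.0)]:
    print(now, phi, policy.update(phi, now))
```

Requiring the score to stay high for a while trades a few seconds of detection latency for protection against a single spike triggering an expensive action such as an election.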
Real-world gotchas
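Application-level GC pauses look like network failures to the detector. Long pauses → missed heartbeats → high phi → false failure detection. Solution: heartbeat from a dedicated thread so busy application work cannot delay it, or correlate with GC events, since a stop-the-world pause halts every thread, including a dedicated one. This is a well-known problem for Cassandra operators.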
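Pauses hurt on the detecting side too: if the local process stalls, its own checks run late and every peer suddenly looks silent. One hedged way to account for this without hooking into the garbage collector is to notice when the detector's own periodic check ran much later than expected and to distrust phi for that round. The class name `LocalPauseGuard` and the 5-second limit are assumptions made for this sketch.

```python
class LocalPauseGuard:
    """Skip failure verdicts right after the local process itself was stalled."""

    def __init__(self, max_local_pause=5.0):
        self.max_local_pause = max_local_pause  # assumed limit, in seconds
        self.last_check = None

    def should_trust_phi(self, now):
        """Return False if the gap since the previous check suggests a local pause."""
        previous, self.last_check = self.last_check, now
        if previous is None:
            return True  # first check: nothing to compare against
        return (now - previous) <= self.max_local_pause


guard = LocalPauseGuard(max_local_pause=5.0)
for now in (0.0, 1.0, 2.0, 14.0, 15.0):  # the jump from 2.0 to 14.0 simulates a long pause
    print(now, guard.should_trust_phi(now))
```

In a real check loop `now` would come from a monotonic clock such as `time.monotonic()`, and a distrusted round would simply skip marking any peer as down.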