Tom Wilkie's RED (Rate, Errors, Duration) and Brendan Gregg's USE (Utilization, Saturation, Errors) are the two most cited 'what to monitor' frameworks. They answer different questions and are best used together.

RED — for services

Rate: requests per second. Errors: failed requests, usually shown as a percentage of all requests. Duration: the latency distribution (p50, p95, p99), not just an average. One service = three core metrics. If you can't see all three, you can't tell whether the service is healthy.
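
As a rough illustration of the three numbers, the sketch below computes them from a list of request records collected over one time window. The `Request` record, the `red_summary` helper, and the nearest-rank percentile method are assumptions made for this example; production systems usually derive the same values from counters and histograms rather than from raw samples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    failed: bool       # True if the request counts as an error (e.g. HTTP 5xx)


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def red_summary(requests: List[Request], window_s: float) -> Dict[str, Optional[float]]:
    """Summarise one observation window as Rate, Errors and Duration."""
    if not requests:
        # No traffic: the rate is zero, errors and latency are undefined.
        return {"rate_rps": 0.0, "error_pct": None,
                "p50_s": None, "p95_s": None, "p99_s": None}
    durations = sorted(r.duration_s for r in requests)
    failures = sum(r.failed for r in requests)
    return {
        "rate_rps": len(requests) / window_s,
        "error_pct": 100.0 * failures / len(requests),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }


# Made-up sample: 100 requests taking 1 ms to 100 ms, every 20th one failing.
sample = [Request(duration_s=i / 1000, failed=(i % 20 == 0)) for i in range(1, 101)]
print(red_summary(sample, window_s=10.0))
```

Reporting percentiles rather than a mean keeps the slow tail visible, which is what the p95 and p99 figures are for.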

USE — for resources

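Utilization: how busy the resource is, as a percentage of its capacity or time (CPU, memory, disk). Saturation: work the resource cannot serve yet and has to queue (CPU run queue length, I/O queue depth). Errors: count of error events (hardware errors, retries). A resource is one node, disk, or network interface. USE often catches bottlenecks before they cause service-level issues.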
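
The following sketch shows one way the three USE values could be derived for a CPU from two snapshots of cumulative counters. The `CpuSample` fields and the `use_summary` helper are hypothetical names for this example, and treating "runnable threads beyond the core count" as saturation is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CpuSample:
    busy_ticks: int   # cumulative non-idle CPU time
    total_ticks: int  # cumulative CPU time, busy plus idle
    runnable: int     # threads running or waiting to run at sample time
    error_count: int  # cumulative error events (hypothetical counter)


def use_summary(prev: CpuSample, curr: CpuSample, cores: int) -> Dict[str, float]:
    """Derive Utilization, Saturation and Errors between two snapshots."""
    elapsed = curr.total_ticks - prev.total_ticks
    if elapsed <= 0:
        raise ValueError("curr must be taken after prev")
    return {
        # Utilization: share of the interval the CPU spent doing work.
        "utilization_pct": 100.0 * (curr.busy_ticks - prev.busy_ticks) / elapsed,
        # Saturation: runnable threads that had no free core to run on.
        "saturation": float(max(0, curr.runnable - cores)),
        # Errors: new error events since the previous snapshot.
        "errors": float(curr.error_count - prev.error_count),
    }


before = CpuSample(busy_ticks=1000, total_ticks=2000, runnable=3, error_count=3)
after = CpuSample(busy_ticks=1900, total_ticks=3000, runnable=12, error_count=5)
print(use_summary(before, after, cores=8))
```

Note that utilization and saturation are separate signals: a resource can sit at moderate average utilization and still queue work during short bursts.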

Why both

RED tells you that a service is slow. USE tells you why: CPU saturated, disk full, network interface reporting errors. RED is user-facing, USE is operator-facing. SREs need both; application engineers usually only see RED.
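
The "symptom first, cause second" workflow can be sketched as a small triage function: only when a RED signal breaches its target does it scan the USE readings of the underlying resources. The `ResourceReading` type, the function names, and the 90% utilization threshold are all assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceReading:
    name: str
    utilization_pct: float
    saturation: float  # amount of queued work, 0 means nothing is waiting
    errors: int        # error events in the observation window


def suspects(readings: List[ResourceReading],
             max_utilization_pct: float = 90.0) -> List[str]:
    """Return resources whose USE readings look unhealthy, with the reasons."""
    found = []
    for r in readings:
        reasons = []
        if r.saturation > 0:
            reasons.append("saturated")
        if r.errors > 0:
            reasons.append("errors")
        if r.utilization_pct >= max_utilization_pct:
            reasons.append("high utilization")
        if reasons:
            found.append(f"{r.name}: {', '.join(reasons)}")
    return found


def triage(p99_s: float, target_p99_s: float,
           readings: List[ResourceReading]) -> List[str]:
    """RED gives the symptom; USE is consulted only when the symptom appears."""
    if p99_s <= target_p99_s:
        return []  # the service meets its latency target, nothing to explain
    return suspects(readings)


nodes = [
    ResourceReading("node-1 cpu", utilization_pct=97.0, saturation=4, errors=0),
    ResourceReading("node-1 disk", utilization_pct=35.0, saturation=0, errors=0),
    ResourceReading("node-1 eth0", utilization_pct=20.0, saturation=0, errors=12),
]
print(triage(p99_s=1.2, target_p99_s=0.5, readings=nodes))
```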

Implementation

RED: instrument every endpoint with a request counter, an error counter, and a duration histogram (for example with a Prometheus client library). USE: collect host and container metrics with node_exporter and cAdvisor, plus device-level metrics such as NVMe health. Wire both into the same Grafana: one row per service (RED), one row per node (USE).
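
To make the RED half concrete, here is a minimal, standard-library-only sketch of per-endpoint instrumentation: a decorator that counts requests and errors and files each duration into a latency bucket. It is a stand-in for the counter and histogram types a metrics client library would provide, not that library's API; the names `red_instrumented`, `EndpointStats`, `STATS` and the bucket bounds are assumptions for this example.

```python
import time
from bisect import bisect_left
from collections import defaultdict
from functools import wraps

# Upper bounds of the latency buckets in seconds (illustrative values).
BUCKETS = (0.005, 0.025, 0.1, 0.5, 2.5, float("inf"))


class EndpointStats:
    def __init__(self):
        self.requests = 0  # monotonically increasing; rate = change over time
        self.errors = 0
        self.bucket_counts = [0] * len(BUCKETS)

    def observe(self, duration_s, failed):
        self.requests += 1
        self.errors += int(failed)
        # Count the observation in the first bucket whose bound fits it.
        # (Non-cumulative buckets, kept simple for the sketch.)
        self.bucket_counts[bisect_left(BUCKETS, duration_s)] += 1


STATS = defaultdict(EndpointStats)  # one stats object per endpoint name


def red_instrumented(endpoint):
    """Decorator that records rate, errors and duration for one endpoint."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            failed = True  # assume failure until the handler returns normally
            try:
                result = handler(*args, **kwargs)
                failed = False
                return result
            finally:
                STATS[endpoint].observe(time.perf_counter() - start, failed)
        return wrapper
    return decorator


@red_instrumented("/checkout")
def checkout(cart_id):
    return f"order for cart {cart_id}"


checkout(42)
print(STATS["/checkout"].requests, STATS["/checkout"].errors)
```

Here an unhandled exception is what counts as an error; a real service would more likely classify by response status. The sketch is also not thread-safe, which a real client library would handle for you.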

The four golden signals

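The Google SRE book's Four Golden Signals are latency, traffic, errors, and saturation. The first three map onto RED's duration, rate, and errors; saturation is the fourth signal and overlaps with USE, bridging the gap. In short, RED + Saturation = 'Four Golden Signals'. The difference from RED + USE is largely notational; the concept is the same.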

RED per service, USE per resource. Together they cover the 'what + why'. Don't pick one — both are mandatory.