Trace Sampling Strategies Deep Dive

Every trace costs money to store and query. Sampling is unavoidable above a certain volume. Where and how you sample determines whether the traces you keep are the ones you actually need during an incident.

The cost picture

Storage: managed backends charge roughly $1-10 per million spans. At 1B spans/day (about 30B spans/month), keeping 100% costs $30K-300K/month. Sampling at 1% drops that to $300-3,000/month, with little loss of utility for SRE debugging if the sampling is done right.
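
The arithmetic behind those figures can be checked with a small sketch. The function name and the 30-day month are assumptions made for illustration; real pricing usually has more dimensions (retention, indexing, egress).

```python
def monthly_storage_cost(spans_per_day, usd_per_million_spans, sample_rate=1.0, days=30):
    """Estimated monthly storage cost in USD for the spans that are kept."""
    kept_spans = spans_per_day * days * sample_rate
    return kept_spans / 1_000_000 * usd_per_million_spans


# 1B spans/day at the low ($1) and high ($10) end of the price range.
for rate in (1.0, 0.01):
    low = monthly_storage_cost(1_000_000_000, 1, rate)
    high = monthly_storage_cost(1_000_000_000, 10, rate)
    print(f"sample rate {rate:.0%}: ${low:,.0f} to ${high:,.0f} per month")
```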

Head-based sampling

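Decide when the trace starts, at the root span, and propagate the decision downstream. Typically uniform random, e.g., keep 10%. Cheap and simple. Problem: it misses the interesting traces. Errors and slow requests are exactly the ones you want, but the decision is made before the outcome is known, so they are kept at the same low rate as everything else. Because they are rare to begin with, few survive.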
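
A minimal sketch of a head-based decision, assuming the keep/drop choice is derived from the trace ID so that every service seeing the same ID agrees. The names `head_sample` and `ID_BITS`, the 1% error rate, and the simulated traffic are all made up for illustration; this is not any particular SDK's sampler.

```python
import random

ID_BITS = 56  # assumption: treat the low 56 bits of the trace ID as a uniform value


def head_sample(trace_id: int, rate: float) -> bool:
    """Keep/drop decision made once, from the trace ID alone."""
    threshold = int(rate * (1 << ID_BITS))
    return (trace_id & ((1 << ID_BITS) - 1)) < threshold


# Simulate 100,000 traces of which about 1% are errors, then keep 10% at the head.
rng = random.Random(42)
traces = [(rng.getrandbits(128), rng.random() < 0.01) for _ in range(100_000)]
kept = [(tid, is_error) for tid, is_error in traces if head_sample(tid, 0.10)]

errors_total = sum(is_error for _, is_error in traces)
errors_kept = sum(is_error for _, is_error in kept)
print(f"kept {len(kept)} of {len(traces)} traces")
print(f"kept {errors_kept} of {errors_total} error traces")
```

The sampler never looks at the outcome, so error traces are dropped in the same proportion as healthy ones.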

Tail-based sampling at the collector

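Buffer all spans for a trace and decide once the trace completes: keep all error traces, all slow traces (e.g., latency above p99), plus N% of normal traces. This requires a tail-sampling collector tier that holds spans in memory until the trace completes, and all spans of a trace must reach the same collector instance. Usually worth the operational cost.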
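
The decision logic can be sketched as follows. `Span` and `TailSampler` are hypothetical names, the 800 ms threshold stands in for a measured p99, and trace completion is signalled by an explicit `finish` call; a real collector tier typically waits for a timeout instead and has to bound its memory.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


class TailSampler:
    def __init__(self, slow_threshold_ms, baseline_rate, rng=None):
        self.slow_threshold_ms = slow_threshold_ms
        self.baseline_rate = baseline_rate
        self.rng = rng or random.Random()
        self.buffer = defaultdict(list)  # trace_id -> spans held in memory

    def add(self, span):
        self.buffer[span.trace_id].append(span)

    def finish(self, trace_id):
        """Called when the trace is complete; returns the spans to export."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return []
        has_error = any(s.is_error for s in spans)
        # The longest span approximates the end-to-end latency of the trace.
        is_slow = max(s.duration_ms for s in spans) > self.slow_threshold_ms
        if has_error or is_slow or self.rng.random() < self.baseline_rate:
            return spans
        return []


sampler = TailSampler(slow_threshold_ms=800, baseline_rate=0.05)
sampler.add(Span("a", 120))
sampler.add(Span("a", 40, is_error=True))  # trace "a" contains an error
sampler.add(Span("b", 1500))               # trace "b" is slow
print(len(sampler.finish("a")), len(sampler.finish("b")))  # prints: 2 1
```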

Adaptive sampling

Adjust the sample rate based on traffic volume to hit a budget. Traffic bursts automatically lower the sample rate; quiet periods keep more. Vendor tooling (Datadog, Honeycomb) ships this; the OpenTelemetry Collector supports it via processors.
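
One simple way to express the budget idea, assuming the budget is stated in kept traces per second and the rate is recomputed from recently observed traffic. The function name and the floor of 0.1% are illustrative choices, not how any specific vendor implements it.

```python
def adaptive_rate(observed_traces_per_sec, budget_traces_per_sec,
                  min_rate=0.001, max_rate=1.0):
    """Sample rate that keeps roughly `budget_traces_per_sec` traces."""
    if observed_traces_per_sec <= 0:
        return max_rate  # no traffic observed: keep everything
    rate = budget_traces_per_sec / observed_traces_per_sec
    # Clamp so bursts never drop sampling to zero and quiet periods cap at 100%.
    return max(min_rate, min(max_rate, rate))


# Budget of 100 kept traces/sec under quiet, normal, and burst traffic.
for traffic in (50, 1_000, 20_000):
    print(traffic, adaptive_rate(traffic, budget_traces_per_sec=100))
```

In practice the observed rate would come from a moving average, so the sample rate does not swing on every short spike.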

Sampling for SLO incidents

During an incident, the sample rate set yesterday might miss the trace you need now. Pattern: switch to temporary 100% sampling for the duration of the incident, triggered by alerting. A few hours of expensive sampling is cheap relative to the cost of the incident.
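
A sketch of the override, assuming an alert handler calls `start_incident` and the sampler reads `current_rate` for each decision. The class name and the built-in expiry are assumptions; the expiry is there so a forgotten override cannot leave 100% sampling on indefinitely.

```python
import time


class IncidentOverride:
    def __init__(self, normal_rate, clock=time.monotonic):
        self.normal_rate = normal_rate
        self._clock = clock
        self._until = float("-inf")  # no override active

    def start_incident(self, duration_s):
        """Force 100% sampling for the next `duration_s` seconds."""
        self._until = self._clock() + duration_s

    def end_incident(self):
        """Return to the normal rate immediately."""
        self._until = float("-inf")

    def current_rate(self):
        return 1.0 if self._clock() < self._until else self.normal_rate


override = IncidentOverride(normal_rate=0.05)
override.start_incident(duration_s=3 * 3600)  # e.g. called from an alert webhook
print(override.current_rate())
override.end_incident()
print(override.current_rate())
```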

Tail-based for production. Adaptive for cost. Temporary 100% during incidents. Head-based only when a collector tier isn't an option.