OpenTelemetry (OTel) is the open standard that replaces vendor-specific instrumentation (Datadog APM, New Relic agents, and so on). One SDK, one collector, any backend. The promise is real, but there are many moving parts. Here's how they fit.
The three signals
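OTel covers traces (request-spanning timelines), metrics (numbers over time), and logs (text events). All three can carry the same trace context (trace_id, span_id): spans hold it natively, log records include it as fields, and metrics can point to it through exemplars. That lets you pivot from a latency spike to the traces and log lines behind it. This is the killer feature versus legacy stacks.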
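To make the shared context concrete, here is a minimal sketch using only the Python standard library. It parses a W3C `traceparent` header (the format OTel propagates between services by default) and stamps the same IDs onto a log line. The helper names `parse_traceparent` and `format_log` are made up for this illustration; in practice the OTel SDK and its logging integration handle this for you.

```python
import logging
import re

# W3C Trace Context header: version-trace_id-parent_id-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Hypothetical helper: extract trace_id and span_id from a traceparent header."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    # The lowest bit of the flags byte is the "sampled" flag.
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def format_log(context: dict, message: str) -> str:
    """Hypothetical helper: render a log line that carries the trace context."""
    record = logging.makeLogRecord({"msg": message, "levelname": "ERROR", **context})
    formatter = logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    )
    return formatter.format(record)


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(format_log(ctx, "payment declined"))
```

Because the log line and the span share one trace_id, a backend can join them with a plain equality filter.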
Instrumentation: auto + manual
Auto-instrumentation: drop in a library or agent, and common frameworks (Flask, Django, Express, Spring) emit traces with no code changes. Manual: wrap your own business logic in spans, for example with tracer.start_as_current_span() in Python. Start with auto, then add manual spans for the small share of code (roughly 5%) that matters most.
The Collector
The Collector is a standalone binary that receives OTLP from your apps, processes it (sampling, attribute editing, batching), and exports it to any backend. Run it as a sidecar per pod (low overhead, simple) or as a fleet of dedicated nodes (better aggregation, more complex to operate). As a rule of thumb, the sidecar wins below about 100 services.
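The receive, process, export flow maps directly onto the Collector's configuration file. The fragment below is a minimal sketch of a traces pipeline, not a production config: the backend address `backend.example.com:4317` is a placeholder, and a real deployment would likely add TLS settings and further processors.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # default OTLP/gRPC port
      http:
        endpoint: 0.0.0.0:4318   # default OTLP/HTTP port

processors:
  batch:                         # group spans before export

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Switching backends then means changing the `exporters` section, while application code keeps sending OTLP to the same address.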
Backend choice
| Backend | Best for |
|---|---|
| Jaeger | Open-source self-host, traces only |
| Tempo + Prometheus + Loki (Grafana stack) | Self-hosted full-stack |
| Datadog | Hosted, polished UI, expensive |
| Honeycomb | High-cardinality trace querying |
| SigNoz | OTel-native self-host |
Sampling strategy
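100% sampling is expensive: every span is exported to and stored by the backend. Head-based sampling decides at trace start, before the outcome is known. Tail-based sampling collects all spans and decides after the fact based on the outcome (for example, errors and slow requests kept at 100%, normal requests at 1%). Tail is more useful, but the collector must buffer each trace's spans until the decision is made, and all spans of a trace must reach the same collector instance.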
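The difference between the two strategies can be sketched in a few lines of plain Python. This is an illustrative model, not the Collector's or the SDK's actual algorithm: the `Span` class, the 1000 ms slow threshold, and the hash-free ratio check on the low 64 bits of the trace ID are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str            # 32 lowercase hex characters
    duration_ms: float
    is_error: bool = False


def ratio_keep(trace_id: str, ratio: float) -> bool:
    """Deterministic keep/drop derived from the trace ID, so all services agree."""
    bucket = int(trace_id[-16:], 16)           # low 64 bits of the trace ID
    return bucket < int(ratio * (1 << 64))


def head_decision(trace_id: str, ratio: float = 0.01) -> bool:
    """Head-based: decided at trace start, with no knowledge of the outcome."""
    return ratio_keep(trace_id, ratio)


def tail_decision(spans: list[Span], slow_ms: float = 1000.0,
                  baseline: float = 0.01) -> bool:
    """Tail-based: decided after the whole trace has been buffered."""
    if any(s.is_error for s in spans):
        return True                            # keep every failed trace
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True                            # keep every slow trace
    return ratio_keep(spans[0].trace_id, baseline)  # sample the rest at 1%


failed = [Span("f" * 32, 20.0, is_error=True)]
print(head_decision(failed[0].trace_id), tail_decision(failed))
```

Here the head-based sampler drops the failed trace because the trace ID alone falls outside the 1% bucket, while the tail-based sampler keeps it because it can see the error. That is the trade the buffering cost buys.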