A metrics system ingests millions of points per second, lets you query over arbitrary time ranges with low latency, and doesn't cost millions to operate. The architecture: ingest pipeline + time-series store + pre-aggregation rollups + smart query planner.
Ingest pipeline
Agents (Prometheus, Telegraf, OpenTelemetry collectors) push samples to a write-side service. Write service deduplicates, batches, partitions by metric name, writes to time-series store. At 1M points/sec, this stage uses a queue (Kafka or in-process buffer) to absorb bursts.
Time-series store
Specialized for append-heavy + range-query workloads. Prometheus TSDB, InfluxDB, VictoriaMetrics, TimescaleDB. All use compressed columnar layouts (Gorilla, Snappy) and per-series indexes. Query for a metric over a time range is O(log N) to find blocks, then linear scan.
Pre-aggregation rollups
A 24-hour query on raw 1-second samples = 86,400 points per series, slow. Precompute 1-minute, 5-minute, 1-hour rollups (avg/sum/min/max/p99). Query planner picks the coarsest resolution that satisfies the requested range. Sub-second queries even over a year.
Downsampling for retention
Keep raw samples for 7 days, 1-minute rollups for 30 days, 1-hour rollups for 1 year, daily for forever. Storage drops 100x while keeping enough resolution for historical comparison. Configure per-metric — some you keep raw forever (billing).
Cardinality control
'request_count{user_id=alice}' with 10M users = 10M time series. This is what kills metric systems. Hard cap on label cardinality (e.g., max 10K series per metric); reject writes that would exceed. Aggregate high-cardinality dimensions before ingest.