A real-time chat system at million-user scale is a teaching case for almost every distributed-system primitive — connection management, fan-out, ordering, presence, storage, and offline delivery. Let's walk through the design choices and the trade-offs at each layer.

Advertisement

Connection layer

WebSocket terminators behind a load balancer with sticky sessions. Rule of thumb: ~10,000 concurrent connections per termination process (GC pauses become noticeable above this). So 1M concurrent = ~100 termination nodes. Use a connection registry (Redis) mapping user_id → node_id.

Fan-out: pull vs push vs hybrid

Push: when user A sends to group G, server writes the message to N inboxes (one per group member). Great read latency, expensive write for large groups. Pull: store one copy, all members read from the group's timeline. Cheap write, expensive read at scale. Hybrid: push for groups < 100 members, pull for larger. WhatsApp uses hybrid; Twitter timelines use hybrid.

Advertisement

Ordering guarantees

Per-conversation total order is essential — humans notice when messages appear out of order. Use a monotonic sequence number assigned by the broker (Kafka offset works) or a Lamport timestamp if you must distribute. Global order across conversations is unnecessary and expensive — don't claim it.

Message storage

Hot tier (last 30 days): in-memory or SSD-backed, indexed by (conversation_id, sequence). Warm tier (30d-1yr): SSD, compressed. Cold tier (>1yr): object storage (S3) with manifest in DB. Cassandra works well for the hot+warm tiers because of its time-series-friendly clustering.

Presence + typing indicators

Don't persist presence to disk — it's ephemeral. Keep in Redis with a 30-second TTL refreshed on every heartbeat. Typing indicators: send only when the user starts typing and again after a 3-second silence — never every keystroke. Saves 95% of presence traffic.

Offline delivery

When the recipient is offline, queue messages in a per-user inbox (Kafka consumer group keyed by user_id, or Cassandra). On reconnect, the client asks for messages after its last_seen_seq. Don't push notifications to every offline user — most are idle for weeks; let APNs/FCM handle the wake-up.

Connections → fan-out hybrid → tiered storage → presence in Redis with TTL. Each layer scales independently.