YouTube serves personalized video recommendations to 2B+ users. The core architecture is a two-stage funnel: candidate generation (retrieve ~1000 videos from billions) followed by ranking (score and order those candidates). Both stages use neural networks trained on user behavior.
Stage 1: candidate generation
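Goal: from 1B+ videos, retrieve ~1000 candidates for this specific user, in <50ms. Two-tower model: one tower encodes the user (history, demographics), the other encodes the video. Both produce 256-dim embeddings in the same vector space, so a user-video match can be scored with a simple similarity such as a dot product. Retrieval = approximate nearest-neighbor (ANN) search around the user embedding, using a library such as FAISS or ScaNN.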
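The retrieval step can be sketched as follows. This is a minimal illustration, not YouTube's implementation: the embeddings are random stand-ins for tower outputs, the dimension is shrunk from 256 to 8, and an exact brute-force search replaces the ANN index. The names `retrieve`, `video_vecs`, and `user_vec` are hypothetical.

```python
import heapq
import random

EMBED_DIM = 8  # the text uses 256; kept small here for readability


def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec, video_vecs, k):
    """Return the ids of the k videos most similar to user_vec.

    This is an exact brute-force scan. A production system would swap in
    an ANN index (e.g. FAISS or ScaNN) to avoid touching every video.
    """
    scored = ((dot(user_vec, vec), vid) for vid, vec in video_vecs.items())
    return [vid for _, vid in heapq.nlargest(k, scored)]


# Random vectors standing in for the outputs of the two towers.
rng = random.Random(0)
video_vecs = {
    f"v{i}": [rng.gauss(0, 1) for _ in range(EMBED_DIM)] for i in range(1000)
}
user_vec = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]

candidates = retrieve(user_vec, video_vecs, k=10)
```

Because the video tower does not depend on the user, video embeddings can be computed ahead of time; only the user embedding has to be computed per request.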
Stage 2: ranking
Goal: score the ~1000 candidates, return the top 20. Use a much richer model with hundreds of features (user-video interactions, watch context, device, etc.). Multi-task learning: the model predicts click probability, watch time, satisfaction, and share probability simultaneously. The final score is a linear combination of these predictions, weighted by business goals.
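A minimal sketch of the final scoring step, assuming the multi-task model has already produced its four predictions per candidate. The `Predictions` class, the `WEIGHTS` values, and the function names are hypothetical; real weights would be tuned against business goals.

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    """Per-candidate outputs of the multi-task ranking model."""
    p_click: float
    watch_minutes: float
    p_satisfied: float
    p_share: float


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "p_click": 1.0,
    "watch_minutes": 0.1,
    "p_satisfied": 2.0,
    "p_share": 0.5,
}


def combined_score(pred, weights=WEIGHTS):
    """Linear combination of the task predictions."""
    return sum(w * getattr(pred, name) for name, w in weights.items())


def rank(candidates, top_n=20):
    """Order video ids by combined score, highest first, and keep top_n."""
    ordered = sorted(
        candidates,
        key=lambda vid: combined_score(candidates[vid]),
        reverse=True,
    )
    return ordered[:top_n]


candidates = {
    "a": Predictions(0.10, 4.0, 0.5, 0.01),
    "b": Predictions(0.30, 1.0, 0.2, 0.02),
    "c": Predictions(0.05, 12.0, 0.9, 0.00),
}
top = rank(candidates, top_n=2)
```

With these illustrative weights, video "c" ranks first despite the lowest click probability, because its predicted watch time and satisfaction dominate. Changing the weights changes the ordering without retraining the model.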
Why two stages
Single-stage ranking on 1B videos is infeasible: each query would require ~1B evaluations of the ranking model. Two-stage: retrieval is fast and approximate, ranking is slow and precise but only runs on ~1000 candidates. Most large-scale recommender systems use the same pattern.
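A back-of-the-envelope calculation makes the gap concrete. The per-evaluation cost below is a made-up figure (10 microseconds per candidate), used only to show the scale; the corpus and candidate counts come from the text.

```python
# Hypothetical cost of one ranking-model evaluation, in microseconds.
COST_PER_EVAL_US = 10

CORPUS_SIZE = 1_000_000_000  # ~1B videos
CANDIDATES = 1_000           # output of candidate generation


def ranking_cost_us(num_items):
    """Total ranking compute for one query, in microseconds."""
    return num_items * COST_PER_EVAL_US


single_stage_s = ranking_cost_us(CORPUS_SIZE) / 1_000_000  # seconds
two_stage_ms = ranking_cost_us(CANDIDATES) / 1_000         # milliseconds

print(f"single-stage: {single_stage_s:.0f} s of compute per query")
print(f"two-stage ranking: {two_stage_ms:.0f} ms of compute per query")
```

Under this assumption, ranking the full corpus costs about 10,000 seconds of compute per query, while ranking 1000 candidates costs about 10 ms. The exact figures depend on the model and hardware; the six-orders-of-magnitude ratio does not.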
Cold start
New user: no history → recommend popular content in their language/country. New video: no engagement signals → recommend to users whose embedding aligns with the video's content embedding (e.g., NLP features from title + transcript).
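Both fallbacks can be sketched in a few lines. This assumes a precomputed popularity table keyed by (language, country) and a content embedding that lives in the same space as the user embeddings; the table, the function names, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def recommend_for_new_user(popular_by_locale, language, country, k=20):
    """No history: fall back to the most popular videos for the locale."""
    return popular_by_locale.get((language, country), [])[:k]


def audience_for_new_video(content_vec, user_vecs, k):
    """No engagement: pick the k users closest to the content embedding."""
    ranked = sorted(
        user_vecs,
        key=lambda uid: cosine(user_vecs[uid], content_vec),
        reverse=True,
    )
    return ranked[:k]


# Toy data for illustration only.
popular = {("en", "US"): ["v1", "v2", "v3"]}
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0], "u3": [0.9, 0.1]}

new_user_recs = recommend_for_new_user(popular, "en", "US", k=2)
new_video_audience = audience_for_new_video([1.0, 0.0], users, k=2)
```

Once the new user or video accumulates real interactions, the regular two-tower embeddings can take over from these fallbacks.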
Online vs offline training
Offline: nightly batch retraining on billion-row logs. Online: real-time feature serving (last-clicked video, current session). Hybrid: stable embeddings learned offline, fast-changing features (recency) injected at serving time.
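The hybrid setup amounts to merging two sources at request time. A minimal sketch, assuming an in-memory dict stands in for the offline embedding store and a plain dict carries the live session; `OFFLINE_EMBEDDINGS`, `build_features`, and the feature names are hypothetical.

```python
import time

# Stand-in for embeddings produced by the nightly batch job.
OFFLINE_EMBEDDINGS = {"u42": [0.1, -0.3, 0.7]}


def build_features(user_id, session, now=None):
    """Combine stable offline embeddings with fresh session features."""
    now = time.time() if now is None else now
    last_ts = session.get("last_click_ts")
    return {
        # Stable signal: learned offline, looked up at serving time.
        "user_embedding": OFFLINE_EMBEDDINGS.get(user_id),
        # Fast-changing signals: computed from the live session.
        "last_clicked_video": session.get("last_clicked_video"),
        "seconds_since_last_click": None if last_ts is None else now - last_ts,
        "session_length": len(session.get("watched", [])),
    }


session = {
    "last_clicked_video": "v9",
    "last_click_ts": 100.0,
    "watched": ["v1", "v9"],
}
features = build_features("u42", session, now=130.0)
```

A user missing from the offline store gets `None` for the embedding, which is the point where a cold-start fallback would apply.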