YouTube serves personalized video recommendations to 2B+ users. The core architecture is a two-stage funnel: candidate generation (retrieve ~1000 videos from a corpus of 1B+) followed by ranking (score and order those candidates). Both stages use neural networks trained on user behavior.

Stage 1: candidate generation

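Goal: from 1B+ videos, retrieve ~1000 candidates for this user in <50ms. The model is a two-tower network: one tower encodes the user (history, demographics), the other encodes the video. Both produce 256-dim embeddings in the same vector space, so a user-video match reduces to a similarity score between two vectors. Retrieval is then an approximate nearest-neighbor (ANN) search over the video embeddings, using a library such as FAISS or ScaNN.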
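
The retrieval step can be sketched in a few lines. This is a minimal illustration, not YouTube's implementation: the towers are replaced by random embeddings, the corpus is 10,000 videos rather than 1B+, and an exact brute-force dot product stands in for the ANN index. The names `l2_normalize` and `retrieve_top_k` are hypothetical.

```python
import numpy as np

EMBED_DIM = 256  # embedding size stated in the text


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def retrieve_top_k(user_vec: np.ndarray, video_matrix: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring videos, best first (exact search)."""
    scores = video_matrix @ user_vec             # one similarity score per video
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k without a full sort
    return top[np.argsort(-scores[top])]         # order only the k survivors


# Stand-ins for tower outputs (assumption: real embeddings come from trained towers).
rng = np.random.default_rng(0)
video_embeddings = l2_normalize(rng.standard_normal((10_000, EMBED_DIM)).astype(np.float32))
user_embedding = l2_normalize(rng.standard_normal(EMBED_DIM).astype(np.float32))

candidates = retrieve_top_k(user_embedding, video_embeddings, k=1000)
```

At production scale the exact matrix product would be replaced by an ANN index, which trades a small amount of recall for a much lower query cost.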

Stage 2: ranking

Goal: score the ~1000 candidates and return the top 20. The ranker is a much richer model with hundreds of features (user-video interactions, watch context, device, etc.). It is trained with multi-task learning: a single model predicts click probability, watch time, satisfaction, and share probability simultaneously. The final score is a linear combination of these predictions, weighted by business goals.
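
As a rough sketch of the final scoring step, the snippet below combines the four predicted quantities into one score and keeps the top 20. The `Predictions` class, the field names, and every weight value are hypothetical. The text says only that the weights reflect business goals, not what they are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predictions:
    """Outputs of the multi-task ranker for one (user, video) pair."""
    p_click: float
    expected_watch_minutes: float
    p_satisfied: float
    p_share: float


# Hypothetical weights; real values would be tuned against business goals.
WEIGHTS = {
    "p_click": 1.0,
    "expected_watch_minutes": 0.1,
    "p_satisfied": 2.0,
    "p_share": 0.5,
}


def final_score(pred: Predictions, weights: dict[str, float] = WEIGHTS) -> float:
    """Linear combination of the task predictions."""
    return sum(w * getattr(pred, name) for name, w in weights.items())


def rank(candidates: dict[str, Predictions], top_n: int = 20) -> list[str]:
    """Return video ids ordered by final score, highest first."""
    ordered = sorted(candidates, key=lambda vid: final_score(candidates[vid]), reverse=True)
    return ordered[:top_n]


example = {
    "clickbait": Predictions(p_click=0.9, expected_watch_minutes=0.5, p_satisfied=0.1, p_share=0.0),
    "long_form": Predictions(p_click=0.4, expected_watch_minutes=12.0, p_satisfied=0.8, p_share=0.1),
}
ranking = rank(example)
```

With these illustrative weights the long-form video outranks the clickbait one despite its lower click probability, which is the kind of trade-off the weighting is meant to express.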

Why two stages

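Single-stage ranking over 1B videos is infeasible: each query would require ~1B evaluations of the ranking model. The two-stage split avoids this because retrieval is fast and approximate, while ranking is slow and precise but only sees ~1000 candidates. Most large-scale recommender systems follow the same pattern.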
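
A back-of-the-envelope calculation makes the gap concrete. The corpus and candidate counts come from the text; the per-evaluation cost of 10 microseconds is purely an assumed figure for illustration, and the function name is hypothetical.

```python
CORPUS_SIZE = 1_000_000_000  # ~1B videos, from the text
CANDIDATES = 1_000           # ~1000 candidates, from the text
RANKER_COST_US = 10          # assumption: microseconds per ranking-model evaluation


def ranking_cost_us(num_videos: int, cost_per_eval_us: int = RANKER_COST_US) -> int:
    """Total ranking-model time for one query, in microseconds."""
    return num_videos * cost_per_eval_us


single_stage_us = ranking_cost_us(CORPUS_SIZE)
two_stage_us = ranking_cost_us(CANDIDATES)

print(f"single-stage ranking: {single_stage_us / 1_000_000:,.0f} s per query")
print(f"two-stage ranking:    {two_stage_us / 1_000:,.0f} ms per query (plus retrieval)")
```

Under that assumed cost, ranking the full corpus takes 10,000 seconds of model time per query, while ranking 1000 candidates takes 10 ms. The exact numbers depend on the model and hardware, but the six-orders-of-magnitude ratio does not.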

Cold start

New user: no history → recommend popular content in their language/country. New video: no engagement signals → recommend to users whose embedding aligns with the video's content embedding (e.g., NLP features from title + transcript).
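
Both cold-start paths can be sketched as follows. This assumes a precomputed popularity table keyed by (language, country) and a content embedding that lives in the same space as the user embeddings. Both are assumptions, and all function and variable names are hypothetical.

```python
import numpy as np


def recommend_for_new_user(
    popular_by_locale: dict[tuple[str, str], list[str]],
    language: str,
    country: str,
    n: int = 20,
) -> list[str]:
    """New user: fall back to popular videos for their language and country."""
    return popular_by_locale.get((language, country), [])[:n]


def users_for_new_video(
    content_vec: np.ndarray,
    user_matrix: np.ndarray,
    user_ids: list[str],
    n: int = 3,
) -> list[str]:
    """New video: pick the users whose embeddings best align with its content embedding."""
    def unit(x: np.ndarray) -> np.ndarray:
        return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)

    sims = unit(user_matrix) @ unit(content_vec)  # cosine similarity per user
    order = np.argsort(-sims)[:n]
    return [user_ids[i] for i in order]


# Toy data for illustration only.
popular = {("en", "US"): ["v1", "v2", "v3"]}
new_user_recs = recommend_for_new_user(popular, "en", "US", n=2)

user_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
audience = users_for_new_video(np.array([1.0, 0.0]), user_matrix, ["u1", "u2", "u3"], n=2)
```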

Online vs offline training

Offline: nightly batch retraining on billion-row logs. Online: real-time feature serving (last-clicked video, current session). Hybrid: stable embeddings learned offline, fast-changing features (recency) injected at serving time.
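
The hybrid approach amounts to joining two sources at request time. The sketch below assumes an in-memory dict as the offline embedding store and two session features (time since last click, session length). These choices and the function name are illustrative, not taken from the source.

```python
import numpy as np


def build_serving_features(
    user_id: str,
    offline_embeddings: dict[str, np.ndarray],
    last_click_ts: float,
    now_ts: float,
    session_length: int,
) -> np.ndarray:
    """Concatenate a stable offline embedding with fresh per-request features."""
    embedding = offline_embeddings[user_id].astype(np.float32)  # refreshed by the batch job
    seconds_since_click = max(0.0, now_ts - last_click_ts)      # recency, computed online
    fresh = np.array(
        [np.log1p(seconds_since_click), float(session_length)],
        dtype=np.float32,
    )
    return np.concatenate([embedding, fresh])


# Toy store standing in for the output of a nightly batch job.
store = {"user_42": np.ones(4)}
features = build_serving_features(
    "user_42", store, last_click_ts=1_000.0, now_ts=1_060.0, session_length=3
)
```

The embedding changes only when the batch job reruns, while the two trailing features change on every request.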

Two-tower retrieval + multi-task ranking. Cold start via content features. Offline embedding + online features.