Generating video thumbnails at scale is a small system in its own right: extract candidate frames, score them for quality, smart-crop to the target aspect ratios, encode to WebP/AVIF, and deliver via CDN. Each stage has gotchas, especially when serving Netflix-scale catalogs.

Frame extraction

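The simplest approach is one frame per second with FFmpeg: ffmpeg -i input.mp4 -vf fps=1 thumb-%03d.jpg. CPU cost is proportional to video length, since the whole stream is decoded. For long videos, scene detection (-vf "select='gt(scene,0.4)'" -vsync vfr) keeps only frames that differ noticeably from the previous one, so far fewer candidates reach the later stages. Without -vsync vfr (spelled -fps_mode vfr on newer FFmpeg builds), the gaps between selected frames are filled with duplicate images.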
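
As a rough sketch, the two extraction modes above can be wrapped in one helper that builds the FFmpeg argument list. The function name build_extract_cmd and its parameters are illustrative, not part of any library; the flags are the ones from the commands above, plus -vsync vfr for the scene-detection case.

```python
import shlex


def build_extract_cmd(src, out_pattern, scene_threshold=None, fps=1.0):
    """Build an ffmpeg argument list for extracting thumbnail candidates.

    With scene_threshold=None, sample at a fixed rate (fps frames per second).
    Otherwise keep only frames whose scene-change score exceeds the threshold.
    """
    cmd = ["ffmpeg", "-i", src]
    if scene_threshold is None:
        cmd += ["-vf", f"fps={fps:g}"]
    else:
        # The single quotes are filter syntax, not shell syntax: they stop
        # ffmpeg's filter parser from treating the comma as a separator.
        cmd += ["-vf", f"select='gt(scene,{scene_threshold:g})'"]
        # Write only the selected frames instead of padding the gaps with
        # duplicates (newer ffmpeg builds prefer "-fps_mode vfr").
        cmd += ["-vsync", "vfr"]
    cmd.append(out_pattern)
    return cmd


if __name__ == "__main__":
    # Print the commands rather than running them; pass the list to
    # subprocess.run(cmd, check=True) to execute.
    print(shlex.join(build_extract_cmd("input.mp4", "thumb-%03d.jpg")))
    print(shlex.join(build_extract_cmd("input.mp4", "scene-%03d.jpg",
                                       scene_threshold=0.4)))
```

Building a list rather than a shell string avoids a second layer of quoting around the select expression.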

Smart frame selection

Not every frame makes a good thumbnail. Score each candidate on face presence (from a neural face detector), sharpness (variance of the Laplacian), brightness, and absence of motion blur. Then either auto-pick the highest-scoring frame or, in the rarer cases that warrant it, send the top N candidates to human review.
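
To make the sharpness and brightness signals concrete, here is a minimal pure-Python sketch that works on a grayscale image given as a list of rows (values 0 to 255). The function names, the brightness target of 128, and the choice to multiply the two scores are all assumptions for illustration. A production pipeline would more likely use an image library and would add the face and motion-blur signals.

```python
from statistics import mean, pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Sharp images have strong edges, so the Laplacian response varies a lot;
    blurry or flat images give a value near zero. Needs at least 3x3 pixels.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def brightness_score(gray, target=128.0):
    """1.0 when mean luma equals target, falling linearly toward 0.0."""
    avg = mean(v for row in gray for v in row)
    return max(0.0, 1.0 - abs(avg - target) / 128.0)


def pick_best(candidates):
    """Return the name of the candidate with the highest combined score.

    candidates maps a frame name to its grayscale pixel rows. Multiplying
    the two scores is an arbitrary illustrative weighting.
    """
    return max(
        candidates,
        key=lambda name: laplacian_variance(candidates[name])
        * brightness_score(candidates[name]),
    )


if __name__ == "__main__":
    flat = [[128] * 4 for _ in range(4)]
    checker = [[255 * ((x + y) % 2) for x in range(4)] for y in range(4)]
    print(pick_best({"flat": flat, "checker": checker}))
```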

Smart crop for aspect ratios

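Thumbnails are typically needed in several aspect ratios: 16:9 (player), 1:1 (Instagram), 2:3 (poster). Auto-crop so the salient region (a face, the action) stays in frame. A small detection model such as YOLO (objects) or MediaPipe Face Mesh (faces) can stand in for a full saliency map: treat the detected boxes as the region to preserve. Plain center-crop is a poor fallback because the subject is often off-center.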
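
The crop geometry itself is simple once a focus box is available. The sketch below assumes the detector returns a single box as (left, top, right, bottom) in pixels; crop_to_aspect is a hypothetical helper that takes the largest crop of the requested ratio, centers it on the box, and clamps it to the image bounds.

```python
def crop_to_aspect(img_w, img_h, aspect_w, aspect_h, focus_box=None):
    """Return (left, top, width, height) of the largest aspect_w:aspect_h crop.

    The crop is centred on focus_box when given, otherwise on the image
    centre (plain centre-crop), and shifted as needed to stay in bounds.
    """
    # Largest crop with the requested ratio that fits inside the image.
    if img_w * aspect_h > img_h * aspect_w:
        # Image is wider than the target ratio: use the full height.
        crop_h = img_h
        crop_w = img_h * aspect_w // aspect_h
    else:
        # Image is taller than (or equal to) the target: use the full width.
        crop_w = img_w
        crop_h = img_w * aspect_h // aspect_w

    if focus_box is None:
        cx, cy = img_w // 2, img_h // 2
    else:
        x0, y0, x1, y1 = focus_box
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2

    # Centre on the focus point, then clamp so the crop stays in the image.
    left = min(max(cx - crop_w // 2, 0), img_w - crop_w)
    top = min(max(cy - crop_h // 2, 0), img_h - crop_h)
    return left, top, crop_w, crop_h


if __name__ == "__main__":
    face = (1500, 200, 1700, 400)  # made-up detection on a 1920x1080 frame
    for ratio in ((16, 9), (1, 1), (2, 3)):
        print(ratio, crop_to_aspect(1920, 1080, *ratio, focus_box=face))
```

With the off-center face above, the 1:1 crop comes out as (840, 0, 1080, 1080), pushed against the right edge, whereas a center-crop would start at x=420 and cut off the face entirely.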

Format + encoding

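WebP at quality 80 is roughly 30% smaller than JPEG at near-identical visual quality and is supported by all current major browsers. AVIF is typically another 20% smaller, but it is slower to encode and Safari only gained support in version 16. Generate both formats and serve them via <picture>, which lets each browser pick the first format it supports.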
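
A small sketch of the <picture> markup, generated from Python for consistency with a server-side pipeline. The URL scheme (one base URL with .avif, .webp and .jpg suffixes) and the JPEG fallback in the <img> tag are assumptions, as is the helper name picture_markup. Sources are listed smallest format first because the browser uses the first type it can decode.

```python
from html import escape


def picture_markup(base_url, alt, width, height):
    """Return a <picture> element offering AVIF, then WebP, then JPEG."""
    base = escape(base_url, quote=True)
    alt_text = escape(alt, quote=True)
    return (
        "<picture>\n"
        f'  <source srcset="{base}.avif" type="image/avif">\n'
        f'  <source srcset="{base}.webp" type="image/webp">\n'
        # The <img> is required; it is also the fallback for old browsers.
        f'  <img src="{base}.jpg" alt="{alt_text}" '
        f'width="{width}" height="{height}" loading="lazy">\n'
        "</picture>"
    )


if __name__ == "__main__":
    print(picture_markup("https://cdn.example.com/t/abc-640x360",
                         "Episode 3 thumbnail", 640, 360))
```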

CDN delivery

Pre-render the common sizes (320×180, 640×360, 1280×720). For anything else, let the origin generate the image on first request and the CDN cache the result. Use signed URLs to prevent hot-linking. Set a long TTL (90 days or more), since thumbnails rarely change.
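
Signed URLs and the long TTL can be sketched with the standard library alone. The scheme below (an HMAC-SHA256 over the path and an expiry timestamp, carried in expires and sig query parameters) is one common pattern, not any particular CDN's format; real CDNs each define their own signing scheme, so treat the names here as placeholders.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(path, expires_at, secret):
    message = f"{path}?expires={expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(path, secret, expires_at):
    """Append an expiry timestamp and an HMAC signature to a thumbnail path."""
    query = urlencode({"expires": expires_at,
                       "sig": _signature(path, expires_at, secret)})
    return f"{path}?{query}"


def verify_url(url, secret, now=None):
    """Return True if the signature matches and the URL has not expired."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parts.path, expires_at, secret)
    now = time.time() if now is None else now
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(sig.encode(), expected.encode()) and now < expires_at


def cache_control(days=90):
    """Cache-Control value for a long-lived, rarely changing thumbnail."""
    return f"public, max-age={days * 86400}, immutable"


if __name__ == "__main__":
    url = sign_url("/t/abc-640x360.webp", b"demo-secret", int(time.time()) + 3600)
    print(url, verify_url(url, b"demo-secret"))
    print(cache_control())
```

One design point to check against the CDN in use: a signature in the query string is often part of the cache key, so the edge usually needs to validate and strip it, or the long TTL will not produce many cache hits.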

In short: scene-detected frames → quality scoring → smart crop → WebP + AVIF → CDN with a long TTL. Each step is worth automating.