Work stealing is the scheduling algorithm behind many parallel runtimes: Java's ForkJoinPool, Go's scheduler, Tokio's multi-thread runtime, Rayon, .NET's Task Parallel Library. The idea is simple; the careful implementation is what makes it scale.
The problem
M tasks across N worker threads. Static partition (worker K gets tasks K, K+N, K+2N...) hits load imbalance: one worker finishes fast, others are still busy. Centralized queue contends on the head. Work stealing splits the difference.
The algorithm
Each worker has its own deque (double-ended queue). Local pushes/pops on the bottom. When a worker's deque is empty, it steals from another worker's top. Lock-free deque (Chase-Lev) makes the steal cheap.
Why it works
Local work is fast (no contention). Stealing balances load. The deque shape minimizes contention (local end vs steal end are different). Scales to many cores; the algorithm is decades old.
Variants in production runtimes
Tokio multi-threaded: global queue + per-worker queues with LIFO local + FIFO steal. Java ForkJoinPool: classic Chase-Lev deque per worker. Rayon: optimized for CPU-bound parallel ops, splits ranges adaptively. Same core, different tuning.
Where it falls down
Tasks must be balanced enough that stealing helps. Tiny tasks: scheduling overhead dominates. Massive tasks: load imbalance even with stealing. Solutions: chunk small tasks, decompose huge ones. Run profiling; the runtimes have visibility built in.