Work Stealing Explained — Belgavi.AI Lab

Work stealing is the scheduling algorithm behind many parallel runtimes: Java's ForkJoinPool, Go's scheduler, Tokio's multi-thread runtime, Rayon, .NET's Task Parallel Library. The idea is simple; the careful implementation is what makes it scale.

Advertisement

The problem

M tasks across N worker threads. Static partition (worker K gets tasks K, K+N, K+2N...) hits load imbalance: one worker finishes fast, others are still busy. Centralized queue contends on the head. Work stealing splits the difference.

The algorithm

Each worker has its own deque (double-ended queue). Local pushes/pops on the bottom. When a worker's deque is empty, it steals from another worker's top. Lock-free deque (Chase-Lev) makes the steal cheap.

Advertisement

Why it works

Local work is fast (no contention). Stealing balances load. The deque shape minimizes contention (local end vs steal end are different). Scales to many cores; the algorithm is decades old.

Variants in production runtimes

Tokio multi-threaded: global queue + per-worker queues with LIFO local + FIFO steal. Java ForkJoinPool: classic Chase-Lev deque per worker. Rayon: optimized for CPU-bound parallel ops, splits ranges adaptively. Same core, different tuning.

Where it falls down

Tasks must be balanced enough that stealing helps. Tiny tasks: scheduling overhead dominates. Massive tasks: load imbalance even with stealing. Solutions: chunk small tasks, decompose huge ones. Run profiling; the runtimes have visibility built in.

Per-worker deques + cheap local + steal-when-idle. Standard in modern parallel runtimes. Profile if balance is suspected.