Training spends a non-trivial fraction of time on data loading and tokenization. A bad pipeline starves the training loop; a good one keeps the CPU pegged on math. The components are simple but easy to misconfigure.
The pipeline stages
disk → read → decode → tokenize → batch → device → train
 |      |       |         |         |       |
IO   parse decompress    BPE     padding pin_mem

Each stage has its own cost. The worst offenders are tokenizing in the main loop (single-threaded Python is slow), unbatched IO (many small reads), and skipping pinned memory for GPU transfer (slow copies). Profiling each stage reveals the bottleneck quickly.
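One way to find the slow stage is to time each one separately. The sketch below is a minimal illustration using only the standard library; `StageTimer`, `fake_tokenize`, and the toy stages are hypothetical names made up for this example and stand in for the real pipeline.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def report(self):
        """Return (stage, fraction of total time) pairs, slowest first."""
        total = sum(self.totals.values()) or 1.0
        ranked = sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, secs / total) for name, secs in ranked]


def fake_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: one ID per whitespace token.
    return [hash(tok) % 50_000 for tok in text.split()]


def run_demo(num_batches=20):
    timer = StageTimer()
    docs = ["the quick brown fox jumps over the lazy dog"] * 64
    for _ in range(num_batches):
        with timer.stage("tokenize"):
            ids = [fake_tokenize(d) for d in docs]
        with timer.stage("batch"):
            width = max(len(x) for x in ids)
            batch = [x + [0] * (width - len(x)) for x in ids]
        with timer.stage("train"):
            # Placeholder for the forward/backward pass.
            _ = sum(sum(row) for row in batch)
    return timer


if __name__ == "__main__":
    for name, frac in run_demo().report():
        print(f"{name:10s} {frac:6.1%}")
```

If a data stage takes a large share of each step, that stage is the first candidate to move into worker processes or an offline preprocessing pass.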
Multi-worker dataloaders
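from torch.utils.data import DataLoader

# dataset: any torch.utils.data.Dataset instance
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=8,        # 8 CPU processes prefetch in parallel
    pin_memory=True,      # GPU transfer hint (no effect on CPU training)
    prefetch_factor=2,    # each worker queues 2 batches ahead
)

Multiple worker processes feed batches into a queue. With enough workers, the training loop never waits for data. As a starting point, set num_workers to about half the CPU cores, leaving room for the main process. Watch for OOM from worker buffers: every worker holds prefetch_factor batches, so memory use grows with the worker count.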
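The rule of thumb and the buffer cost can be written down as simple arithmetic. This is a rough sketch, not part of any library: `suggest_num_workers` and `prefetch_buffer_bytes` are hypothetical helpers, and `seq_len` and `bytes_per_token` are assumed values for illustration.

```python
import os


def suggest_num_workers(cpu_count=None):
    """Half the cores, at least 1, following the rule of thumb above."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count // 2)


def prefetch_buffer_bytes(num_workers, prefetch_factor, batch_size,
                          seq_len, bytes_per_token=4):
    """Token bytes that can sit in worker queues at one time."""
    batches_in_flight = num_workers * prefetch_factor
    return batches_in_flight * batch_size * seq_len * bytes_per_token


if __name__ == "__main__":
    workers = suggest_num_workers(16)                   # 16 cores -> 8 workers
    size = prefetch_buffer_bytes(workers, 2, 4, 2048)   # assumed seq_len=2048
    print(workers, size)                                # 8 524288
```

The estimate counts only queued token data. Each worker process can also hold its own copy of the dataset object, which may be the larger cost when the dataset keeps a lot of state in memory.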
Pre-tokenization
# Tokenize the whole dataset once, save to disk:
# - faster training (no per-batch tokenize cost)
# - deterministic batches
# - works with mmap for random access

For pretraining, tokenize the entire corpus once into a single large file of int32 IDs, then use mmap to read random spans. This can be roughly 10× faster than online tokenization and is standard practice for serious training runs.
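A minimal sketch of this workflow with the standard library follows. The tokenizer here is a toy (`toy_tokenize` assigns word IDs on first sight) standing in for a real BPE tokenizer, and the file layout (raw native-endian int32 values, EOS after every document) is an assumption for illustration.

```python
import mmap
import os
import random
import tempfile
from array import array

ITEM = array("i").itemsize  # bytes per ID; 4 on mainstream platforms
EOS_ID = 0


def toy_tokenize(text, vocab):
    # Hypothetical stand-in for BPE: each new word gets the next free ID.
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def write_token_file(docs, path):
    """Tokenize every document once and append its IDs (plus EOS) to one file."""
    vocab = {"<eos>": EOS_ID}
    total = 0
    with open(path, "wb") as f:
        for doc in docs:
            ids = array("i", toy_tokenize(doc, vocab) + [EOS_ID])
            ids.tofile(f)
            total += len(ids)
    return total


def read_span(mm, start, length):
    """Decode `length` token IDs starting at token offset `start`."""
    out = array("i")
    out.frombytes(mm[start * ITEM:(start + length) * ITEM])
    return out.tolist()


if __name__ == "__main__":
    docs = ["a b c d", "e f a b", "c c c"]
    path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
    n_tokens = write_token_file(docs, path)
    span = 4
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Only the pages backing the requested span are read from disk.
        start = random.randrange(n_tokens - span + 1)
        print(read_span(mm, start, span))
```

Because the file is a flat array of fixed-width IDs, a token offset maps directly to a byte offset, which is what makes random access cheap.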
Packing tokens into sequences
# Without packing: one sequence per document, lots of padding
# With packing: concatenate documents up to max_seq, separate with EOS
#
# Mask attention so it does not cross document boundaries (intra-document only)

Pretraining packs documents densely, which is a large efficiency gain because no compute is wasted on padding. Packing requires position resets and document-aware attention masking. Production code does this; demos often don't.
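The three pieces (packing, position resets, and the intra-document mask) can be sketched in plain Python. This is an illustrative version under simplifying assumptions: documents arrive as lists of token IDs, the trailing partial row is dropped, and positions also restart at the beginning of each row. The function names are hypothetical.

```python
def positions_from_doc_ids(doc_ids):
    """Position IDs that restart at 0 whenever the document changes."""
    positions, prev, count = [], None, 0
    for d in doc_ids:
        count = count + 1 if d == prev else 0
        positions.append(count)
        prev = d
    return positions


def pack_documents(docs, max_seq, eos_id=0):
    """Concatenate EOS-terminated docs and cut them into rows of max_seq.

    Returns a list of (tokens, positions, doc_ids) tuples, one per row.
    """
    tokens, doc_ids = [], []
    for doc_index, doc in enumerate(docs):
        seq = list(doc) + [eos_id]
        tokens.extend(seq)
        doc_ids.extend([doc_index] * len(seq))
    rows = []
    for start in range(0, len(tokens) - max_seq + 1, max_seq):
        end = start + max_seq
        row_docs = doc_ids[start:end]
        rows.append((tokens[start:end], positions_from_doc_ids(row_docs), row_docs))
    return rows


def document_causal_mask(doc_ids):
    """mask[i][j] is True when query i may attend to key j:
    j is not in the future and both tokens belong to the same document."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
    for toks, pos, ids in pack_documents(docs, max_seq=4):
        print(toks, pos, ids)
    # The second row mixes two documents; its last token sees only itself.
    print(document_causal_mask([1, 1, 1, 2])[3])  # [False, False, False, True]
```

In a real model the boolean mask would be turned into the form the attention implementation expects; the block-diagonal structure is the part that matters here.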
CPU-specific considerations
CPU training shares cores between data loading and compute, so the worker count matters even more than it does in GPU training: too many workers steal cores from the math, too few starve it. Use background prefetch. Avoid per-batch Python overhead by pre-tokenizing and writing batches to disk as tensors. Set persistent_workers=True to avoid paying the worker startup cost at every epoch.
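Background prefetch can be illustrated with a thread and a bounded queue. This is a minimal sketch, not how any particular library implements it; `prefetch` and `depth` are hypothetical names. A thread helps mainly when loading is IO-bound or releases the GIL; CPU-heavy Python preprocessing generally needs worker processes instead.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread reads ahead.

    At most `depth` prepared batches wait in the queue, which bounds memory.
    Sketch limitations: an exception in the producer ends the stream early,
    and a consumer that stops early leaves the daemon thread blocked.
    """
    q = queue.Queue(maxsize=depth)

    def producer():
        try:
            for batch in batches:
                q.put(batch)  # blocks while the queue is full
        finally:
            q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


if __name__ == "__main__":
    for batch in prefetch(iter(range(5)), depth=2):
        print(batch)  # the next batch is loaded while this one is consumed
```

A small `depth` is usually enough: the goal is to hide load latency behind one training step, not to buffer the dataset.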