Dataloader and Tokenization Pipeline

Training spends a non-trivial fraction of time on data loading and tokenization. A bad pipeline starves the training loop; a good one keeps the CPU pegged on math. The components are simple but easy to misconfigure.

Advertisement

The pipeline stages

disk → read → decode → tokenize → batch → device → train
  |       |        |        |          |         |
  IO   parse   decompress   BPE     padding   pin_mem

Each stage has its own cost. Worst offenders: tokenizing on the main loop (single-threaded Python is slow), unbatched IO (lots of small reads), no pinned memory for GPU transfer (slow copy). Profiling reveals bottlenecks quickly.

Multi-worker dataloaders

loader = DataLoader(
    dataset, batch_size=4,
    num_workers=8,           # 8 CPU procs prefetch in parallel
    pin_memory=True,          # GPU transfer hint (no effect on CPU train)
    prefetch_factor=2,        # each worker queues 2 batches ahead
)

Multiple worker processes feed batches into a queue. With enough workers, training never waits for data. Set num_workers ~ CPU cores / 2 (leave room for the main process). Watch for OOM from worker buffers.

Advertisement

Pre-tokenization

# Tokenize the whole dataset once, save to disk:
# - faster training (no per-batch tokenize cost)
# - deterministic batches
# - works with mmap for random access

For pretraining: tokenize the entire corpus to a single big file of int32 IDs. Use mmap to read random spans. ~10× faster than online tokenization. Standard for serious training.

Packing tokens into sequences

# Without packing: sequence per document, lots of padding
# With packing: concatenate documents up to max_seq, separate with EOS
#
# Mask attention to not cross document boundaries (intra-document only)

Pretraining packs documents densely. Massive efficiency gain (no wasted padding). Requires position resets and document-aware attention masking. Production code does this; demos often don't.

CPU-specific considerations

CPU training shares cores between data loading and compute. Worker count is even more important than GPU training. Use background prefetch. Avoid per-batch Python overhead — pre-tokenize, write batches as tensors. Use persistent_workers=True to avoid worker spinup cost each epoch.

Multi-worker dataloaders + pre-tokenization + packing. Avoid Python in the inner loop. Profile to find the bottleneck.