Model weights live as tensors in memory and as files on disk. The disk format determines load speed, mmap-ability, and quantization support. Three formats dominate: SafeTensors, GGUF, and PyTorch's native .pt/.bin. Knowing them helps you debug 'why does loading take 30 seconds'.
SafeTensors
[8-byte header length, little-endian]
[JSON header with tensor metadata]
[raw tensor bytes, contiguous, aligned]
Hugging Face's format. Single file, no Python pickle (avoids arbitrary code execution). Mmap-friendly. Tensors stored in defined order with shapes, dtypes, offsets. Used as the default for most Hugging Face model uploads since 2023.
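The three-part layout above can be read with nothing but the standard library. The following is a minimal sketch, not the official safetensors implementation: `write_minimal_safetensors` and `read_safetensors_header` are hypothetical helpers written for this example, and the writer skips details such as header padding that a real writer may apply. The header keys `dtype`, `shape`, and `data_offsets` are the ones the format uses.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_minimal_safetensors(path, tensors):
    """Write {name: (dtype, shape, raw_bytes)} in the three-part layout."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section, not the file.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte little-endian length
        f.write(header_bytes)
        f.write(payload)


def read_safetensors_header(path):
    """Return (header_dict, data_start) without reading any tensor bytes."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, 8 + header_len


demo_path = Path(tempfile.mkdtemp()) / "demo.safetensors"
write_minimal_safetensors(
    demo_path,
    {"w": ("F32", [2, 2], struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))},
)
header, data_start = read_safetensors_header(demo_path)
print(header["w"], data_start)
```

Because the header is plain JSON with explicit byte ranges, a loader can list every tensor's shape and dtype after reading only a few kilobytes, then seek or mmap straight to the tensors it needs.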
GGUF — for llama.cpp
[magic 'GGUF' + version]
[metadata KV pairs: arch, hparams, tokenizer]
[tensor info: name, shape, dtype, offset]
[aligned tensor data]
Self-describing: metadata, tokenizer, and weights in one file. Mmap-friendly. Supports quantization variants (Q4_K_M, Q5_K_M, etc.) inline. Used by llama.cpp, Ollama, LM Studio. The de facto local LLM format.
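As a rough illustration of the first line of that layout, the sketch below parses the fixed-size start of a GGUF file. It assumes a little-endian file using the version 2 or later layout, where the magic and a 32-bit version are followed by 64-bit tensor and metadata counts. `parse_gguf_preamble` is a name made up for this example, and the bytes it parses are synthetic, not taken from a real model.

```python
import struct

GGUF_MAGIC = b"GGUF"


def parse_gguf_preamble(buf):
    """Parse magic, version, and the two counts at the start of a GGUF file."""
    if buf[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {"version": version, "tensor_count": tensor_count, "kv_count": kv_count}


# Synthetic preamble for illustration: version 3, 291 tensors, 24 metadata pairs.
fake_preamble = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
info = parse_gguf_preamble(fake_preamble)
print(info)
```

The metadata pairs and tensor info records that follow are variable-length, so a full reader has to walk them in order before it knows where the aligned tensor data begins.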
PyTorch native (.bin / .pt)
Uses Python's pickle protocol, so loading an untrusted file can execute arbitrary code: a security risk. Slower to load than SafeTensors. Tensors are stored individually, and checkpoints follow a file naming convention (pytorch_model.bin for a single file, pytorch_model-00001-of-00004.bin and so on when sharded). Being phased out in favor of SafeTensors.
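To make the security risk concrete, here is a small, harmless demonstration of why unpickling is code execution. It uses only the standard `pickle` module, not PyTorch. The `Payload` class and the `marker` path are made up for this example, and the "attack" does nothing more than create an empty directory in a temp folder.

```python
import os
import pickle
import tempfile

marker = os.path.join(tempfile.mkdtemp(), "created_by_unpickling")


class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call os.makedirs(marker)".
        # A malicious file could name any importable callable here instead.
        return (os.makedirs, (marker,))


blob = pickle.dumps(Payload())
existed_before = os.path.exists(marker)
loaded = pickle.loads(blob)  # runs os.makedirs as a side effect of loading
print(existed_before, os.path.exists(marker), loaded)
```

The call happens inside `pickle.loads` itself, before the caller gets a chance to inspect what was loaded. A format that stores only a JSON header and raw bytes has no equivalent hook.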
Loading speed: mmap is the trick
# Naive: read all bytes into memory, then construct tensors
# Slow for big models, requires 2× RAM during load
# mmap: map file directly into virtual memory
weights = mmap_open('model.safetensors')
tensor = create_tensor_view(weights, offset, shape)
# No copy: pages fault in on access
With mmap, cold-start time is the time to read the bytes you actually use, paged in by the OS. For dense inference, every token touches every weight, so the full model ends up being read from disk anyway. But process start is fast even on a slow disk, because nothing is read until it is first accessed.
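The pseudocode above can be approximated with Python's standard `mmap` module. This is a minimal sketch under stated assumptions: the file is a made-up flat array of little-endian float32 values rather than a real model format, and `read_slice_mmap` is a hypothetical helper.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical flat weights file: 100,000 little-endian float32 values 0, 1, 2, ...
n_values = 100_000
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{n_values}f", *range(n_values)))


def read_slice_mmap(path, offset_bytes, count):
    """Return `count` float32 values starting at `offset_bytes`, via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Mapping the file reads nothing; only the pages that cover
            # this byte range are faulted in when they are accessed.
            return struct.unpack_from(f"<{count}f", mm, offset_bytes)


values = read_slice_mmap(path, offset_bytes=4 * 500, count=3)
print(values)  # (500.0, 501.0, 502.0)
```

Note that `struct.unpack_from` copies the three values out into a tuple. A real loader would instead build a tensor view directly over the mapped buffer, so the weights themselves are never copied.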
Sharding for big models
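Models larger than a few GB are usually sharded across multiple files. Hugging Face uses names like model-00001-of-00007.safetensors plus an index file (model.safetensors.index.json) that maps each tensor name to the shard file containing it. SafeTensors and GGUF checkpoints can both be sharded. Sharding allows partial download and partial load: you fetch or open only the shards that hold the tensors you need.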
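The index lookup can be sketched as follows. The JSON mirrors the shape of a Hugging Face index file, with a `weight_map` from tensor name to shard filename, but the tensor names, shard count, and `total_size` value are invented for illustration. `shards_for` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Illustrative index contents; the values are made up for this example.
index_json = """
{
  "metadata": {"total_size": 123},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors"
  }
}
"""


def shards_for(index, wanted_prefix):
    """Group tensor names starting with wanted_prefix by the shard holding them."""
    by_shard = defaultdict(list)
    for name, shard in index["weight_map"].items():
        if name.startswith(wanted_prefix):
            by_shard[shard].append(name)
    return dict(by_shard)


index = json.loads(index_json)
layer0 = shards_for(index, "model.layers.0.")
print(layer0)
```

Here only the first shard is needed to load layer 0, so the second file never has to be downloaded or opened for that request.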