GGUF is the model file format used across the llama.cpp ecosystem (Ollama, LM Studio, koboldcpp), and it is how quantized models are usually distributed. Understanding it helps you pick the right quantization variant, debug load failures, and reason about quality trade-offs without reading the source.
File structure
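A GGUF file has three parts, in order: a header with key-value metadata (architecture, vocabulary, hyperparameters), a tensor index listing each tensor's name, shape, type, and offset, and then the tensor data itself. Because all metadata sits at the front, a loader can read the header and index, then memory-map the tensor data without parsing the whole file.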
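As a rough illustration of why the upfront layout is cheap to inspect, here is a minimal sketch that reads only the fixed-size start of a file. It assumes the GGUF version 2 and later layout (4-byte magic "GGUF", then a uint32 version, a uint64 tensor count, and a uint64 metadata key-value count, all little-endian). The function name parse_gguf_header is made up for this example, and the sample bytes are synthetic rather than taken from a real model.

```python
import struct

GGUF_MAGIC = b"GGUF"
FIXED_HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def parse_gguf_header(buf: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumed v2+ little-endian layout)."""
    if len(buf) < FIXED_HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read the fixed header")
    if buf[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file, magic was {buf[:4]!r}")
    # '<' = little-endian with no padding, I = uint32, Q = uint64
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration only (values are arbitrary).
sample = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(sample)
print(header)
```

On a real file you would pass the first 24 bytes read in binary mode. A bad magic value at this stage is often the quickest way to tell that a download is truncated or is not GGUF at all.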
Quantization variants
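Names such as Q4_K_M, Q5_K_S, and Q8_0 encode the scheme. The number after Q is the nominal bit width per weight. 'K' marks a k-quant, while a '_0' or '_1' suffix marks the older, simpler block formats. On k-quants, a trailing S, M, or L (small, medium, large) selects how many tensors are kept at higher precision: at the same bit width, L is larger and higher quality than M, and M than S. Q4_K_M is the most common 'good balance' choice for 7B-13B models.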
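The naming scheme is regular enough to parse mechanically. The sketch below splits a variant name into its parts; parse_quant_name and QuantName are hypothetical helpers written for this example, and the pattern covers only the name shapes mentioned here (Qn_K, Qn_K_S/M/L, Qn_0, Qn_1).

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    bits: int              # nominal bit width per weight
    k_quant: bool          # True if the name contains 'K'
    suffix: Optional[str]  # 'S'/'M'/'L' for k-quants, '0'/'1' otherwise, or None


# Q<digit>_K, optionally followed by _S/_M/_L, or Q<digit>_0 / Q<digit>_1
_PATTERN = re.compile(r"^Q(\d)_(?:(K)(?:_([SML]))?|([01]))$")


def parse_quant_name(name: str) -> QuantName:
    m = _PATTERN.match(name.upper())
    if m is None:
        raise ValueError(f"unrecognized quant name: {name!r}")
    bits = int(m.group(1))
    if m.group(2):  # matched the 'K' branch
        return QuantName(bits, True, m.group(3))
    return QuantName(bits, False, m.group(4))


for n in ("Q4_K_M", "Q5_K_S", "Q8_0"):
    print(n, parse_quant_name(n))
```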
K-quants explained
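K-quants use mixed precision: most tensors are stored at one bit width, while the tensors that matter most to output quality (parts of attention and the output layer) are stored at a higher one. The 'M' variants promote more of these tensors than 'S' does, typically by 1-2 bits. As a rough figure, quality per byte is about 20% better than uniform 4-bit quantization.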
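To see what mixed precision costs on disk, it helps to compute a weighted average of bits per weight. The sketch below does that. The per-weight costs (4.5 bits for a 4-bit k-quant block type, 6.5625 bits for a 6-bit one, both including per-block scale overhead) are assumed values, and the 85/15 split is purely hypothetical, not the actual recipe of any named variant.

```python
def effective_bits_per_weight(mix):
    """mix: list of (fraction_of_weights, bits_per_weight) pairs summing to 1.0."""
    total = sum(fraction for fraction, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"fractions must sum to 1.0, got {total}")
    return sum(fraction * bits for fraction, bits in mix)


# Assumed storage costs, including per-block scales.
BPW_4BIT = 4.5
BPW_6BIT = 6.5625

uniform = effective_bits_per_weight([(1.0, BPW_4BIT)])
# Hypothetical split: 15% of weights promoted to the 6-bit type.
mixed = effective_bits_per_weight([(0.85, BPW_4BIT), (0.15, BPW_6BIT)])

print(f"uniform: {uniform:.3f} bits/weight")
print(f"mixed:   {mixed:.3f} bits/weight")
print(f"size increase: {100 * (mixed / uniform - 1):.1f}%")
```

Under these assumptions, promoting a small share of the weights raises file size by only a few percent, which is why a mixed scheme can be a good trade if those tensors carry a disproportionate share of the quality.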
Choosing a variant
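Q8_0: nearly lossless (~0.1% quality drop), good for evaluation. Q5_K_M: ~0.5% drop, a good choice for 13B+ models when the file still fits in memory. Q4_K_M: ~1-2% drop, the workhorse for laptop inference. Q3_K_S: a noticeable drop; use it only when memory is tight. Treat the percentages as rough rules of thumb, since the real loss varies by model and task.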
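The selection rule above amounts to "take the highest-quality variant that fits". A minimal sketch of that rule follows. The bits-per-weight numbers in APPROX_BPW are rough, model-dependent assumptions, and pick_variant and estimate_size_gb are hypothetical helpers, not part of any llama.cpp tooling.

```python
# Approximate bits per weight, ordered from highest quality to smallest.
# These are assumed ballpark values; real files vary by architecture.
APPROX_BPW = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_S": 3.5,
}


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimated file size in GB (10^9 bytes) for a given parameter count."""
    return n_params * bits_per_weight / 8 / 1e9


def pick_variant(n_params: float, budget_gb: float):
    """Return the first (highest-quality) variant whose estimate fits the budget."""
    for name, bpw in APPROX_BPW.items():  # dicts preserve insertion order
        if estimate_size_gb(n_params, bpw) <= budget_gb:
            return name
    return None  # nothing fits; consider a smaller model


print(pick_variant(7e9, 6.0))
print(pick_variant(13e9, 8.0))
```

The budget here should be what is left after reserving room for the KV cache and the rest of the system, not the machine's total RAM.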
Loading and inference
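mmap maps the file into virtual memory, and pages are faulted in from disk on first access. RAM usage is roughly the quantized model size plus the KV cache, and the KV cache grows with context length. On a slow disk the first (cold) load is the bottleneck; once the pages are in the OS page cache, later loads and inference runs are fast.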
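The RAM estimate can be made concrete with the standard KV cache size formula: two tensors (keys and values) per layer, each holding one vector per context position per KV head. The sketch below assumes an unquantized 16-bit cache and uses a 7B-class shape (32 layers, 32 KV heads, head dimension 128) and a 4 GiB model file purely as illustrative inputs.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache: keys + values for every layer and context position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def estimate_ram_gib(model_file_bytes, **kv_args):
    """Rough resident memory once every page has been touched, in GiB."""
    return (model_file_bytes + kv_cache_bytes(**kv_args)) / 2**30


# Illustrative shape, assumed for this example.
shape = dict(n_layers=32, n_ctx=4096, n_kv_heads=32, head_dim=128)

cache_gib = kv_cache_bytes(**shape) / 2**30
total_gib = estimate_ram_gib(4 * 2**30, **shape)

print(f"KV cache: {cache_gib:.1f} GiB")
print(f"model + KV cache: {total_gib:.1f} GiB")
```

Halving the context length halves the cache, so when a model almost fits, reducing the context is often a cheaper fix than dropping to a lower-quality quantization.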