GGUF Format Explained — Belgavi.AI Lab

GGUF is the file format for quantized models in the llama.cpp ecosystem (Ollama, LM Studio, koboldcpp). Understanding it helps you pick the right quantization variant, debug load failures, and reason about quality trade-offs without reading the source.

File structure

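A GGUF file has three parts, in order: a header with key-value metadata (architecture, vocabulary, hyperparameters), a tensor index that records each tensor's name, shape, type, and offset, and the tensor data itself. Because all metadata comes first, a loader can read the header and index, then memory-map the tensor data without parsing the whole file.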
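To make the layout concrete, here is a minimal sketch of reading the fixed-size start of the header. It assumes a little-endian file at GGUF version 2 or later, where the 4-byte magic is followed by a 32-bit version and two 64-bit counts. The function name parse_gguf_header is made up for this example and is not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"
PREAMBLE_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def parse_gguf_header(buf):
    """Parse the fixed-size GGUF preamble from a bytes-like object.

    Assumes little-endian layout and GGUF version >= 2 (64-bit counts).
    """
    if len(buf) < PREAMBLE_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    magic = bytes(buf[:4])
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # "<" = little-endian, no padding; I = uint32, Q = uint64.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

The function accepts any bytes-like object, so it should work equally on the first 24 bytes read from a file or on a memory-mapped view. A wrong magic value is a quick first check when a model refuses to load.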

Quantization variants

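Variant names such as Q4_K_M, Q5_K_S, and Q8_0 encode two things. The number after Q is the nominal bit width per weight. The suffix names the scheme: K marks a k-quant, and a trailing S, M, or L (small, medium, large) says how much of the model the mix keeps at higher precision, while _0 and _1 mark the older uniform block formats. Within one bit width, quality and file size both rise from S to M to L. Q4_K_M is the most common 'good balance' choice for 7B-13B models.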
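The naming scheme is regular enough to parse mechanically. The sketch below is an illustration only: describe_quant is a hypothetical helper, and it covers just the Q<bits>_K[_S|_M|_L] and Q<bits>_0/_1 patterns discussed here, not every type name that appears in the wild.

```python
import re

# Q<bits>_K with an optional _S/_M/_L mix, or Q<bits>_0 / Q<bits>_1 (legacy).
_QUANT_RE = re.compile(
    r"^Q(?P<bits>\d)_(?:(?P<k>K)(?:_(?P<mix>[SML]))?|(?P<legacy>[01]))$"
)


def describe_quant(name):
    """Split a quantization variant name into its parts."""
    m = _QUANT_RE.match(name.upper())
    if m is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    bits = int(m["bits"])
    if m["k"]:
        # mix is None for names such as Q6_K that carry no size suffix.
        return {"bits": bits, "family": "k-quant", "mix": m["mix"]}
    return {"bits": bits, "family": "legacy", "mix": None}
```

For example, describe_quant("Q4_K_M") reports a 4-bit k-quant with the medium mix, while describe_quant("Q8_0") reports an 8-bit legacy format.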

K-quants explained

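K-quants are mixed-precision: most tensors are stored at the nominal bit width, while the tensors that matter most for quality (attention, output) are stored at a higher one. The M variant promotes more of those important tensors than the S variant does, typically by 1-2 bits. As a rough guide, quality per byte is about 20% better than uniform Q4 quantization.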
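Mixed precision means the real bits per weight of a file is a weighted average, which is why a "4-bit" k-quant lands somewhat above 4 bits. The sketch below shows the arithmetic. The 85/15 split and the per-type figures are assumptions chosen for illustration, not the actual recipe of any particular variant.

```python
def effective_bits_per_weight(mix):
    """Weighted average bit width of a mixed-precision model.

    mix is a list of (fraction_of_weights, bits_per_weight) pairs
    whose fractions must sum to 1.
    """
    total = sum(fraction for fraction, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in mix)


# Hypothetical split: 85% of weights in a ~4.5-bit type,
# 15% (the "important" tensors) in a ~6.56-bit type.
example_mix = [(0.85, 4.5), (0.15, 6.5625)]
example_bpw = effective_bits_per_weight(example_mix)
```

Under these assumed numbers the average comes out a little under 5 bits per weight, so promoting a small share of tensors costs far less disk than raising the whole model by a bit.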

Choosing a variant

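Q8_0 is nearly lossless (~0.1% quality drop) and suits evaluation. Q5_K_M costs ~0.5% and suits 13B+ models when they fit in memory. Q4_K_M costs ~1-2% and is the workhorse for laptop inference. Q3_K_S shows a noticeable drop; use it only when memory is tight. Treat these percentages as rough rules of thumb, since the actual loss varies by model and benchmark.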
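The selection rule above amounts to "take the highest-quality variant that fits". Here is a minimal sketch of that rule. The bits-per-weight figures in APPROX_BPW are approximate assumptions, and pick_variant is a hypothetical helper; check the real file sizes of the model you download.

```python
# Approximate bits per weight, ordered from highest quality to lowest.
# These figures are assumptions for illustration; actual files vary.
APPROX_BPW = [
    ("Q8_0", 8.5),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.85),
    ("Q3_K_S", 3.5),
]


def estimate_file_gb(n_params, bits_per_weight):
    """Estimated file size in GB (10**9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def pick_variant(n_params, budget_gb):
    """Return the highest-quality variant whose estimate fits the budget."""
    for name, bpw in APPROX_BPW:
        if estimate_file_gb(n_params, bpw) <= budget_gb:
            return name
    return None  # nothing fits; consider a smaller model
```

The budget here covers the model weights only. Leave headroom for the KV cache and the rest of the system when choosing it.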

Loading and inference

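The loader memory-maps the file with mmap, so it appears as virtual memory and pages fault in from disk on demand. Steady-state RAM usage is roughly the quantized model size plus the KV cache, which grows with context length. A slow disk bottlenecks the cold load; later runs are fast because the pages are already in the OS page cache.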
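The RAM estimate can be worked through in a few lines. This sketch assumes a standard transformer KV cache stored at 16 bits per element and ignores compute buffers and runtime overhead. The function names are hypothetical, and the shape used in the example (32 layers, 32 KV heads, head dimension 128) is an assumed 7B-class configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def estimate_ram_bytes(model_file_bytes, n_layers, n_kv_heads, head_dim, n_ctx):
    """Rough steady-state RAM: quantized weights plus KV cache."""
    return model_file_bytes + kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx)


# Assumed 7B-class shape at a 4096-token context.
example_kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=4096)
example_kv_gib = example_kv / 1024**3
```

The cache scales linearly with context length, so doubling the context doubles this term while the weight term stays fixed. Models that use fewer KV heads than attention heads shrink it accordingly.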

In short: Q4_K_M is the default for laptop LLM inference, Q5_K_M is worth it when you have spare RAM, and Q8_0 is for evaluation. At the same file size, k-quants beat uniform quantization.