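Full fine-tuning updates every parameter, which is memory-heavy because gradients and optimizer state are kept for all of them. LoRA inserts small trainable matrices alongside the frozen weights, which cuts gradient and optimizer-state memory by orders of magnitude (the frozen weights themselves still have to fit in memory). QLoRA adds 4-bit quantization of the frozen base, which is enough to fit fine-tuning of a 70B-class model on a single high-memory GPU. The math is small.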
LoRA — low-rank update
# Original weight W (frozen):
h = x · W # W has shape [d_in, d_out]
# LoRA: add low-rank trainable delta
h = x · W + x · A · B
# A ∈ ℝ^(d_in × r), B ∈ ℝ^(r × d_out)
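# r << min(d_in, d_out), typically 8-64
A and B are small: together they hold r · (d_in + d_out) parameters instead of d_in · d_out. With d_in = d_out = 2048 and r = 16, that is 65,536 versus 4,194,304, a 64x reduction for that matrix. W stays frozen. Forward cost: two extra small matmuls (x · A, then the result times B). Gradients and optimizer state are kept ONLY for A and B, so the memory savings are large: the optimizer state for more than 99% of the parameters is gone.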
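The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training implementation: the initialization scales, the batch size, and the choice to zero-initialize B (so the adapter starts as a no-op) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 2048, 2048, 16

W = rng.standard_normal((d_in, d_out)) * 0.02  # frozen base weight
A = rng.standard_normal((d_in, r)) * 0.01      # trainable, random init (assumed scale)
B = np.zeros((r, d_out))                       # trainable, zero init so the delta starts at 0


def lora_forward(x, W, A, B):
    """h = x·W + (x·A)·B. Multiplying by A first avoids building a [d_in, d_out] delta."""
    return x @ W + (x @ A) @ B


x = rng.standard_normal((4, d_in))             # batch of 4 inputs
h = lora_forward(x, W, A, B)

full_params = W.size                           # parameters in the frozen matrix
lora_params = A.size + B.size                  # parameters that would receive gradients
reduction = full_params / lora_params
print(h.shape, full_params, lora_params, reduction)
```

Because B starts at zero, the output at initialization is identical to the frozen model's output; training then moves only A and B.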
Why low-rank works
Fine-tuning updates often have low intrinsic rank: most of the useful change lies in a small subspace of the full weight space. Hu et al. (2021) showed empirically that small ranks capture most of the fine-tuning gains for transformers, and r = 8-32 is a common choice in practice. Higher rank gives diminishing returns.
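To make the low-rank idea concrete, here is a small synthetic sketch. It builds a hypothetical weight update that is rank 4 plus a little dense noise, then measures how well a rank-r truncation reconstructs it. The matrix is made up for illustration and says nothing about real fine-tuning runs; it only shows what "most of the update lies in a small subspace" means numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 64, 4

# Hypothetical weight update: a rank-4 signal plus small dense noise.
delta = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))
delta += 0.01 * rng.standard_normal((d, d))

U, S, Vt = np.linalg.svd(delta, full_matrices=False)


def rank_r_error(r):
    """Relative Frobenius error of the best rank-r approximation of delta."""
    approx = (U[:, :r] * S[:r]) @ Vt[:r, :]
    return np.linalg.norm(delta - approx) / np.linalg.norm(delta)


errors = {r: rank_r_error(r) for r in (1, 2, 4, 8)}
for r, err in errors.items():
    print(f"rank {r}: relative error {err:.4f}")
```

Once r reaches the rank of the underlying signal, raising it further only fits the noise, which is the diminishing-returns pattern described above.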
QLoRA — quantized base + LoRA
Take W and quantize it to 4 bits (NF4 format). The forward pass computes h = x · dequant(W) using a fused kernel, then adds x · A · B. The trainable A and B stay in BF16. Result: a 7B model can be fine-tuned in roughly 6 GB of GPU memory, which made fine-tuning accessible on consumer hardware.
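The quantize-then-adapt flow can be sketched as follows. This is a simplified stand-in, not NF4: it uses 16 evenly spaced levels with one absmax scale per block, whereas NF4 uses non-uniform levels. The block size, the function names, and the use of one uint8 per code (real kernels pack two 4-bit codes per byte) are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, block = 64, 64, 8, 32

# 16 evenly spaced levels in [-1, 1]; NF4 uses non-uniform levels instead.
LEVELS = np.linspace(-1.0, 1.0, 16)


def quantize_4bit(W, block):
    """Blockwise absmax quantization: one scale per block, one 4-bit code per weight."""
    flat = W.reshape(-1, block)
    scales = np.abs(flat).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    # Pick the nearest of the 16 levels for every normalized weight.
    codes = np.abs(flat[..., None] / scales[..., None] - LEVELS).argmin(axis=-1)
    return codes.astype(np.uint8), scales


def dequantize_4bit(codes, scales, shape):
    """Map codes back to floats and restore the original matrix shape."""
    return (LEVELS[codes] * scales).reshape(shape)


W = rng.standard_normal((d_in, d_out)) * 0.02
codes, scales = quantize_4bit(W, block)     # stored form of the frozen base
A = rng.standard_normal((d_in, r)) * 0.01   # trainable adapter, kept in higher precision
B = np.zeros((r, d_out))

x = rng.standard_normal((2, d_in))
W_hat = dequantize_4bit(codes, scales, W.shape)
h = x @ W_hat + (x @ A) @ B                 # h = x · dequant(W) + x · A · B

max_err = np.abs(W - W_hat).max()
print(h.shape, codes.max(), max_err)
```

Only the codes and per-block scales need to be stored for the base, and gradients never flow into them; A and B are the only tensors an optimizer would touch.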
Inference: merge or stack
# Option 1 — merge:
# W_final = W + A · B # combine offline
# Use W_final at inference; no LoRA overhead
# Option 2 — stack:
# Keep A, B separate; switch them at inference time
# Allows per-request adapter selection (multi-tenant)
Merge is the cleanest choice for single-purpose deployment. Stack lets you serve many fine-tuned variants from one base. vLLM and llama.cpp support both.
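The two options can be compared directly in a small NumPy sketch. The adapter names ("support", "legal") and the helper functions are hypothetical, and the adapters are random matrices standing in for trained ones; this does not reflect how any particular serving framework implements adapter switching.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 32, 4
W = rng.standard_normal((d_in, d_out))      # shared frozen base weight


def make_adapter():
    # Nonzero B here, standing in for an adapter that has already been trained.
    A = rng.standard_normal((d_in, r)) * 0.1
    B = rng.standard_normal((r, d_out)) * 0.1
    return A, B


# Hypothetical adapter names for two fine-tuned variants.
adapters = {"support": make_adapter(), "legal": make_adapter()}


def merge(W, A, B):
    """Option 1: fold the adapter into the base weight offline."""
    return W + A @ B


def forward_stacked(x, W, adapters, name):
    """Option 2: keep the base shared and pick an adapter per request."""
    A, B = adapters[name]
    return x @ W + (x @ A) @ B


x = rng.standard_normal((3, d_in))
W_support = merge(W, *adapters["support"])
h_merged = x @ W_support                    # single matmul, no adapter overhead
h_stacked = forward_stacked(x, W, adapters, "support")
h_other = forward_stacked(x, W, adapters, "legal")

print(np.allclose(h_merged, h_stacked), np.allclose(h_stacked, h_other))
```

Merging gives the same outputs as the stacked path for that adapter, so the choice is operational: merged weights cost one full copy of W per variant, while stacked adapters cost only A and B per variant plus the extra small matmuls.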
Parameter count
# r=16, d=2048, target only Q and V proj per layer (32 layers):
# Per layer: 2 * (2048*16 + 16*2048) = 131K
# Total LoRA params: 32 * 131K = 4.2M
# vs full model: ~1B+
# Reduction: ~240x fewer trainable params
With roughly 0.4% of the parameters trainable, you can adapt a model to a new domain in hours on consumer hardware. This is the standard practical fine-tuning recipe in 2026.
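The arithmetic above is easy to reproduce for other configurations. The helper below is a hypothetical convenience function, and the 1B base-model size is taken from the rough "~1B+" figure in the example rather than from any specific model.

```python
def lora_param_count(d_in, d_out, r, targets_per_layer, n_layers):
    """Trainable LoRA parameters: one (A, B) pair per targeted matrix."""
    per_matrix = d_in * r + r * d_out
    return per_matrix * targets_per_layer * n_layers


# r=16, d=2048, Q and V projections only, 32 layers.
per_layer = lora_param_count(2048, 2048, 16, targets_per_layer=2, n_layers=1)
total = lora_param_count(2048, 2048, 16, targets_per_layer=2, n_layers=32)

base_params = 1_000_000_000  # assumed base model size of about 1B parameters
fraction = total / base_params
print(per_layer, total, f"{fraction:.2%}")
```

Targeting more projection matrices per layer or raising r scales the count linearly, so doubling either one doubles the trainable parameters.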