Fine-Tuning Math — LoRA and QLoRA

Full fine-tuning updates every parameter — memory-heavy. LoRA inserts small trainable matrices alongside frozen weights — drops memory by 10-100×. QLoRA adds 4-bit quantization to the frozen base — fits 70B fine-tune on one GPU. The math is small.

Advertisement

LoRA — low-rank update

# Original weight W (frozen):
h = x · W              # shape: [d_in, d_out]

# LoRA: add low-rank trainable delta
h = x · W + x · A · B
# A ∈ ℝ^(d_in × r),  B ∈ ℝ^(r × d_out)
# r << min(d_in, d_out), typically 8-64

A and B are small (r=16 means 32x reduction). W stays frozen. Forward cost: one extra small matmul. Backward and optimizer ONLY for A, B. Memory savings massive: optimizer state for 99% of params is gone.

Why low-rank works

Fine-tuning adjustments are often low-intrinsic-rank: most useful updates lie in a small subspace. Hu et al. (2021) showed empirically that r=8-32 captures most fine-tuning gains for transformers. Higher rank gives diminishing returns.

Advertisement

QLoRA — quantized base + LoRA

Take W and quantize to 4-bit (NF4 format). Forward computes h = dequant(W)·x using a fused kernel, then adds x·A·B. Trainable A, B stay in BF16. Result: fine-tune a 7B model in ~6 GB GPU memory. Made fine-tuning accessible.

Inference: merge or stack

# Option 1 — merge:
# W_final = W + A · B    # combine offline
# Use W_final at inference; no LoRA overhead

# Option 2 — stack:
# Keep A, B separate; switch them at inference time
# Allows per-request adapter selection (multi-tenant)

Merge: cleanest for single-purpose deployment. Stack: lets you serve many fine-tuned variants from one base. vLLM, llama.cpp support both.

Parameter count

# r=16, d=2048, target only Q and V proj per layer (32 layers):
# Per layer: 2 * (2048*16 + 16*2048) = 130K
# Total LoRA params: 32 * 130K = 4.2M
# vs full model: ~1B+
# Reduction: ~250x fewer trainable params

With 0.1% of params trainable, you can adapt a model to a new domain in hours on consumer hardware. The standard practical fine-tuning recipe in 2026.

LoRA: add small low-rank trainable matrices, freeze the rest. QLoRA: 4-bit quantize the frozen base. Standard fine-tuning.