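Vector databases at scale (>10M vectors) hit a memory wall: 16M vectors × 768 dims × 4 bytes (FP32) ≈ 49GB for the embeddings alone, before any index overhead. Quantization cuts this by 4x to 128x, depending on the technique, with controllable quality loss. The techniques differ from LLM weight quantization: the goal here is to preserve nearest-neighbor ranking between vectors, not the outputs of a network.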
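The arithmetic above can be checked in a few lines. This is a rough sketch: the helper name embedding_bytes is made up for illustration, and it counts raw vector storage only, ignoring index structures such as HNSW graphs.

```python
def embedding_bytes(num_vectors: int, dims: int, bits_per_dim: float) -> float:
    """Raw storage for the vectors alone (no index overhead)."""
    return num_vectors * dims * bits_per_dim / 8


N, D = 16_000_000, 768  # the example corpus from the text

for name, bits in [("float32", 32), ("int8 (SQ)", 8), ("1-bit (binary)", 1)]:
    gb = embedding_bytes(N, D, bits) / 1e9
    print(f"{name:>15}: {gb:6.2f} GB")
```

For this corpus the three rows come out at roughly 49 GB, 12 GB and 1.5 GB.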
Scalar quantization (SQ)
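FP32 → INT8 per dimension, using a value range learned from the data. 4x memory reduction. ~1-3% recall drop on standard benchmarks. Cheap to train and apply; widely supported (for example HNSW + SQ in Faiss; pgvector offers a half-precision halfvec type, a 2x reduction, rather than INT8). The default first step.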
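A minimal sketch of scalar quantization with NumPy, assuming a simple per-dimension min/max range learned from a sample. The function names (sq_train, sq_encode, sq_decode) are illustrative and not taken from any library, and real implementations often clip to quantiles instead of the raw min/max. Codes are stored as unsigned 8-bit integers plus an offset, which costs the same as signed INT8.

```python
import numpy as np


def sq_train(vectors: np.ndarray):
    """Learn a per-dimension offset and step size from sample vectors."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # constant dimensions: avoid division by zero
    return lo, scale


def sq_encode(vectors: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map each float to one of 256 evenly spaced levels (1 byte per dim)."""
    codes = np.round((vectors - lo) / scale)
    return np.clip(codes, 0, 255).astype(np.uint8)


def sq_decode(codes: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original floats."""
    return codes.astype(np.float32) * scale + lo


rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 768)).astype(np.float32)  # stand-in embeddings

lo, scale = sq_train(x)
codes = sq_encode(x, lo, scale)
x_hat = sq_decode(codes, lo, scale)

print("compression:", x.nbytes / codes.nbytes)
print("max abs error:", float(np.abs(x - x_hat).max()))
```

The reconstruction error per dimension is bounded by half a quantization step, which is why recall usually drops only slightly.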
Product quantization (PQ)
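Split each vector into M sub-vectors; cluster each sub-space into 256 centroids (k-means); store one 1-byte centroid ID per sub-vector, so a vector takes M bytes. 32-128x memory reduction: for 768-dim FP32 (3,072 bytes), M = 96 gives 32x and M = 24 gives 128x. Larger recall drop (5-15%); largely recoverable with re-ranking. Used in Faiss IVFPQ.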
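The split, cluster and encode steps can be sketched as follows. This is a toy implementation under stated assumptions: a plain Lloyd's k-means written inline, small random data (64 dims, M = 8) so that it runs quickly, and hypothetical function names (kmeans, pq_train, pq_encode, pq_decode). It is not how Faiss implements PQ internally.

```python
import numpy as np


def kmeans(x: np.ndarray, k: int, iters: int = 10, seed: int = 0) -> np.ndarray:
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def pq_train(x: np.ndarray, m: int, k: int = 256) -> np.ndarray:
    """One codebook per sub-space; result has shape (m, k, d // m)."""
    subs = np.split(x, m, axis=1)  # requires d to be divisible by m
    return np.stack([kmeans(s, k, seed=i) for i, s in enumerate(subs)])


def pq_encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Replace each sub-vector with the ID of its nearest centroid."""
    m = codebooks.shape[0]
    codes = np.empty((len(x), m), dtype=np.uint8)
    for i, s in enumerate(np.split(x, m, axis=1)):
        d2 = ((s[:, None, :] - codebooks[i][None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = d2.argmin(axis=1)
    return codes


def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Rebuild approximate vectors by concatenating the chosen centroids."""
    m = codebooks.shape[0]
    return np.concatenate([codebooks[i][codes[:, i]] for i in range(m)], axis=1)


rng = np.random.default_rng(0)
pq_x = rng.standard_normal((2000, 64)).astype(np.float32)  # stand-in data

codebooks = pq_train(pq_x, m=8)
pq_codes = pq_encode(pq_x, codebooks)
pq_x_hat = pq_decode(pq_codes, codebooks)

print("bytes per vector:", pq_x[0].nbytes, "->", pq_codes[0].nbytes)
print("reconstruction MSE:", float(np.mean((pq_x - pq_x_hat) ** 2)))
```

Here each 256-byte vector becomes 8 bytes (32x). The codebooks are a fixed cost shared by all vectors, so they are negligible at corpus scale.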
Binary quantization
Each dimension → 1 bit (sign). 32x memory reduction. Surprisingly good recall for high-dim embeddings (Cohere's Embed v3 is designed for this). Hamming distance is fast (XOR + popcount), so search the binary codes first, then re-rank the top-K with full precision.
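A small NumPy sketch of sign-bit quantization and Hamming search. The helper names binarize and hamming are illustrative, and the random vectors stand in for real embeddings. Production systems typically compute popcount with hardware instructions rather than unpacking bits as done here.

```python
import numpy as np


def binarize(vectors: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(vectors > 0, axis=1)


def hamming(query_code: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Number of differing bits between one packed code and every row of codes."""
    xor = np.bitwise_xor(codes, query_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)


rng = np.random.default_rng(0)
bx = rng.standard_normal((1000, 768)).astype(np.float32)  # stand-in embeddings

b_codes = binarize(bx)            # 768 bits = 96 bytes per vector
dists = hamming(b_codes[0], b_codes)

print("bytes per vector:", bx[0].nbytes, "->", b_codes[0].nbytes)
print("nearest by Hamming distance:", int(dists.argmin()))
```

Using a stored vector as the query returns that vector itself at distance 0, which makes a convenient sanity check.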
Matryoshka embeddings
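Trained so that the first N dims form a usable embedding on their own, the first 2N are better, and the full vector is best. Pick the truncation level at runtime. Combines naturally with quantization: truncating to a quarter of the dimensions (4x) and scalar-quantizing to INT8 (4x) gives 16x memory reduction.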
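The mechanics of truncation plus scalar quantization can be sketched as below. Two assumptions to flag: the random vectors are not Matryoshka-trained, so this shows only the memory arithmetic and not retrieval quality, and the INT8 step uses a simple symmetric scheme (multiply unit-norm values by 127) chosen for brevity. The name truncate is illustrative.

```python
import numpy as np


def truncate(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    head = vectors[:, :dims]
    norms = np.linalg.norm(head, axis=1, keepdims=True)
    return head / np.maximum(norms, 1e-12)


rng = np.random.default_rng(0)
full = rng.standard_normal((1000, 768)).astype(np.float32)  # stand-in embeddings

small = truncate(full, 192)                       # keep a quarter of the dims: 4x
m_codes = np.round(small * 127).astype(np.int8)   # values lie in [-1, 1]: 4x more

print("bytes per vector:", full[0].nbytes, "->", m_codes[0].nbytes)
print("total reduction:", full.nbytes / m_codes.nbytes)
```

Re-normalizing after truncation keeps cosine similarity and inner product equivalent on the shortened vectors.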
Two-stage retrieval
Stage 1: search quantized vectors (fast, recall-imperfect) and keep the top K candidates, with K larger than the number of results finally returned. Stage 2: re-rank those K candidates with full-precision vectors (accurate). Recovers most recall at small extra cost. Standard pattern at scale.
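A sketch of the two stages, assuming sign-bit codes for stage 1 and exact inner product for stage 2. The function name two_stage_search and the oversample parameter are illustrative choices, not a specific library's API. Here the full-precision vectors sit in memory for simplicity; at scale they could equally live on disk, since stage 2 touches only a few rows.

```python
import numpy as np


def two_stage_search(query, vectors, codes, k=10, oversample=10):
    """Return indices of the top-k vectors for `query`.

    Stage 1 ranks all rows by Hamming distance on packed sign bits.
    Stage 2 re-ranks only k * oversample candidates by exact inner product.
    """
    # Stage 1: cheap, approximate search over the quantized codes.
    q_code = np.packbits(query > 0)
    dist = np.unpackbits(np.bitwise_xor(codes, q_code), axis=1).sum(axis=1)
    n_cand = min(k * oversample, len(codes))
    cand = np.argpartition(dist, n_cand - 1)[:n_cand]

    # Stage 2: exact scoring on the small candidate set.
    scores = vectors[cand] @ query
    return cand[np.argsort(-scores)[:k]]


rng = np.random.default_rng(0)
db = rng.standard_normal((5000, 256)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)   # unit-norm stand-in embeddings
db_codes = np.packbits(db > 0, axis=1)            # 32 bytes per vector

# A query that is a slightly perturbed copy of row 42.
q = db[42] + 0.01 * rng.standard_normal(256).astype(np.float32)

top = two_stage_search(q, db, db_codes, k=10)
print("top hit:", int(top[0]))
```

Raising oversample trades a little more stage 2 work for higher recall, which is the main tuning knob in this pattern.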