The first operation in any transformer turns discrete token IDs into continuous vectors. It looks like a hash table but is mathematically a matrix multiplication with one-hot vectors. Knowing this matters for sharing weights, computing gradients, and packing models for CPU inference.

Vocabulary and IDs

A tokenizer maps strings to sequences of integers in [0, V), where V is the vocabulary size. For example, Llama 3 uses V = 128256 and Phi-3-mini uses V = 32064. 'Hello' might be one token; 'Hello world' might be two.
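To make the string-to-ID mapping concrete, here is a minimal sketch of a toy tokenizer. The vocabulary, the `encode` helper, and the greedy longest-match rule are all invented for illustration; real tokenizers such as BPE learn their vocabulary and merge rules from data.

```python
# Toy vocabulary, invented for illustration; real tokenizers learn
# tens of thousands of subword entries from a training corpus.
VOCAB = ["<unk>", "Hello", " world", "Hel", "lo", " ", "w", "o", "r", "l", "d"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
V = len(VOCAB)  # vocabulary size; every ID falls in [0, V)


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding (a simplification, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate substring first.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            # No vocabulary entry matches: emit <unk> and skip one character.
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids


print(encode("Hello"))        # [1]
print(encode("Hello world"))  # [1, 2]
```

With this toy vocabulary, 'Hello' encodes to one ID and 'Hello world' to two, matching the example above.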

Embedding matrix

E ∈ ℝ^(V × d)
# V = vocab size, d = model dimension (typically 768 to 4096 for small models)

E is a learned weight matrix; row i holds the embedding vector for token i. For Phi-3-mini with V = 32064 and d = 3072, that is 32064 × 3072 ≈ 98.5M parameters in the embedding alone. In small models it is often the single biggest weight matrix.
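The parameter count is just V·d, and the memory footprint follows from the bytes per value. The sketch below works through the arithmetic for the sizes quoted above. The int4 figure assumes two values packed per byte and ignores any per-block scales a real quantization format would add.

```python
V = 32064   # vocabulary size quoted above for Phi-3
d = 3072    # model dimension quoted above

params = V * d
print(f"{params:,} parameters")  # 98,500,608 parameters

# Bytes per parameter for a few storage formats.
# int4 is idealized here: two values per byte, no scale/zero-point overhead.
bytes_per_param = {"fp32": 4.0, "fp16/bf16": 2.0, "int4": 0.5}
for name, nbytes in bytes_per_param.items():
    mib = params * nbytes / 2**20
    print(f"{name:>9}: {mib:8.1f} MiB")
```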

Lookup as matmul

token_id → one_hot ∈ {0,1}^V (zeros everywhere, 1 at position id)
embedding = one_hot · E   ∈ ℝ^d

Mathematically this is a matmul, but implementations skip the multiplication and index row id of E directly. In PyTorch, nn.Embedding(V, d)(ids) is a gather under the hood: for a batch of B sequences of N tokens each, it costs O(B·N·d) time instead of the O(B·N·V·d) of an explicit one-hot matmul.
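The equivalence can be checked numerically. The following is a small NumPy sketch (NumPy stands in for PyTorch here, and the sizes and random values are arbitrary) showing that the one-hot matmul and the direct row gather give the same result. It also shows the matching backward pass: since y = one_hot · E, the gradient with respect to E is one_hotᵀ · dL/dy, which is the same as scatter-adding each upstream gradient row into the row of its token.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                        # tiny sizes so the one-hot tensor stays cheap
E = rng.standard_normal((V, d))     # stand-in for a learned embedding matrix
ids = np.array([[3, 7, 3],          # shape (B, N) = (2, 3); token 3 repeats on purpose
                [0, 9, 1]])

# Forward, route 1: explicit one-hot matmul, O(B*N*V*d).
one_hot = np.eye(V)[ids]            # shape (B, N, V)
via_matmul = one_hot @ E            # shape (B, N, d)

# Forward, route 2: gather rows directly, O(B*N*d).
via_gather = E[ids]                 # shape (B, N, d)
print(np.allclose(via_matmul, via_gather))   # True

# Backward: with y = one_hot @ E, dL/dE = one_hot^T @ dL/dy.
grad_out = rng.standard_normal(via_gather.shape)   # pretend upstream gradient
grad_matmul = one_hot.reshape(-1, V).T @ grad_out.reshape(-1, d)

# The same result as a scatter-add; np.add.at accumulates repeated indices.
grad_scatter = np.zeros_like(E)
np.add.at(grad_scatter, ids, grad_out)
print(np.allclose(grad_matmul, grad_scatter))      # True
```

Rows of E whose tokens never appear in the batch receive a zero gradient, which is why embedding gradients are typically sparse.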

Storage layout

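On disk, E is stored as a contiguous V × d block, often quantized. In memory it is row-major, typically in float16, bfloat16, or int4. Each row is contiguous, so reading one token's embedding streams through consecutive cache lines (typically 64 bytes, holding 16 fp32 or 32 fp16 values) and uses each line fully. Consecutive tokens in a sequence generally map to unrelated rows, so the locality is within a row, not across tokens.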
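The row-major layout can be inspected directly. The sketch below uses a NumPy float16 array as a stand-in for the in-memory matrix, with V shrunk to keep the demo small. The 64-byte cache line is an assumption about typical hardware, and the lines-per-row figure assumes rows start on a cache-line boundary.

```python
import numpy as np

V, d = 1000, 3072                        # V shrunk for the demo; d as quoted for Phi-3
E = np.zeros((V, d), dtype=np.float16)   # NumPy arrays are row-major (C order) by default

CACHE_LINE = 64                          # bytes; typical, but hardware-dependent

print(E.flags["C_CONTIGUOUS"])           # True
print(E.strides)                         # (6144, 2): next row is d*2 bytes away

values_per_line = CACHE_LINE // E.itemsize
lines_per_row = (d * E.itemsize) // CACHE_LINE
print(values_per_line)                   # 32
print(lines_per_row)                     # 96

row = E[42]                              # one embedding: a contiguous 6144-byte view
print(row.flags["C_CONTIGUOUS"])         # True
```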

Tied embeddings — input = output

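Many small language models (SLMs), including GPT-2 and some Phi and Qwen variants, tie the input embedding E to the output projection W_out by setting W_out = Eᵀ, so the logits for a final hidden state h ∈ ℝ^d are h · Eᵀ ∈ ℝ^V. Tying saves V·d parameters, which is significant for small models, and may cost a little quality. It is not strictly required for CPU inference, but it helps when memory is tight. Whether a given checkpoint ties its weights varies by model and size, so check its configuration.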
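A minimal NumPy sketch of weight tying follows. The sizes are arbitrary, and the transformer stack is replaced by an identity placeholder, so this shows only how one matrix serves both ends of the model. The final line is pure arithmetic on the V and d quoted earlier, not a claim about any particular checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4
E = rng.standard_normal((V, d))     # one matrix shared by input and output

ids = np.array([3, 7, 1])           # a sequence of N = 3 token IDs
x = E[ids]                          # input side: gather rows, shape (N, d)

h = x                               # placeholder for the transformer stack's output
logits = h @ E.T                    # output side: W_out = E^T, shape (N, V)
print(logits.shape)                 # (3, 10)

# E.T is a view of E, so the output projection needs no second copy.
print(np.shares_memory(E, E.T))     # True

# Parameters an untied model of the quoted size would spend on a separate W_out.
saved = 32064 * 3072
print(f"{saved:,}")                 # 98,500,608
```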

The embedding lookup is a gather that is mathematically a matmul with a one-hot vector. E is often the biggest single weight matrix in a small model. Tying it to the output projection saves V·d parameters.