The first operation in any transformer turns discrete token IDs into continuous vectors. It looks like a hash table but is mathematically a matrix multiplication with one-hot vectors. Knowing this matters for sharing weights, computing gradients, and packing models for CPU inference.
Vocabulary and IDs
A tokenizer splits text into tokens and maps each token to an integer ID in [0, V), where V is the vocabulary size. Llama 3: V=128256. Phi-3: V=32064. 'Hello' might be one token; 'Hello world' might be two.
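To make the string-to-ID mapping concrete, here is a minimal sketch. The vocabulary and the greedy longest-match rule are hypothetical stand-ins for illustration; real tokenizers learn their vocabulary and merge rules from data.

```python
# Hypothetical toy vocabulary: token string -> integer ID in [0, V).
VOCAB = {"Hello": 0, " world": 1, "H": 2, "e": 3, "l": 4,
         "o": 5, " ": 6, "w": 7, "r": 8, "d": 9}


def encode(text, vocab=VOCAB):
    """Greedy longest-match: at each position, take the longest vocab entry."""
    ids = []
    pos = 0
    max_len = max(len(piece) for piece in vocab)
    while pos < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            raise ValueError(f"no vocab entry covers {text[pos]!r}")
    return ids


hello_ids = encode("Hello")              # one token
hello_world_ids = encode("Hello world")  # two tokens
```

With this toy vocabulary, 'Hello' encodes to a single ID and 'Hello world' to two, and every ID falls in [0, V) with V = len(VOCAB).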
Embedding matrix
E ∈ ℝ^(V × d)
# V = vocab size, d = model dimension (typically 768 to 4096)
E is a learned weight matrix. Row i holds the embedding vector for token i. For Phi-3-mini with V=32064 and d=3072, that is V·d ≈ 98.5M parameters in the embedding alone, often the single largest weight matrix in a small model.
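The parameter count is just V·d, and the storage cost follows from the bits per weight. A small sketch of that arithmetic, using the V=32064, d=3072 figures from the text; the helper names are made up for this example, and the int4 figure ignores any per-block scale factors a real quantization format would add.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in a V x d embedding matrix."""
    return vocab_size * d_model


def embedding_bytes(vocab_size, d_model, bits_per_weight):
    """Raw storage for the matrix, ignoring quantization metadata."""
    return vocab_size * d_model * bits_per_weight // 8


MIB = 2 ** 20

phi3_params = embedding_params(32064, 3072)               # 98,500,608 weights
phi3_fp16_mib = embedding_bytes(32064, 3072, 16) / MIB    # 187.875 MiB
phi3_int4_mib = embedding_bytes(32064, 3072, 4) / MIB     # about 47 MiB
```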
Lookup as matmul
token_id → one_hot ∈ {0,1}^V (zeros everywhere, 1 at position id)
embedding = one_hot · E ∈ ℝ^d
Mathematically this is a matmul, but implementations skip the multiplication and index row id of E directly. nn.Embedding(V, d)(ids) is internally a gather operation: for a batch of B sequences of N tokens each, it costs O(B·N·d) time instead of the O(B·N·V·d) of a dense matmul.
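The equivalence between the one-hot matmul and a plain row lookup can be checked directly. This is a pure-Python sketch with tiny, hypothetical sizes and random stand-in weights, not how a framework implements it.

```python
import random

V, d = 10, 4  # tiny hypothetical sizes
random.seed(0)
# Stand-in for a learned embedding matrix: V rows of d values.
E = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]


def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def embed_by_matmul(token_id, E):
    """one_hot . E, touching all V rows: O(V * d) work per token."""
    x = one_hot(token_id, len(E))
    d_model = len(E[0])
    return [sum(x[i] * E[i][j] for i in range(len(E))) for j in range(d_model)]


def embed_by_lookup(token_id, E):
    """Index the row directly: O(d) work per token."""
    return list(E[token_id])


ids = [3, 7, 3]
via_matmul = [embed_by_matmul(t, E) for t in ids]
via_lookup = [embed_by_lookup(t, E) for t in ids]
```

Both paths return the same vectors; the lookup simply never multiplies by the V-1 zeros.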
Storage layout
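On disk, E is stored as a contiguous V × d block, often quantized. In memory it is row-major in float16, bfloat16, or int4. A lookup reads one full row of d contiguous values, so every cache line it touches (typically 64 bytes = 16 fp32 or 32 fp16 values) is fully used. Consecutive tokens usually map to unrelated rows, so the locality is within a row, not across rows.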
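In a row-major layout, row i starts at byte offset i·d·(bytes per value). The sketch below works that out for d=3072 in fp16, assuming 64-byte cache lines and a matrix whose base address is cache-line aligned; both are assumptions, and the function names are invented for this example.

```python
CACHE_LINE_BYTES = 64  # typical value; the real figure is hardware-specific


def row_byte_range(token_id, d_model, bytes_per_value):
    """Byte offsets [start, end) of one embedding row in a row-major matrix."""
    row_bytes = d_model * bytes_per_value
    start = token_id * row_bytes
    return start, start + row_bytes


def cache_lines_per_row(d_model, bytes_per_value, line_bytes=CACHE_LINE_BYTES):
    """Cache lines touched by one row, assuming the row starts on a line boundary."""
    row_bytes = d_model * bytes_per_value
    return -(-row_bytes // line_bytes)  # ceiling division


start, end = row_byte_range(token_id=5, d_model=3072, bytes_per_value=2)
lines = cache_lines_per_row(d_model=3072, bytes_per_value=2)
# start = 30720, end = 36864, lines = 96
```

Because 3072 × 2 bytes is a multiple of 64, each fp16 row occupies exactly 96 full cache lines with no line shared between rows.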
Tied embeddings — input = output
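Some SLMs (GPT-2 and the smaller Qwen models, for example) tie the input embedding E to the output projection W_out: W_out = Eᵀ, so the logits for a final hidden state h are h · Eᵀ. This saves V·d parameters, which is significant for small models, possibly at a slight quality cost. Tying is a per-model design choice, not a requirement for CPU inference, but it helps when memory is constrained.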
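Tying means one matrix E serves both the input lookup and the output projection. A minimal sketch with tiny hypothetical sizes and made-up weights; the parameter-saving line uses GPT-2 small's sizes (V=50257, d=768), which do not come from the text above.

```python
V, d = 6, 3  # tiny hypothetical sizes
# Stand-in for learned weights: V rows of d values.
E = [[0.1 * (i + 1) * (j + 1) for j in range(d)] for i in range(V)]


def embed(token_id):
    """Input side: row lookup in E."""
    return E[token_id]


def output_logits(hidden):
    """Output side: logits = hidden . E^T, one dot product per row of the same E."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]


def params_saved_by_tying(vocab_size, d_model):
    """A separate W_out would add another V x d weights."""
    return vocab_size * d_model


hidden = [1.0, 0.0, -1.0]   # hypothetical final hidden state
logits = output_logits(hidden)               # one score per vocabulary entry
saved = params_saved_by_tying(50257, 768)    # 38,597,376 weights
```

No transposed copy of E is ever materialized: the output projection just reads the same rows the input lookup uses.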