Tokenizer Math — BPE and Subword

Tokenizers map strings to integer sequences. The choice of algorithm affects model quality, compression rate, and edge cases. Byte-Pair Encoding (BPE) dominates modern LLMs; understanding it explains why your prompt has the token count it has.

Byte-Pair Encoding

# Training (offline):
# 1. Start with byte-level vocab (256 base tokens)
# 2. Find the most frequent adjacent pair in corpus
# 3. Merge them as a new token
# 4. Repeat until vocab size reached

Training is greedy, bottom-up merging. The final vocab contains common subwords ('the', 'ing', 'tion'), occasional whole words, and fragments that cover rare words. BPE is used by GPT (tiktoken), Llama (SentencePiece-BPE), and Phi.
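The training loop above can be sketched in a few lines of Python. This is a toy, unoptimized illustration rather than how any production trainer works: the names `train_bpe` and `merge_pair` are made up for this example, there is no pre-tokenization step, and ties between equally frequent pairs are broken by first occurrence, which real trainers may handle differently.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace each left-to-right occurrence of `pair` in `seq` with the joined token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of strings (toy version)."""
    # Step 1: every document starts as a sequence of single-byte tokens.
    docs = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]
    merges = {}  # (left, right) -> rank; lower rank means learned earlier
    for rank in range(num_merges):
        # Step 2: count adjacent pairs across the whole corpus.
        counts = Counter()
        for seq in docs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break  # nothing left to merge
        # Assumption: ties go to the pair seen first.
        best = max(counts, key=counts.get)
        # Step 3: record the merge and apply it everywhere.
        merges[best] = rank
        docs = [merge_pair(seq, best) for seq in docs]
    return merges


merges = train_bpe(["low lower lowest", "slow slowly"], num_merges=3)
for (left, right), rank in merges.items():
    print(rank, left, right)
```

The returned dictionary doubles as the rank table used at encode time: the rank of a pair is simply the order in which it was learned.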

Encoding at runtime

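def encode(text, merges, vocab):
    # merges: {(left, right): rank}; vocab: {token_bytes: token_id}
    tokens = [bytes([b]) for b in text.encode("utf-8")]   # 256-token alphabet
    while len(tokens) > 1:
        # Lowest rank = merge learned earliest during training
        pair = min(zip(tokens, tokens[1:]), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges: break
        merged, i = [], 0
        while i < len(tokens):             # merge each occurrence, left to right
            hit = i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair
            merged.append(tokens[i] + tokens[i + 1] if hit else tokens[i])
            i += 2 if hit else 1
        tokens = merged
    return [vocab[t] for t in tokens]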

Encoding applies the learned merges greedily, lowest rank first. tiktoken implements this loop in optimized Rust. The per-document cost is small but non-zero, which matters for streaming, where you tokenize on the fly.
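To see what "lowest rank first" means in practice, it can help to watch the token sequence shrink step by step. The sketch below is a self-contained illustration, not tiktoken's implementation: `encode_trace` is a hypothetical helper, and `toy_merges` is a hand-written rank table invented for the example.

```python
def encode_trace(text, merges):
    """Return every intermediate token sequence while applying merges lowest-rank first."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    steps = [list(tokens)]
    while len(tokens) > 1:
        # Among the adjacent pairs present, pick the one learned earliest.
        pair = min(zip(tokens, tokens[1:]), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        steps.append(list(tokens))
    return steps


# Hypothetical merge table: rank 0 was learned first.
toy_merges = {(b"l", b"o"): 0, (b"lo", b"w"): 1, (b"e", b"r"): 2}

for step in encode_trace("lower", toy_merges):
    print(step)
```

Each printed line is one merge further along. The order is driven by rank, not by position in the string, so a late-learned pair near the start of the text waits until earlier-learned pairs elsewhere have been merged.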

BBPE — byte-level vs Unicode

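GPT-2 introduced byte-level BPE: merge UTF-8 bytes, not Unicode characters. Because every string is a byte sequence, it handles all text, including unseen languages and emoji, with no unknown-token case. Used by GPT and, depending on version, Llama and Phi. The alternative is to merge Unicode characters, as SentencePiece does by default; it has slightly different edge cases, such as how characters missing from the vocab are handled.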
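A quick way to see why 256 base tokens are enough is to look at the UTF-8 bytes directly. The snippet below is only an illustration of the starting point before any merges are applied; `byte_tokens` is a name made up for this example.

```python
def byte_tokens(text):
    """Base token ids for byte-level BPE: the UTF-8 bytes, each in the range 0..255."""
    return list(text.encode("utf-8"))


# ASCII, an accented letter, two CJK characters, and an emoji.
samples = ["hi", "\u00e9", "\u65e5\u672c", "\U0001F642"]

for sample in samples:
    ids = byte_tokens(sample)
    print(f"{sample!r}: {len(sample)} code point(s) -> {len(ids)} byte token(s) {ids}")
```

Non-ASCII characters start out as two to four base tokens each, so how well a script compresses depends on how many merges the training corpus produced for it.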

Compression rate matters

The same English text costs a different number of tokens in different tokenizers: roughly 1 token per 4 characters for GPT-4 (tiktoken), 1 per 3.8 for the Llama tokenizer, and 1 per 4.5 for older WordPiece (BERT). The ratio determines how much text fits in the context window and how much a request costs. For non-English text, compression varies more.
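The arithmetic behind these ratios is simple enough to script. The sketch below reuses the approximate chars-per-token figures quoted above; the document length, context window size, and per-token price are made-up inputs for illustration, and real counts should come from running the actual tokenizer.

```python
import math

# Rough English chars-per-token figures from the text; treat as approximations.
CHARS_PER_TOKEN = {"gpt-4 (tiktoken)": 4.0, "llama": 3.8, "bert (wordpiece)": 4.5}


def estimate_tokens(num_chars, chars_per_token):
    """Estimate a token count from a character count and a chars-per-token ratio."""
    return math.ceil(num_chars / chars_per_token)


def estimate_cost(num_tokens, usd_per_million_tokens):
    """Cost of `num_tokens` at a hypothetical per-million-token price."""
    return num_tokens * usd_per_million_tokens / 1_000_000


doc_chars = 100_000       # assumed document length in characters
context_window = 8_192    # assumed context size in tokens

for name, ratio in CHARS_PER_TOKEN.items():
    n = estimate_tokens(doc_chars, ratio)
    print(f"{name}: ~{n} tokens, {n / context_window:.1f}x a {context_window}-token window")
```

A difference of a few tenths of a character per token looks small, but over long documents it adds up to thousands of tokens of context and a proportional change in cost.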

Special tokens

To recap: BPE merges frequent adjacent pairs into vocab entries. tiktoken and SentencePiece-BPE are the dominant implementations. Compression varies by tokenizer and language, which affects cost and context usage.