Tokenizers map strings to integer sequences. The choice of algorithm affects model quality, compression rate, and edge cases. Byte-Pair Encoding (BPE) dominates modern LLMs; understanding it explains why your prompt has the token count it has.
Byte-Pair Encoding
# Training (offline):
# 1. Start with a byte-level vocab (256 base tokens)
# 2. Find the most frequent adjacent pair in the corpus
# 3. Merge that pair into a new token
# 4. Repeat until the target vocab size is reached

Greedy bottom-up merging. The final vocab contains common subwords ('the', 'ing', 'tion'), occasional whole words, and short fragments used to spell out rare words. Used by GPT (tiktoken), Llama (SentencePiece-BPE), Phi.
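The four training steps can be sketched as a toy trainer. This is a minimal illustration under simplifying assumptions, not how any production tokenizer is trained: it works on one raw byte string, skips the pre-tokenization that real tokenizers apply before counting pairs, and the name `train_bpe` and its return shape are hypothetical.

```python
from collections import Counter

def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to num_merges merges from a raw byte corpus (toy version)."""
    tokens = [bytes([b]) for b in corpus]   # step 1: one token per byte
    merges = {}                             # (left, right) -> rank
    for rank in range(num_merges):
        # step 2: count every adjacent pair and take the most frequent
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break                           # nothing repeats, stop early
        merges[(left, right)] = rank        # step 3: record the merge
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged                     # step 4: repeat on the new sequence
    return merges, tokens

merges, tokens = train_bpe(b"aaabdaaabac", 3)
print(tokens)
```

The rank stored for each merge is simply the order in which it was learned; that order is what the encoder consults at runtime.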
Encoding at runtime
def encode(text, merges, vocab):
    # merges maps (left, right) -> rank; vocab maps token bytes -> id
    tokens = [bytes([b]) for b in text.encode("utf-8")]  # 256-token alphabet
    while True:
        # pick the adjacent pair that was learned earliest (lowest rank)
        pair = find_lowest_rank_pair(tokens, merges)
        if pair is None:
            break
        tokens = apply_merge(tokens, pair)
    return [vocab[t] for t in tokens]

Apply learned merges greedily, lowest-rank first. tiktoken does this in optimized Rust. Per-document cost: small but non-zero. Important for streaming where you tokenize on the fly.
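For a version that runs end to end, the two helpers can be filled in as follows. This is a sketch under assumptions: the helper bodies, the toy `merges` and `vocab` tables, and the ID values are made up for illustration, and `encode` is repeated so the block runs on its own. Real implementations use faster data structures than a full rescan per merge.

```python
def find_lowest_rank_pair(tokens, merges):
    """Return the adjacent pair with the lowest merge rank, or None."""
    candidates = [pair for pair in zip(tokens, tokens[1:]) if pair in merges]
    return min(candidates, key=merges.get) if candidates else None

def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of pair, left to right."""
    left, right = pair
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode(text, merges, vocab):
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while True:
        pair = find_lowest_rank_pair(tokens, merges)
        if pair is None:
            break
        tokens = apply_merge(tokens, pair)
    return [vocab[t] for t in tokens]

# Hypothetical toy tables: two learned merges on top of the 256 byte tokens.
merges = {(b"t", b"h"): 0, (b"th", b"e"): 1}
vocab = {bytes([b]): b for b in range(256)}
vocab[b"th"] = 256
vocab[b"the"] = 257

print(encode("the then", merges, vocab))  # [257, 32, 257, 110]
```

Both occurrences of "the" collapse to the single ID 257, while the space and the trailing "n" stay as raw byte tokens because no merge covers them.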
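BBPE: byte-level vs Unicode
GPT-2 introduced byte-level BPE: run the merges over UTF-8 bytes, not Unicode characters. Since every string is some sequence of bytes, it handles all text, including unseen languages and emoji, without an unknown token. Used by the GPT family, Llama 3, and some Phi models. Alternative: character-level BPE as in SentencePiece (Llama 1 and 2), which merges Unicode characters and can fall back to bytes for characters missing from the vocab; the edge cases differ slightly.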
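To see what "tokenize bytes" means before any merges are applied, the snippet below prints the base tokens for a few characters. It is only an illustration; `byte_tokens` is a hypothetical helper name, and a trained tokenizer would merge many of these byte sequences into single tokens.

```python
def byte_tokens(text: str) -> list[int]:
    """Base token IDs for byte-level BPE: the UTF-8 bytes of the text."""
    return list(text.encode("utf-8"))

# One ASCII letter, one accented Latin letter, one CJK character, one emoji.
for s in ["a", "\u00e9", "\u65e5", "\U0001F642"]:
    print(repr(s), "code points:", len(s), "bytes:", byte_tokens(s))
```

A character outside ASCII costs two to four base tokens until the merge table learns to recombine them, which is one reason scripts that are rare in the training corpus tokenize less efficiently.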
Compression rate matters
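The same English text yields different token counts under different tokenizers. Rough figures: GPT-4's tiktoken encoding averages about 1 token per 4 characters, the Llama tokenizer about 1 per 3.8, and the older WordPiece tokenizer (BERT) about 1 per 4.5. The ratio determines how much text fits in a context window and what a request costs when billing is per token. For non-English text, compression varies more.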
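The effect on context usage is plain arithmetic. The sketch below turns a characters-per-token ratio into an estimated token count; the function name is hypothetical and the ratios are the approximate figures quoted above, not measurements.

```python
import math

def estimate_tokens(num_chars: int, chars_per_token: float) -> int:
    """Rough token count for a text of num_chars characters."""
    return math.ceil(num_chars / chars_per_token)

doc_chars = 10_000
print(estimate_tokens(doc_chars, 4.0))  # 2500
print(estimate_tokens(doc_chars, 3.8))  # 2632
```

On a 10,000-character document, the gap between the two ratios is about 130 tokens, roughly 5% of the budget. For an exact count, run the actual tokenizer on the actual text.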
Special tokens
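Pad, EOS, BOS, system role markers, tool-use markers. Tokenizers reserve dedicated IDs for these, outside the learned merges: some place them at the start of the vocab (SentencePiece models such as Llama 2 use 0-2 for unknown, BOS, and EOS), others at the end (tiktoken's cl100k_base puts <|endoftext|> at 100257). Chat models add many more, for example <|user|>, <|assistant|>, <|tool_call|>; the exact strings vary by model. Knowing them helps you debug 'why is my prompt being eaten'.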
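One way to picture how special tokens bypass the merge process is to split the text on the marker strings first and encode only the pieces in between. The sketch below is an assumption-laden illustration: the marker strings, the IDs 1000 and 1001, and the function names are all hypothetical, and `encode_ordinary` stands in for whatever regular encoder is in use.

```python
import re

# Hypothetical marker strings and IDs; real values depend on the model.
SPECIAL = {"<|user|>": 1000, "<|assistant|>": 1001}

def split_special(text, special=SPECIAL):
    """Split text into special-token strings and the ordinary text between them."""
    pattern = "(" + "|".join(re.escape(s) for s in special) + ")"
    return [piece for piece in re.split(pattern, text) if piece]

def encode_with_special(text, encode_ordinary, special=SPECIAL):
    """Map special strings to reserved IDs; encode everything else normally."""
    ids = []
    for piece in split_special(text, special):
        if piece in special:
            ids.append(special[piece])
        else:
            ids.extend(encode_ordinary(piece))
    return ids

# Stand-in encoder for the demo: raw UTF-8 bytes, no merges.
raw_bytes = lambda s: list(s.encode("utf-8"))
print(encode_with_special("<|user|>hi<|assistant|>", raw_bytes))  # [1000, 104, 105, 1001]
```

This also shows the failure mode behind an "eaten" prompt: if user-supplied text happens to contain a marker string, it becomes a control token instead of ordinary text. tiktoken guards against this by rejecting special-token strings in `encode` unless they are listed in `allowed_special`.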