Transformer Math & CPU SLM Labs

LAB · 01

Autoregressive Generation Loop

Step through prompt → predict → append → repeat.

Open lab →
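
The prompt → predict → append → repeat loop from Lab 01 can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `next_token` callable in place of a real model; the toy rule below is not an SLM.

```python
def generate(next_token, prompt, max_new_tokens, eos=None):
    """Predict one token, append it, and repeat until the budget or EOS."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # predict from everything generated so far
        tokens.append(tok)         # append the prediction to the context
        if tok == eos:             # optional stop token
            break
    return tokens


# Toy stand-in for a model: the next token is (last token + 1) mod 10.
def toy_model(tokens):
    return (tokens[-1] + 1) % 10


print(generate(toy_model, [1, 2], max_new_tokens=3))
```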

LAB · 02

Attention Score Matrix

Q · Kᵀ produces an N × N matrix of similarities.

Open lab →
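
As a rough sketch of the Q · Kᵀ step in Lab 02, using plain Python lists: entry (i, j) is the dot product of query i with key j. The usual 1/√d scaling and softmax are left out here, and the sample values are made up for illustration.

```python
def attention_scores(Q, K):
    """Return the N x N matrix S with S[i][j] = dot(Q[i], K[j])."""
    return [
        [sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
        for q_row in Q
    ]


# Three tokens (N = 3), each with a 2-dimensional query and key.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
S = attention_scores(Q, K)
for row in S:
    print(row)
```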

LAB · 03

Backprop Chain Rule

Watch gradients flow backwards through a 3-layer net.

Open lab →

LAB · 04

Beam Search Tree

K parallel hypotheses expand and get pruned.

Open lab →

LAB · 05

BF16 vs FP16 vs FP32 — Range and Precision

Why BF16 won over FP16 for LLM training.

Open lab →

LAB · 06

Cache Blocking for Matmul

Tile the matrices so each block fits in cache.

Open lab →
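
The tiling idea in Lab 06 can be sketched as a loop nest that works on one `tile × tile` block at a time. Pure Python will not show real cache effects, so treat this only as an illustration of the loop structure; the `tile` parameter and function names are assumptions for this example.

```python
def matmul_naive(A, B):
    """Reference implementation: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(A, B, tile=2):
    """Same result, but visits the matrices block by block."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # block of rows of A / C
        for j0 in range(0, m, tile):        # block of columns of B / C
            for p0 in range(0, k, tile):    # block along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
print(matmul_tiled(A, B, tile=2) == matmul_naive(A, B))
```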

LAB · 07

Cache Hits and Misses

L1/L2/L3 latency stacked up.

Open lab →

LAB · 08

Complete Transformer Block

Animate data flowing through pre-norm + attention + residual + pre-norm + FFN + residual.

Open lab →

LAB · 09

Full CPU SLM Stack — Top to Bottom

Application → engine → kernels → CPU instructions.

Open lab →

LAB · 10

Cross-Entropy Loss Surface

See how loss changes as the model's predicted probability shifts.

Open lab →
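
For a single target token, the cross-entropy loss in Lab 10 reduces to the negative log of the probability the model assigns to the correct token. A small sketch, with illustrative probabilities:

```python
import math


def cross_entropy(p_correct):
    """Loss in nats for one prediction, given the probability of the true token."""
    return -math.log(p_correct)


# Loss falls toward 0 as the predicted probability approaches 1.
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p = {p:<5} loss = {cross_entropy(p):.4f}")
```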

LAB · 11

DataLoader Pipeline

Workers prefetching batches into a queue.

Open lab →

LAB · 12

Dot Product Geometry (2D)

Drag vectors; see dot product, magnitude, angle.

Open lab →

LAB · 13

DPO Preference Loss

Direct preference optimization vs reward + PPO.

Open lab →

LAB · 14

Embedding Lookup as Gather

Token IDs become rows of the embedding matrix.

Open lab →

LAB · 15

End-to-End CPU SLM Recipe

Train → quantize → serve, all on CPU.

Open lab →

LAB · 16

FFN Expansion + Activation

d → d_ff → d. Two matmuls with an activation in between.

Open lab →

LAB · 17

FlashAttention Tiling

Tile attention block-by-block; keep working set in SRAM.

Open lab →

LAB · 18

Forward vs Backward FLOPs

Backward is ~2× forward. Total training ~3× forward.

Open lab →
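
The ratios in Lab 18 can be turned into a back-of-the-envelope estimate. This sketch assumes the common rule of thumb of about 2 FLOPs per parameter per token for the forward pass, which is an assumption not stated above and ignores attention-score FLOPs.

```python
def training_flops(n_params, n_tokens):
    """Return (forward, backward, total) FLOPs under the 2x / 3x rule of thumb."""
    forward = 2 * n_params * n_tokens   # assumed: ~2 FLOPs per parameter per token
    backward = 2 * forward              # backward is ~2x forward
    return forward, backward, forward + backward


# Hypothetical run: a 100M-parameter model trained on 1B tokens.
fwd, bwd, total = training_flops(100_000_000, 1_000_000_000)
print(f"forward {fwd:.2e}, backward {bwd:.2e}, total {total:.2e}")
```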

LAB · 19

Gradient Accumulation

K micro-batches build up to an effective large batch.

Open lab →
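
A small sketch of why accumulation in Lab 19 matches a large batch: with equal-sized micro-batches, summing each micro-batch gradient scaled by 1/K reproduces the full-batch mean gradient. The one-parameter model and data below are made up for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, k):
    """Accumulate gradients over k equal micro-batches (len(xs) divisible by k)."""
    size = len(xs) // k
    total = 0.0
    for i in range(k):
        chunk = slice(i * size, (i + 1) * size)
        total += grad_mse(w, xs[chunk], ys[chunk]) / k   # scale by 1/K, then add
    return total


xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5
print(grad_mse(w, xs, ys), accumulated_grad(w, xs, ys, k=2))
```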

LAB · 20

Gradient Clipping in Action

See spikes get truncated to max_norm.

Open lab →
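
The truncation shown in Lab 20 is typically a rescale by global norm: if the gradient norm exceeds `max_norm`, every component is multiplied by `max_norm / norm`. A minimal sketch over a flat list of gradient values:

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Return (clipped_grads, original_norm); direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)
```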

LAB · 21

CPU Inference Latency Breakdown

Per-token time = bandwidth-bound weight reads + compute.

Open lab →

LAB · 22

KV Cache Memory Growth

Watch KV cache memory grow with context length.

Open lab →
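
The growth in Lab 22 is linear in context length. A rough estimate, assuming one key and one value vector per layer, per KV head, per token; the model dimensions below are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for seq_len tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch


# Hypothetical config: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
for seq_len in (1024, 4096, 16384):
    mib = kv_cache_bytes(32, 8, 128, seq_len) / 2**20
    print(f"{seq_len:>6} tokens -> {mib:.0f} MiB")
```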

LAB · 23

LayerNorm Statistics

Watch mean, variance, and normalized output for a tensor.

Open lab →

LAB · 24

Logits to Token (Argmax vs Sample)

See how the final projection produces logits and how decoding picks a token.

Open lab →

LAB · 25

LoRA — Low-Rank Decomposition

Replace ΔW (d×d) with A·B (d×r · r×d).

Open lab →
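
The saving in Lab 25 comes from parameter counts: a full d×d update has d² entries, while the factors A (d×r) and B (r×d) together have 2·d·r. A quick sketch with illustrative sizes:

```python
def full_update_params(d):
    """Entries in a dense d x d update matrix."""
    return d * d


def lora_params(d, r):
    """Entries in A (d x r) plus B (r x d)."""
    return d * r + r * d


d, r = 4096, 8   # illustrative sizes, not tied to a specific model
print(full_update_params(d), lora_params(d, r),
      full_update_params(d) // lora_params(d, r))
```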

LAB · 26

Loss Curves Diagnosis

Healthy, spiky, divergent — what each looks like.

Open lab →

LAB · 27

LR Schedule — Warmup + Cosine

Visualize the canonical LLM training learning rate.

Open lab →
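
One common way to write the warmup + cosine schedule from Lab 27 is linear warmup to `max_lr`, then cosine decay to `min_lr`. The parameter names and the exact warmup form here are assumptions for illustration; implementations vary in the details.

```python
import math


def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 9, 10, 60, 110):
    print(step, round(lr_at(step, max_lr=1.0, warmup_steps=10, total_steps=110), 4))
```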

LAB · 28

Matrix Multiplication Step-Through

Watch Y = X·W computed entry-by-entry.

Open lab →

LAB · 29

CPU Training Memory Calculator

Adjust model size; see RAM needed.

Open lab →

LAB · 30

MoE Top-K Routing

Tokens route to top-K experts; load balance matters.

Open lab →

LAB · 31

Multi-Token Prediction Heads

N heads predicting tokens at offsets +1 through +N (shown here with N = 4: +1, +2, +3, +4).

Open lab →

LAB · 32

Multi-Head Split + Concat

One big projection reshapes into h heads then back.

Open lab →

LAB · 33

SGD vs Adam — Step Trajectories

Two optimizers descending the same loss surface.

Open lab →

LAB · 34

SLM Parameter Breakdown

Where the parameters live: embedding, attention, FFN.

Open lab →

LAB · 35

Perplexity Calculator

Compute perplexity from loss; see what the numbers mean.

Open lab →
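
The conversion in Lab 35 is a single exponential: perplexity = exp(loss), when the loss is the mean cross-entropy per token in nats. A perplexity of 50 reads roughly as "as uncertain as a uniform choice among 50 tokens".

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity from mean per-token cross-entropy measured in nats."""
    return math.exp(mean_loss_nats)


for loss in (1.0, 2.0, 3.0, 4.0):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.1f}")
```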

LAB · 36

Positional Encoding Curves

Sinusoidal at different dimensions = different frequencies.

Open lab →

LAB · 37

Q4_K Block Layout

Super-block of 256 weights = 8 sub-blocks × 32 4-bit values, plus a 6-bit scale and min per sub-block.

Open lab →

LAB · 38

Repetition Penalty

Reduce logits of recent tokens to break loops.

Open lab →
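
One widely used formulation of the penalty in Lab 38 divides positive logits by the penalty and multiplies negative ones, so a penalized token always becomes less likely. This is a sketch of that variant; the lab itself may use a different rule, and the default `penalty` value is only an example.

```python
def apply_repetition_penalty(logits, recent_ids, penalty=1.2):
    """Return a copy of logits with recently generated token ids pushed down."""
    out = list(logits)
    for i in set(recent_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


logits = [2.4, -1.0, 0.5]
print(apply_repetition_penalty(logits, recent_ids=[0, 1, 1]))
```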

LAB · 39

Residual Gradient Flow

With and without residuals: how gradient survives depth.

Open lab →

LAB · 40

RMSNorm vs LayerNorm — Side by Side

See the difference: RMSNorm skips mean centering.

Open lab →
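
The difference highlighted in Lab 40 shows up directly in code. This sketch omits the learnable gain and bias that real layers carry, and the `eps` value is an arbitrary choice.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; no mean centering."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 3) for v in layer_norm(x)])   # centered around zero
print([round(v, 3) for v in rms_norm(x)])     # rescaled, still all positive
```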

LAB · 41

RoPE Extension Strategies

Linear, NTK-aware, YaRN compared.

Open lab →

LAB · 42

Sampling Strategies Compared

See how greedy/top-k/top-p differ on the same distribution.

Open lab →
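
To make the comparison in Lab 42 concrete, here is a sketch of how each strategy narrows one distribution before a token is drawn. The helper names and the example probabilities are made up for illustration.

```python
def _renormalize(probs, keep):
    """Zero out everything not in keep and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def greedy(probs):
    """Index of the single most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


probs = [0.5, 0.3, 0.15, 0.05]
print(greedy(probs), top_k_filter(probs, 2), top_p_filter(probs, 0.9))
```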

LAB · 43

SIMD Register — AVX-512 + AMX

See how one instruction operates on multiple values.

Open lab →

LAB · 44

SLM Architecture Comparison

Phi-3, Qwen 2.5, Gemma 2 — hyperparams side by side.

Open lab →

LAB · 45

Softmax Numerical Stability + Temperature

See the max-subtraction trick and temperature in action.

Open lab →
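
Both ideas in Lab 45 fit in one function: divide the logits by the temperature, then subtract the maximum before exponentiating so `exp` never overflows. Subtracting a constant does not change the result because it cancels in the ratio.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtraction trick
    exps = [math.exp(z - m) for z in scaled]     # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([1000.0, 1001.0, 1002.0]))         # no overflow despite large logits
print(softmax([1.0, 2.0, 3.0], temperature=0.5)) # sharper
print(softmax([1.0, 2.0, 3.0], temperature=2.0)) # flatter
```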

LAB · 46

Speculative Decoding Acceptance

Watch draft tokens get accepted or rejected.

Open lab →

LAB · 47

Tied Embeddings Savings

Untied vs tied params for SLMs.

Open lab →
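
The saving in Lab 47 is one vocab × d matrix: untied models keep separate input-embedding and output-projection matrices, while tied models share one. A sketch with hypothetical sizes:

```python
def embedding_params(vocab_size, d_model, tied):
    """Parameters in the input embedding plus output projection."""
    one_matrix = vocab_size * d_model
    return one_matrix if tied else 2 * one_matrix


vocab_size, d_model = 32_000, 2_048   # hypothetical SLM sizes
untied = embedding_params(vocab_size, d_model, tied=False)
tied = embedding_params(vocab_size, d_model, tied=True)
print(f"untied {untied:,}  tied {tied:,}  saved {untied - tied:,}")
```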

LAB · 48

Tokenizer Compression Comparison

Same text, different tokenizers.

Open lab →

LAB · 49

Weight Initialization Distributions

Xavier, Kaiming, normal — visualized.

Open lab →

LAB · 50

Weight Layout on Disk (GGUF / SafeTensors)

See how tensors are arranged in a binary model file.

Open lab →
