Representation Engineering: the Zou et al. Framework

Read via probes

A linear probe trained on hidden activations detects whether a concept is represented, e.g., a probe for 'model believes X'. Such probes reach high accuracy on many concepts.
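A minimal sketch of a linear probe may make this concrete. It assumes synthetic activations in place of real model hidden states: the helper `make_synthetic_activations`, the dimensions, and the training settings are all hypothetical, and the probe is an ordinary logistic regression fit by gradient descent, not necessarily the procedure used in the paper.

```python
import numpy as np

def make_synthetic_activations(n=400, dim=32, seed=0):
    """Hypothetical stand-in for hidden states: the concept label shifts
    each activation vector along one fixed unit direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    signs = 2 * labels[:, None] - 1            # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, dim)) + 3.0 * signs * direction
    return acts, labels, direction

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels                  # gradient of the log loss w.r.t. logits
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels).mean())

acts, labels, direction = make_synthetic_activations()
w, b = train_linear_probe(acts[:300], labels[:300])      # train split
test_acc = probe_accuracy(w, b, acts[300:], labels[300:])  # held-out split
cosine = float(w @ direction / np.linalg.norm(w))        # alignment with the planted direction
print(f"held-out accuracy: {test_acc:.3f}, cosine with true direction: {cosine:.3f}")
```

With real models, `acts` would be hidden states collected at a chosen layer and token position, and the learned weight vector `w` serves as an estimate of the concept direction.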

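Control via Low-Rank Representation Adaptation (LoRRA). Low-rank adapters are fine-tuned so that activations shift along the concept direction. Because the change is stored in the adapter weights, it persists throughout generation.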
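As a rough illustration of what shifting activations along a concept direction means, the sketch below adds a scaled unit direction to the hidden state at every generation step. This is plain activation addition, a simpler stand-in chosen for brevity, not the low-rank adapter training itself. The function `steer`, the random `concept_direction`, and the strength `alpha` are hypothetical.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift hidden states by alpha along the unit-normalized concept direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
dim = 16

# Hypothetical concept direction; in practice it would come from a reading step.
concept_direction = rng.normal(size=dim)
unit = concept_direction / np.linalg.norm(concept_direction)

# Hypothetical hidden states, one row per generation step.
hidden_per_step = rng.normal(size=(5, dim))
steered = steer(hidden_per_step, concept_direction, alpha=4.0)

# The projection onto the concept direction moves by alpha at every step.
shift = steered @ unit - hidden_per_step @ unit

# Components orthogonal to the concept direction are left unchanged.
orth_before = hidden_per_step - np.outer(hidden_per_step @ unit, unit)
orth_after = steered - np.outer(steered @ unit, unit)
```

Applying the same shift at every step is what makes the effect persist through generation in this toy setting. A trained adapter would aim for a similar effect through its weights instead of an explicit addition.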

Applications: honesty control (steering toward honest output even when the model was trained to be sycophantic), emotion (adjusting output valence), and safety-related concepts.