Read via probes
A linear probe trained on activations detects whether a concept is represented, e.g. a probe for 'model believes X'. Such probes reach high accuracy on many concepts.
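A minimal sketch of the probing idea, assuming synthetic activations in place of real model hidden states. The helper names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) and all hyperparameters are hypothetical and chosen only for illustration; the probe here is plain logistic regression trained by gradient descent.

```python
import numpy as np

def make_synthetic_activations(n=400, dim=16, seed=0):
    """Toy stand-in for hidden states (assumption): the concept label
    shifts each activation along one hidden unit direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    signs = 2 * labels - 1  # map {0, 1} -> {-1, +1}
    acts = rng.normal(size=(n, dim)) + 3.0 * np.outer(signs, direction)
    return acts, labels, direction

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Logistic regression fitted by full-batch gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of examples whose sign of the logit matches the label."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())

acts, labels, direction = make_synthetic_activations()
w, b = train_linear_probe(acts[:300], labels[:300])       # fit on 300 examples
held_out_acc = probe_accuracy(acts[300:], labels[300:], w, b)  # test on the rest
cosine = float(w @ direction / np.linalg.norm(w))  # alignment with planted direction
print(f"held-out accuracy: {held_out_acc:.2f}, cosine with planted direction: {cosine:.2f}")
```

With real models the activations would come from a chosen layer and token position, and the probe's accuracy should be compared against a control task before reading much into it.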
Control via LAT
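LAT (Linear Artificial Tomography) extracts a concept direction from activations. Control modifies the activation flow along that direction, for example through a low-rank adapter. The change persists through generation.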
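The idea of modifying activations along a concept direction can be illustrated with a small sketch. It assumes synthetic activations, a difference-of-means direction, and plain additive steering rather than a low-rank adapter; `concept_direction`, `steer`, and `alpha` are hypothetical names for illustration. Applying the same shift at every decoding step is what makes the effect persist through generation.

```python
import numpy as np

def concept_direction(pos_acts, neg_acts):
    """Unit vector from the difference of class means
    (one simple way to obtain a concept direction; an assumption here)."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Shift a hidden state along the concept direction by alpha."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
dim = 8
true_dir = np.eye(dim)[0]  # planted concept axis for the toy data

# Toy activations for examples with and without the concept
pos = rng.normal(size=(200, dim)) + 2.0 * true_dir
neg = rng.normal(size=(200, dim)) - 2.0 * true_dir
direction = concept_direction(pos, neg)

# Toy "generation": one hidden state per decoding step,
# with the same shift applied at every step
hidden_per_step = rng.normal(size=(5, dim))
steered = np.stack([steer(h, direction, alpha=4.0) for h in hidden_per_step])

before = hidden_per_step @ direction  # projection onto the concept before steering
after = steered @ direction           # projection after steering
print(np.round(after - before, 3))
```

In a real model the shift would typically be applied inside the forward pass at selected layers, and `alpha` would need tuning so that the output changes without degrading fluency.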
Applications
Honesty control (steer toward honest output even when the model was trained to be sycophantic). Emotion (adjust the valence of the output). Safety concepts.