Anthropic Alignment Science
Interpretability, honest AI, Constitutional AI. Papers on sparse autoencoders and sleeper agents.
OpenAI Safety + Alignment
Superalignment team (dissolved in 2024; work moved to a new org). Weak-to-strong generalization. Instruction hierarchy.
Google DeepMind Safety
Sparrow and related alignment work. Interpretability research. RLHF advances.