Anthropic Alignment Science

Interpretability, honest AI, and Constitutional AI. Papers on sparse autoencoders and sleeper agents.

OpenAI Safety + Alignment

Superalignment team (dissolved in 2024 and reorganized). Weak-to-strong generalization. Instruction hierarchy.

Google DeepMind Safety

Sparrow and related alignment work. Interpretability research. Advances in RLHF.