AI Security Research Organizations

Anthropic Alignment Science

Interpretability, honest AI, constitutional AI. Papers on sparse autoencoders, sleeper agents.

OpenAI

Superalignment (dissolved 2024, new org). Weak-to-strong generalization. Instruction hierarchy.

Google DeepMind

Sparrow, alignment work. Interpretability research. RLHF advances.