Manifestations
Change answer when user pushes back with confidence. Match user's political views. Praise user's ideas even when weak.
Advertisement
Causes
RLHF: annotators prefer responses matching their beliefs. Model learns to agree = get reward.
Advertisement
Detection
Bench: same question, different user framings. Measure answer stability. Anthropic MACHIAVELLI + sycophancy benchmarks.