Discovery

10-100 adversarial examples remove refusal training. Even 100 benign examples degrade safety. Concerning fragility.

Advertisement

Implication for open models

Safety is not integral. Bad actors can trivially disable. Different from API-served model.

Advertisement

Defenses

Safety-preserving fine-tuning (SafeInstr, SafeLoRA). Regenerate safety data + include in fine-tune. Track safety benchmarks pre + post.