Constitution
Explicit written rules: helpful/harmless/honest hierarchy. Concrete examples of desired behavior. Anthropic's public constitution as template.
Advertisement
Self-critique loop
Model generates → self-critiques per principle → revises. Trained model internalizes.
Advertisement
Advantages
Auditable (constitution readable). Updateable (change principles → retrain). Reduces annotator burden.