How do you know your prompt change is better? Traditional reference-based metrics (BLEU/ROUGE) work for translation but fail for open-ended generation. Modern evaluation combines task-specific automated checks, LLM-as-judge for subjective quality, and human spot-checks on a calibration set.
Reference-based metrics
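BLEU, ROUGE, and METEOR score an output by how closely its wording overlaps with a gold-standard reference. They work well for translation and summarization, where a good output stays close to the reference. They are of little use for chatbots: many different answers can be right, so low overlap with one reference says little about quality.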
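To make the overlap idea concrete, here is a minimal sketch of a unigram-overlap F1, which is roughly what ROUGE-1 measures. It is a simplified stand-in, not the official BLEU or ROUGE implementation: the function name `unigram_f1` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style score: F1 over shared lowercase word counts."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Two reasonable chatbot replies to "my internet is down" share no words,
# so the metric scores a perfectly good answer as a total miss.
chat_score = unigram_f1("Sure, restart the router", "Try power cycling your modem")
```

Here `chat_score` comes out as 0.0 even though both replies are sensible, which is the failure mode described above.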
LLM-as-judge
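Use a strong model (GPT-4o, Claude Opus), ideally stronger than the one being evaluated, to score your output. Standard prompt: 'Rate this response 1-5 on helpfulness, correctness, conciseness.' Ask for chain-of-thought reasoning before the rating so the score follows from the reasoning. Judge scores correlate roughly 0.7-0.8 with human judgment, which is good enough for relative comparisons between prompt versions.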
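A judge setup along these lines could be sketched as follows. The prompt wording, the `criterion: score` output format, and the names `build_judge_prompt`, `parse_scores`, and `judge` are all assumptions for illustration. The actual model call is left as a hypothetical `call_model` callable so the sketch does not depend on any particular provider SDK.

```python
import re
from typing import Callable, Dict

CRITERIA = ("helpfulness", "correctness", "conciseness")

# Hypothetical prompt wording: reasoning first, then one score line per criterion.
JUDGE_TEMPLATE = """You are grading a response to a user question.
Question: {question}
Response: {response}

First explain your reasoning step by step.
Then give one line per criterion in the form "criterion: score",
using an integer score from 1 to 5, for: {criteria}."""


def build_judge_prompt(question: str, response: str) -> str:
    return JUDGE_TEMPLATE.format(
        question=question, response=response, criteria=", ".join(CRITERIA)
    )


def parse_scores(judge_output: str) -> Dict[str, int]:
    """Extract the 1-5 score for each criterion from the judge's text."""
    scores = {}
    for name in CRITERIA:
        pattern = rf"{re.escape(name)}\s*:\s*([1-5])\b"
        matches = re.findall(pattern, judge_output, flags=re.IGNORECASE)
        if not matches:
            raise ValueError(f"no score found for {name}")
        # Take the last match so scores stated after the reasoning win.
        scores[name] = int(matches[-1])
    return scores


def judge(question: str, response: str, call_model: Callable[[str], str]) -> Dict[str, int]:
    """call_model is a placeholder for whatever sends a prompt to the judge model."""
    return parse_scores(call_model(build_judge_prompt(question, response)))
```

Raising on a missing score, rather than defaulting to a middle value, keeps malformed judge outputs from quietly skewing the averages.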
Task-specific checks
Function calling? Verify the function name and required args. JSON output? Validate against a schema. SQL generation? Execute against a test DB. Code? Run tests. These deterministic checks catch the most common failures cheaply.
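As one example of such a check, the sketch below validates a function call emitted as JSON. The `{"name": ..., "arguments": {...}}` shape, the `TOOLS` registry, and the `get_weather` tool are hypothetical and stand in for whatever format your model and tools actually use. It relies only on the standard library.

```python
import json
from typing import Dict, List

# Hypothetical tool registry: required argument names mapped to expected types.
TOOLS: Dict[str, Dict[str, type]] = {
    "get_weather": {"city": str},
}


def check_function_call(raw: str, tools: Dict[str, Dict[str, type]] = TOOLS) -> List[str]:
    """Return a list of problems; an empty list means the call passed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]
    name = call.get("name")
    if not isinstance(name, str) or name not in tools:
        return [f"unknown function: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not an object"]
    errors = []
    for arg, expected_type in tools[name].items():
        if arg not in args:
            errors.append(f"missing required argument: {arg}")
        elif not isinstance(args[arg], expected_type):
            errors.append(f"wrong type for argument: {arg}")
    return errors
```

Returning a list of problems instead of a boolean makes it easy to count which failure types dominate across a test run.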
Calibration set
Maintain 100-500 examples with human-curated 'good' answers. Run every prompt change against this set; flag regressions. Update quarterly as your product evolves. This is your evaluation safety net.
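The regression check can be sketched as a comparison of per-example scores between the old and new prompt. Everything named here is an assumption for illustration: the `Example` record, the `find_regressions` helper, the `tolerance` default, and the exact-match scorer, which you would swap for whichever metric or judge fits your task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    example_id: str
    reference: str  # the human-curated 'good' answer


def exact_match(output: str, reference: str) -> float:
    """Placeholder scorer: 1.0 if the texts match after trimming, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def find_regressions(
    examples: List[Example],
    old_outputs: Dict[str, str],
    new_outputs: Dict[str, str],
    score: Callable[[str, str], float] = exact_match,
    tolerance: float = 0.05,
) -> List[str]:
    """Return ids of examples whose score dropped by more than `tolerance`."""
    regressed = []
    for ex in examples:
        old_score = score(old_outputs[ex.example_id], ex.reference)
        new_score = score(new_outputs[ex.example_id], ex.reference)
        if new_score < old_score - tolerance:
            regressed.append(ex.example_id)
    return regressed
```

Flagging per-example drops, not just a change in the average, surfaces cases where a prompt change fixes some examples while breaking others.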
Continuous online eval
Sample 1% of production traffic into an eval queue. Score with LLM-as-judge. Track the average score over time and alert when it drifts more than 5% from its baseline. This catches model deprecations and data drift before users complain.
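A minimal sketch of the sampling and alerting logic is below. It assumes each request has a string id, and it reads "drifts >5%" as a relative change in the mean score against a stored baseline; that reading, and the names `in_eval_sample` and `drifted`, are assumptions, not the only reasonable choice.

```python
import hashlib
from typing import Sequence


def in_eval_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically route about `rate` of requests to the eval queue."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64


def drifted(baseline_mean: float, recent_scores: Sequence[float], threshold: float = 0.05) -> bool:
    """True when the recent mean moved more than `threshold` relative to baseline."""
    if not recent_scores or baseline_mean == 0:
        return False
    recent_mean = sum(recent_scores) / len(recent_scores)
    return abs(recent_mean - baseline_mean) / abs(baseline_mean) > threshold
```

Hashing the request id, instead of drawing a random number per request, means the same request always gets the same sampling decision, which makes the eval queue reproducible when logs are replayed.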