How do you know your prompt change is better? Traditional metrics (BLEU/ROUGE) work for translation but fail for open-ended generation. Modern evaluation combines task-specific automated checks, LLM-as-judge for subjective quality, and human spot-checks on a calibration set.

Reference-based metrics

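BLEU, ROUGE, and METEOR score an output by how much of its wording overlaps with a gold-standard reference. They work well for translation and reasonably well for summarization, where a good output is expected to stay close to the reference. They are a poor fit for chatbots, where many differently worded responses can be equally right.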
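
To make the limitation concrete, here is a minimal sketch of a ROUGE-1-style unigram F1. It is a simplified stand-in written for illustration, not the official BLEU or ROUGE implementation; the function name `unigram_f1` and the whitespace tokenization are assumptions.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
exact = unigram_f1("the cat sat on the mat", reference)
paraphrase = unigram_f1("a feline rested on the rug", reference)
print(f"exact match: {exact:.2f}, paraphrase: {paraphrase:.2f}")
```

The paraphrase says roughly the same thing as the reference but shares only two tokens with it, so it scores far lower than the exact match. That is the failure mode for open-ended generation.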

LLM-as-judge

Use a stronger model (GPT-4o, Claude Opus) to score your output. Standard prompt: 'Rate this response 1-5 on helpfulness, correctness, conciseness.' Ask the judge for chain-of-thought reasoning before the rating. Judge scores are commonly reported to correlate around 0.7-0.8 with human judgment, which is good enough for relative comparisons between prompt versions.
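
A minimal sketch of the judge plumbing, assuming you call the judge model through whatever client you already use. The template wording, the `RATING:` line convention, and the helper names `build_judge_prompt` and `parse_rating` are illustrative assumptions, not a standard format. No API call is made here.

```python
import re
from typing import Optional

# Hypothetical template: reasoning first, then a machine-parseable final line.
JUDGE_TEMPLATE = """You are grading a response to a user request.

Request:
{request}

Response:
{response}

First, reason step by step about helpfulness, correctness, and conciseness.
Then finish with a final line in exactly this form:
RATING: <integer 1-5>"""


def build_judge_prompt(request: str, response: str) -> str:
    """Fill the template with the request and the response under evaluation."""
    return JUDGE_TEMPLATE.format(request=request, response=response)


def parse_rating(judge_output: str) -> Optional[int]:
    """Return the last well-formed 'RATING: n' line (n in 1-5), or None."""
    matches = re.findall(r"^RATING:\s*([1-5])\s*$", judge_output, flags=re.MULTILINE)
    return int(matches[-1]) if matches else None


sample_judge_output = "The answer is correct but wordy.\nRATING: 4"
print(parse_rating(sample_judge_output))
```

Returning `None` for unparseable output lets you count judge failures separately instead of silently treating them as low scores.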

Task-specific checks

Function calling? Verify the function name and required args. JSON output? Validate it against the schema. SQL generation? Execute it against a test DB. Code? Run the tests. These deterministic checks catch the most common failures cheaply.
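
As one example of such a check, the sketch below validates a function call emitted as JSON. The payload shape (`name` plus an `arguments` object) and the `get_weather` function are assumptions for illustration; adapt them to whatever format your model actually emits.

```python
import json
from typing import List, Set


def check_function_call(raw: str, expected_name: str, required_args: Set[str]) -> List[str]:
    """Return a list of failure messages; an empty list means the call passes."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    errors = []
    if call.get("name") != expected_name:
        errors.append(f"expected function {expected_name!r}, got {call.get('name')!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        errors.append("arguments is missing or not an object")
    else:
        missing = sorted(required_args - set(args))
        if missing:
            errors.append(f"missing required args: {missing}")
    return errors


good = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}'
bad = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
print(check_function_call(good, "get_weather", {"city", "unit"}))
print(check_function_call(bad, "get_weather", {"city", "unit"}))
```

Collecting every failure rather than stopping at the first one makes the eval report more useful when a prompt change breaks several things at once.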

Calibration set

Maintain 100-500 examples with human-curated 'good' answers. Run every prompt change against this set; flag regressions. Update quarterly as your product evolves. This is your evaluation safety net.
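
A regression check over the calibration set can be sketched as follows. It assumes each example already has a score under the current (baseline) prompt and under the candidate prompt; the `ScoredExample` type, the field names, and the tolerance value are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredExample:
    example_id: str
    baseline_score: float   # score under the current prompt
    candidate_score: float  # score under the proposed prompt


def find_regressions(examples: List[ScoredExample], tolerance: float = 0.0) -> List[str]:
    """IDs of examples whose candidate score dropped by more than `tolerance`."""
    return [
        e.example_id
        for e in examples
        if e.candidate_score < e.baseline_score - tolerance
    ]


def mean_delta(examples: List[ScoredExample]) -> float:
    """Average change in score (candidate minus baseline) across the set."""
    return sum(e.candidate_score - e.baseline_score for e in examples) / len(examples)


calibration = [
    ScoredExample("ex-001", 4.0, 5.0),
    ScoredExample("ex-002", 5.0, 3.0),
    ScoredExample("ex-003", 3.0, 3.0),
]
print(find_regressions(calibration, tolerance=0.5))
print(mean_delta(calibration))
```

Reporting per-example regressions alongside the mean matters because an average can improve while individual important cases get worse.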

Continuous online eval

Sample 1% of production traffic into an eval queue. Score it with LLM-as-judge. Track the score over time and alert when it drifts by more than 5% from its baseline. This catches model deprecations and data drift before users complain.
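
The sampling and alerting logic might look like the sketch below. It assumes each request has a stable string ID and interprets "drifts >5%" as a relative change from a baseline mean, which is one reasonable reading; the function names are hypothetical.

```python
import hashlib


def in_eval_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes to an integer in [0, 2**64) and compare to the cutoff.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def drifted(baseline_mean: float, current_mean: float, threshold: float = 0.05) -> bool:
    """True when the current mean differs from baseline by more than `threshold` (relative)."""
    if baseline_mean == 0:
        raise ValueError("baseline_mean must be non-zero for a relative comparison")
    return abs(current_mean - baseline_mean) / abs(baseline_mean) > threshold


sampled = sum(in_eval_sample(f"req-{i}") for i in range(20000))
print(f"sampled {sampled} of 20000 requests")
print(drifted(baseline_mean=4.0, current_mean=3.7))
```

Hashing the ID instead of calling a random number generator means the same request is always in or out of the sample, which makes reruns and debugging reproducible.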

In short: automated checks for objective criteria, LLM-as-judge for subjective quality, a calibration set for regressions, and a 1% online sample for drift.