A lightweight, opinionated framework for evaluating AI products and agents.


Why We Need This

AI products fail silently. A model that sounds confident is not the same as a model that is correct. Vibes-based testing — running a few prompts and seeing if the output feels right — doesn't scale and doesn't catch failure modes that matter.

Systematic evaluation answers three questions that vibes testing cannot:

1. Where does the product actually fail, and in what ways?
2. How often does each failure mode occur, and with what confidence?
3. Did the latest change make things better or worse?

This suite operationalizes those questions. It is not a benchmark. It is not a dashboard. It is a working methodology for building, calibrating, and shipping AI products with more confidence.


Why This Is Different

| Common Approach | This Framework |
| --- | --- |
| Run prompts, check if output "feels right" | Start with failure analysis — read real traces before measuring anything |
| Generic benchmarks (MMLU, HELM) | Application-specific tasks from your actual use case |
| Single overall score | Per-failure-mode binary judges, each measuring one thing |
| LLM judges without calibration | LLM judges calibrated against human labels (TPR/TNR, corrected pass rate; see the first sketch below) |
| Only LLM judges | Deterministic checks first — code is faster, cheaper, and more reliable |
| No uncertainty quantification | Bootstrap confidence intervals on every pass rate (second sketch below) |
| Manual regression checks | Automated run-to-run verdict diffing |
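
To make the calibration row concrete, here is a minimal sketch of computing a binary judge's TPR/TNR against a human-labeled slice and then correcting the observed pass rate for judge error. The function names and data layout are illustrative assumptions, not this framework's actual API.

```python
# Sketch: calibrate a binary LLM judge against human labels, then
# correct the observed pass rate for judge error.
# Names and data layout are illustrative, not this framework's API.

def judge_rates(judge_verdicts, human_labels):
    """TPR and TNR of the judge on a human-labeled set.

    judge_verdicts, human_labels: parallel lists of booleans
    (True = pass) for the same traces; the labeled set is assumed
    to contain both passing and failing examples.
    """
    tp = sum(j and h for j, h in zip(judge_verdicts, human_labels))
    tn = sum((not j) and (not h) for j, h in zip(judge_verdicts, human_labels))
    positives = sum(human_labels)
    negatives = len(human_labels) - positives
    tpr = tp / positives  # judge says pass when humans say pass
    tnr = tn / negatives  # judge says fail when humans say fail
    return tpr, tnr


def corrected_pass_rate(observed_rate, tpr, tnr):
    """Estimate the true pass rate from the judge's observed pass rate,
    using its calibrated TPR/TNR (a Rogan-Gladen-style correction)."""
    estimate = (observed_rate + tnr - 1) / (tpr + tnr - 1)
    return min(1.0, max(0.0, estimate))  # clamp to a valid rate
```

For example, with TPR = 0.9 and TNR = 0.8, an observed pass rate of 0.75 corrects to (0.75 + 0.8 − 1) / (0.9 + 0.8 − 1) ≈ 0.79. The correction is only meaningful when TPR + TNR > 1, i.e. when the judge beats a coin flip.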
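
And a similarly minimal sketch of the uncertainty row: a percentile-bootstrap confidence interval over per-example binary verdicts, written against the standard library only. A real suite might use numpy or scipy; the idea is the same.

```python
# Sketch: percentile-bootstrap confidence interval on a pass rate.
# Plain standard-library resampling; names are illustrative.
import random


def bootstrap_pass_rate_ci(verdicts, n_resamples=10_000, alpha=0.05, seed=0):
    """95% (by default) percentile CI for the pass rate implied by a
    list of per-example binary verdicts (True = pass)."""
    rng = random.Random(seed)
    n = len(verdicts)
    rates = sorted(
        sum(rng.choices(verdicts, k=n)) / n  # pass rate of one resample
        for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

The interval is what turns "42 of 50 passed" into something a decision can rest on: if the new run's interval still overlaps last week's, the apparent improvement may just be noise.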

The core principle: no eval without an observed failure, no metric without a decision it informs.


What This Is NOT