A lightweight, opinionated framework for evaluating AI products and agents.
## Why We Need This
AI products fail silently. A model that sounds confident is not the same as a model that is correct. Vibes-based testing — running a few prompts and seeing if the output feels right — doesn't scale and doesn't catch failure modes that matter.
Systematic evaluation answers three questions that vibes testing cannot:

- Where does the product actually fail, and in what specific ways?
- How often does each failure mode occur, and how much should we trust that number?
- Did the latest change make things better or worse?
This suite operationalizes those questions. It is not a benchmark. It is not a dashboard. It is a working methodology for building, calibrating, and shipping AI products with more confidence.
## Why This Is Different
| Common Approach | This Framework |
|---|---|
| Run prompts, check if output "feels right" | Start with failure analysis — read real traces before measuring anything |
| Generic benchmarks (MMLU, HELM) | Application-specific tasks from your actual use case |
| Single overall score | Per-failure-mode binary judges, each measuring one thing |
| LLM judges without calibration | LLM judges calibrated against human labels (TPR/TNR, corrected pass rate) |
| Only LLM judges | Deterministic checks first — code is faster, cheaper, and more reliable |
| No uncertainty quantification | Bootstrap confidence intervals on every pass rate |
| Manual regression checks | Automated run-to-run verdict diffing |
The core principle: no eval without an observed failure, no metric without a decision it informs.
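To make the table's right-hand column concrete, the sketches below show what four of those rows might look like in Python. All function names, parameters, and thresholds are illustrative, not this framework's actual API. First, judge calibration: once a judge's true-positive rate (TPR) and true-negative rate (TNR) have been measured against human labels, the true pass rate can be backed out of the observed one with the standard Rogan-Gladen correction.

```python
def corrected_pass_rate(observed: float, tpr: float, tnr: float) -> float:
    """Back out the true pass rate from a judge's observed pass rate.

    observed = true * TPR + (1 - true) * (1 - TNR), solved for `true`
    (the Rogan-Gladen estimator). TPR and TNR come from scoring the
    judge against human labels on a calibration set.
    """
    if tpr + tnr <= 1.0:
        raise ValueError("judge is no better than chance; correction undefined")
    true_rate = (observed + tnr - 1.0) / (tpr + tnr - 1.0)
    return min(1.0, max(0.0, true_rate))  # clamp sampling noise into [0, 1]


# A judge that catches 95% of real passes and 90% of real failures,
# reporting an 80% observed pass rate, implies a true rate of ~82.4%:
print(corrected_pass_rate(0.80, tpr=0.95, tnr=0.90))
```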
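"Deterministic checks first" means running cheap, exact code-level checks before any LLM judge sees the output. A minimal sketch, where the required keys (`answer`, `sources`) are invented for illustration:

```python
import json


def deterministic_checks(output: str) -> str | None:
    """Cheap, exact checks that run before any LLM judge is invoked.
    Returns a failure reason, or None if all checks pass."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return "output is not valid JSON"
    # These required keys are placeholders for this sketch.
    if missing := {"answer", "sources"} - set(payload):
        return f"missing required keys: {sorted(missing)}"
    if not payload["sources"]:
        return "empty sources list"
    return None  # only now is it worth paying for an LLM judge
```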
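The bootstrap confidence intervals can be as simple as resampling the per-task verdicts and taking percentiles; the resample count and seed below are arbitrary choices for the sketch.

```python
import random


def bootstrap_ci(verdicts: list[bool], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for a binary pass rate."""
    rng = random.Random(seed)
    n = len(verdicts)
    rates = sorted(sum(rng.choices(verdicts, k=n)) / n
                   for _ in range(n_resamples))
    return (rates[int(n_resamples * alpha / 2)],
            rates[int(n_resamples * (1 - alpha / 2)) - 1])


# 42/50 tasks passing is not just "84%": at n=50 the point estimate
# carries a 95% interval of roughly +/- 10 points.
print(bootstrap_ci([True] * 42 + [False] * 8))
```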
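Finally, run-to-run verdict diffing reduces to comparing two maps of task ID to verdict; the bucket names here are illustrative.

```python
def diff_verdicts(baseline: dict[str, bool],
                  candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket per-task verdict changes between two eval runs."""
    shared = baseline.keys() & candidate.keys()
    return {
        "regressions": sorted(t for t in shared if baseline[t] and not candidate[t]),
        "fixes": sorted(t for t in shared if candidate[t] and not baseline[t]),
        "coverage_drift": sorted(baseline.keys() ^ candidate.keys()),
    }
```

Wired into CI, a diff like this turns "did anything regress?" from a manual spot-check into a mechanical question.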
## What This Is NOT