A lightweight, opinionated framework for evaluating AI products and agents.


Why We Need This

AI products fail silently. A model that sounds confident is not the same as a model that is correct. Vibes-based testing — running a few prompts and seeing if the output feels right — doesn't scale and doesn't catch failure modes that matter.

Systematic evaluation answers three questions that vibes testing cannot:

1. Where does the product actually fail, and in what ways?
2. How often does each failure mode occur, and with what confidence?
3. Did the latest change make things better or worse?

This suite operationalizes those questions. It is not a benchmark. It is not a dashboard. It is a working methodology for building, calibrating, and shipping AI products with more confidence.


Why This Is Different

| Common Approach | This Framework |
| --- | --- |
| Run prompts, check if output "feels right" | Start with failure analysis — read real traces before measuring anything |
| Generic benchmarks (MMLU, HELM) | Application-specific tasks from your actual use case |
| Single overall score | Per-failure-mode binary judges, each measuring one thing |
| LLM judges without calibration | LLM judges calibrated against human labels (TPR/TNR, corrected pass rate; see the first sketch below) |
| Only LLM judges | Deterministic checks first — code is faster, cheaper, and more reliable |
| No uncertainty quantification | Bootstrap confidence intervals on every pass rate (second sketch below) |
| Manual regression checks | Automated run-to-run verdict diffing |
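
To make the calibration row concrete, here is a minimal sketch of computing a binary judge's TPR/TNR against a human-labeled slice and then correcting the observed pass rate for judge error. The function names and data layout are illustrative assumptions, not this framework's actual API.

```python
# Sketch: calibrate a binary LLM judge against human labels, then
# correct the observed pass rate for judge error.
# Names and data layout are illustrative, not this framework's API.

def judge_rates(judge_verdicts, human_labels):
    """TPR and TNR of the judge on a human-labeled set.

    judge_verdicts, human_labels: parallel lists of booleans
    (True = pass) for the same traces; the labeled set is assumed
    to contain both passing and failing examples.
    """
    tp = sum(j and h for j, h in zip(judge_verdicts, human_labels))
    tn = sum((not j) and (not h) for j, h in zip(judge_verdicts, human_labels))
    positives = sum(human_labels)
    negatives = len(human_labels) - positives
    tpr = tp / positives  # judge says pass when humans say pass
    tnr = tn / negatives  # judge says fail when humans say fail
    return tpr, tnr


def corrected_pass_rate(observed_rate, tpr, tnr):
    """Estimate the true pass rate from the judge's observed pass rate,
    using its calibrated TPR/TNR (a Rogan-Gladen-style correction)."""
    estimate = (observed_rate + tnr - 1) / (tpr + tnr - 1)
    return min(1.0, max(0.0, estimate))  # clamp to a valid rate
```

For example, with TPR = 0.9 and TNR = 0.8, an observed pass rate of 0.75 corrects to (0.75 + 0.8 − 1) / (0.9 + 0.8 − 1) ≈ 0.79. The correction is only meaningful when TPR + TNR > 1, i.e. when the judge beats a coin flip.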
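
And a similarly minimal sketch of the uncertainty row: a percentile-bootstrap confidence interval over per-example binary verdicts, written against the standard library only. A real suite might use numpy or scipy; the idea is the same.

```python
# Sketch: percentile-bootstrap confidence interval on a pass rate.
# Plain standard-library resampling; names are illustrative.
import random


def bootstrap_pass_rate_ci(verdicts, n_resamples=10_000, alpha=0.05, seed=0):
    """95% (by default) percentile CI for the pass rate implied by a
    list of per-example binary verdicts (True = pass)."""
    rng = random.Random(seed)
    n = len(verdicts)
    rates = sorted(
        sum(rng.choices(verdicts, k=n)) / n  # pass rate of one resample
        for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

The interval is what turns "42 of 50 passed" into something a decision can rest on: if the new run's interval still overlaps last week's, the apparent improvement may just be noise.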

The core principle: no eval without an observed failure, no metric without a decision it informs.


What This Is NOT