Evaluation framework for the WordPress-to-HTML conversion pipeline. Defines how model outputs are scored, what thresholds must be met, and why each dimension matters.
Each model output is scored independently by a Judge Agent (a separate LLM call) across seven weighted dimensions on a 0–5 scale. The orchestrator model that generated the output does not evaluate itself — this separation prevents the "yes-man" effect where models approve their own flawed work.
Scores are calculated per model and per page type (Simple, Elementor), then averaged across page types to give an overall weighted score. To qualify, a model must meet both the overall production threshold and every hard minimum floor.
| Dimension | Weight | What It Measures | Hard Floor |
|---|---|---|---|
| Visual Likeness | 25% | Layout accuracy, typography, color fidelity, responsive behaviour. Compared via side-by-side screenshots at desktop and mobile breakpoints. | ≥ 85% |
| Content Likeness | 25% | Text completeness, content order, CTA accuracy, no hallucinated or missing content. Verified via visibility-filtered content diff. | ≥ 95% |
| Interaction Fidelity | 10% | Functional accordions, navigation menus, tabs, dropdowns, and modals. Interactive elements must work, not just exist in markup. | ≥ 70% |
| SEO Fidelity | 10% | Preservation of title tags, meta descriptions, Open Graph tags, canonical URLs, JSON-LD structured data, and heading hierarchy. | — |
| Accessibility | 5% | Semantic HTML elements, ARIA attributes, colour contrast, keyboard navigation support. | — |
| Asset Integrity | 5% | All image and font paths resolve locally. Zero external CDN references. No placeholder images or broken asset links. | — |
| Turns to Completion | 20% | Number of agent invocations required; fewer turns indicate a more efficient pipeline. Points: 1–3 turns = 5 pts, 4–5 = 4 pts, 6–7 = 3 pts, 8–9 = 2 pts, 10–12 = 1 pt, 13+ = 0 pts. | — |
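
The weights and turn bands in the table above can be combined as in the following minimal sketch. The function and key names are hypothetical, and two points are assumptions rather than facts from this document: that a percentage is the weighted 0–5 score divided by 5, and that page types are combined with a plain unweighted mean.

```python
from __future__ import annotations

# Weights copied from the dimension table (they sum to 100%).
WEIGHTS = {
    "visual_likeness": 0.25,
    "content_likeness": 0.25,
    "interaction_fidelity": 0.10,
    "seo_fidelity": 0.10,
    "accessibility": 0.05,
    "asset_integrity": 0.05,
    "turns_to_completion": 0.20,
}

MAX_SCORE = 5  # top of the 0-5 scale


def turns_to_points(turns: int) -> int:
    """Map the number of agent invocations to 0-5 points using the table's bands."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    for upper_bound, points in ((3, 5), (5, 4), (7, 3), (9, 2), (12, 1)):
        if turns <= upper_bound:
            return points
    return 0  # 13 or more turns


def weighted_percent(scores: dict[str, float]) -> float:
    """Weighted 0-5 score as a percentage of the maximum.

    Assumption: percent = weighted score / 5 * 100 (not stated in the source).
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the seven dimensions")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return 100 * total / MAX_SCORE


def overall_percent(per_page_type: dict[str, dict[str, float]]) -> float:
    """Mean of the per-page-type percentages.

    Assumption: an unweighted mean; the source only says 'averaged'.
    """
    values = [weighted_percent(scores) for scores in per_page_type.values()]
    return sum(values) / len(values)
```

For example, dimension scores of 5, 5, 4, 4, 4 and 5 (in table order) with four turns (4 pts) give a weighted score of 4.55 out of 5, which is 91% under the assumption above.
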

| Score | Label | Meaning |
|---|---|---|
| 0 | Complete failure | Dimension entirely missing or non-functional |
| 1 | Major gaps | Present but with critical deficiencies |
| 2 | Partial | Roughly half of expected quality achieved |
| 3 | Acceptable | Functional with noticeable issues |
| 4 | Good | Minor issues only, production-viable with light touch-ups |
| 5 | Indistinguishable | Matches source page — no meaningful difference |

A model must achieve an overall weighted score of at least 85% to qualify for production use. Models below this threshold are not considered, regardless of cost or speed advantages.
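
As a rough illustration, the production threshold and the hard floors from the dimension table could be enforced with a gate like the one below. This is a sketch, not the pipeline's actual code: `qualifies` and the dictionary keys are hypothetical names, and all inputs are assumed to already be percentages.

```python
from __future__ import annotations

PRODUCTION_THRESHOLD = 85.0  # minimum overall weighted score, in percent

# Hard floors from the dimension table, in percent.
HARD_FLOORS = {
    "visual_likeness": 85.0,
    "content_likeness": 95.0,
    "interaction_fidelity": 70.0,
}


def qualifies(overall_pct: float, dimension_pct: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failure_reasons) for one model."""
    failures = []
    if overall_pct < PRODUCTION_THRESHOLD:
        failures.append(f"overall {overall_pct:.1f}% < {PRODUCTION_THRESHOLD:.0f}%")
    for name, floor in HARD_FLOORS.items():
        value = dimension_pct.get(name)
        if value is None:
            # A floored dimension with no score cannot be shown to pass.
            failures.append(f"{name} missing")
        elif value < floor:
            failures.append(f"{name} {value:.1f}% < floor {floor:.0f}%")
    return (not failures, failures)
```

Returning the list of reasons alongside the verdict keeps it visible whether a model failed on the overall score, on a floor, or on both.
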
Even if a model's overall weighted score exceeds 85%, it cannot pass if any mission-critical dimension falls below its hard floor: