Evaluation framework for the WordPress-to-HTML conversion pipeline. Defines how model outputs are scored, what thresholds must be met, and why each dimension matters.


Overview

Each model output is scored independently by a Judge Agent (a separate LLM call) across seven weighted dimensions, each on a 0–5 scale. The orchestrator model that generated the output does not evaluate itself; this separation prevents the "yes-man" effect, where a model approves its own flawed work.

Scores are calculated per model and per page type (Simple, Elementor), then averaged into an overall weighted score. To qualify, a model must meet the overall production threshold and every hard minimum floor.


Scoring Dimensions & Weights

| Dimension | Weight | What It Measures | Hard Floor |
| --- | --- | --- | --- |
| Visual Likeness | 25% | Layout accuracy, typography, color fidelity, responsive behaviour. Compared via side-by-side screenshots at desktop and mobile breakpoints. | ≥ 85% |
| Content Likeness | 25% | Text completeness, content order, CTA accuracy, no hallucinated or missing content. Verified via visibility-filtered content diff. | ≥ 95% |
| Interaction Fidelity | 10% | Functional accordions, navigation menus, tabs, dropdowns, and modals. Interactive elements must work, not just exist in markup. | ≥ 70% |
| SEO Fidelity | 10% | Title tags, meta descriptions, Open Graph tags, canonical URLs, JSON-LD structured data, heading hierarchy preserved. | |
| Accessibility | 5% | Semantic HTML elements, ARIA attributes, colour contrast, keyboard navigation support. | |
| Asset Integrity | 5% | All image and font paths resolve locally. Zero external CDN references. No placeholder images or broken asset links. | |
| Turns to Completion | 20% | Number of agent invocations required. Fewer turns = more efficient pipeline. 1–3 turns = 5 pts, 4–5 = 4 pts, 6–7 = 3 pts, 8–9 = 2 pts, 10–12 = 1 pt, 13+ = 0 pts. | |
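
The turn bands for Turns to Completion can be read as a simple lookup. The following is a minimal sketch of that mapping; the function name `turns_to_points` is hypothetical, and treating a turn count below 1 as invalid is an assumption, since the table does not cover that case.

```python
def turns_to_points(turns: int) -> int:
    """Map a number of agent invocations to a 0-5 score using the table's bands."""
    if turns < 1:
        # Assumption: a completed run always takes at least one turn.
        raise ValueError("turns must be at least 1")
    # Each entry is (upper bound of the band, points awarded for that band).
    bands = [(3, 5), (5, 4), (7, 3), (9, 2), (12, 1)]
    for upper_bound, points in bands:
        if turns <= upper_bound:
            return points
    return 0  # 13 or more turns


if __name__ == "__main__":
    for n in (1, 4, 7, 9, 12, 13):
        print(n, "turns ->", turns_to_points(n), "pts")
```

Keeping the bands in a list rather than a chain of conditionals makes it easier to adjust the cut-offs if the rubric changes.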

Scoring Scale

| Score | Label | Meaning |
| --- | --- | --- |
| 0 | Complete failure | Dimension entirely missing or non-functional |
| 1 | Major gaps | Present but with critical deficiencies |
| 2 | Partial | Roughly half of expected quality achieved |
| 3 | Acceptable | Functional with noticeable issues |
| 4 | Good | Minor issues only, production-viable with light touch-ups |
| 5 | Indistinguishable | Matches source page; no meaningful difference |
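
To make the arithmetic concrete, here is a rough sketch of how 0–5 dimension scores might be combined into a weighted percentage. The weights come from the Scoring Dimensions & Weights table. Everything else is an assumption for illustration: the dictionary keys, the function names, normalising each score by dividing by 5, and averaging page types with a plain mean.

```python
from __future__ import annotations

# Weights from the Scoring Dimensions & Weights table (they sum to 1.0).
# The key names are hypothetical identifiers chosen for this sketch.
WEIGHTS = {
    "visual_likeness": 0.25,
    "content_likeness": 0.25,
    "interaction_fidelity": 0.10,
    "seo_fidelity": 0.10,
    "accessibility": 0.05,
    "asset_integrity": 0.05,
    "turns_to_completion": 0.20,
}
MAX_SCORE = 5  # top of the 0-5 scoring scale


def weighted_percent(scores: dict[str, float]) -> float:
    """Weighted score for one page type, as a percentage (0-100)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    # Assumption: each 0-5 score is normalised to 0-1 before weighting.
    return 100 * sum(WEIGHTS[d] * scores[d] / MAX_SCORE for d in WEIGHTS)


def overall_percent(by_page_type: dict[str, dict[str, float]]) -> float:
    """Assumption: the overall score is the unweighted mean across page types."""
    per_type = [weighted_percent(s) for s in by_page_type.values()]
    return sum(per_type) / len(per_type)


if __name__ == "__main__":
    simple = {d: 5 for d in WEIGHTS}     # every dimension scored 5
    elementor = {d: 4 for d in WEIGHTS}  # every dimension scored 4
    result = overall_percent({"Simple": simple, "Elementor": elementor})
    print(round(result, 1))  # 90.0
```

In this example the Simple page scores 100% and the Elementor page 80%, so the mean is 90%.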

Production Threshold

An overall weighted score of 85% or higher is required for a model to qualify for production use. Models below this threshold are not considered, regardless of cost or speed advantages.
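
As an illustration, the qualification rule could be checked along the following lines. This is a sketch under stated assumptions: the function and key names are hypothetical, the floor values are the ones listed in the Hard Floor column of the dimensions table, and each dimension is assumed to be already expressed as a percentage (for example, a 0–5 score divided by 5).

```python
from __future__ import annotations

PRODUCTION_THRESHOLD = 85.0  # overall weighted score, in percent

# Hard floors from the dimensions table; key names are hypothetical.
HARD_FLOORS = {
    "visual_likeness": 85.0,
    "content_likeness": 95.0,
    "interaction_fidelity": 70.0,
}


def qualifies(overall: float, dimension_percents: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); reasons lists every failed check."""
    reasons = []
    if overall < PRODUCTION_THRESHOLD:
        reasons.append(f"overall {overall:.1f}% is below {PRODUCTION_THRESHOLD:.0f}%")
    for dimension, floor in HARD_FLOORS.items():
        value = dimension_percents[dimension]
        if value < floor:
            reasons.append(f"{dimension} {value:.1f}% is below floor {floor:.0f}%")
    return (not reasons, reasons)


if __name__ == "__main__":
    passed, why = qualifies(
        90.0,
        {"visual_likeness": 90.0, "content_likeness": 90.0, "interaction_fidelity": 80.0},
    )
    print(passed, why)
```

Here the model clears the 85% overall threshold but is rejected because Content Likeness falls below its 95% floor. Collecting all reasons, rather than stopping at the first failure, gives a fuller picture when comparing models.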


Hard Minimum Floors

Even if a model's overall weighted score exceeds 85%, it cannot pass if any mission-critical dimension falls below its hard floor: