Scoring Methodology

Evaluation framework for the WordPress-to-HTML conversion pipeline. Defines how model outputs are scored, what thresholds must be met, and why each dimension matters.

Overview

Each model output is scored independently by a Judge Agent (a separate LLM call) across seven weighted dimensions on a 0–5 scale. The orchestrator model that generated the output does not evaluate itself — this separation prevents the "yes-man" effect where models approve their own flawed work.

Scores are calculated per model and per page type (Simple, Elementor), then averaged across page types to give the model's overall weighted score. A model must meet both the overall production threshold and all hard minimum floors to qualify.
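As a rough illustration of the averaging step, the sketch below combines per-page-type weighted scores into one overall figure. It assumes an unweighted mean across page types and scores already expressed as percentages; the function name `overall_score` and the sample numbers are hypothetical, not taken from the pipeline.

```python
from statistics import mean


def overall_score(per_page_type: dict[str, float]) -> float:
    """Average per-page-type weighted scores (in percent) into one overall score.

    Assumption: every page type counts equally (plain arithmetic mean).
    """
    if not per_page_type:
        raise ValueError("at least one page type is required")
    return mean(per_page_type.values())


# Illustrative numbers only, not real evaluation results.
example_page_scores = {"Simple": 90.0, "Elementor": 84.0}
example_overall = overall_score(example_page_scores)
print(example_overall)
```

If page types should carry different importance, the mean would need per-type weights, which this document does not define.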

Scoring Dimensions & Weights

Dimension	Weight	What It Measures	Hard Floor
Visual Likeness	25%	Layout accuracy, typography, colour fidelity, responsive behaviour. Compared via side-by-side screenshots at desktop and mobile breakpoints.	≥ 85%
Content Likeness	25%	Text completeness, content order, CTA accuracy, no hallucinated or missing content. Verified via visibility-filtered content diff.	≥ 95%
Interaction Fidelity	10%	Functional accordions, navigation menus, tabs, dropdowns, and modals. Interactive elements must work, not just exist in markup.	≥ 70%
SEO Fidelity	10%	Title tags, meta descriptions, Open Graph tags, canonical URLs, JSON-LD structured data, heading hierarchy preserved.	—
Accessibility	5%	Semantic HTML elements, ARIA attributes, colour contrast, keyboard navigation support.	—
Asset Integrity	5%	All image and font paths resolve locally. Zero external CDN references. No placeholder images or broken asset links.	—
Turns to Completion	20%	Number of agent invocations required; fewer turns indicate a more efficient pipeline. Points by turn count: 1–3 = 5, 4–5 = 4, 6–7 = 3, 8–9 = 2, 10–12 = 1, 13+ = 0.	—
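The weights and the turn-count bands in the table above can be sketched in code as follows. This is a minimal illustration, not the pipeline's actual implementation: the dictionary keys and the function names `turns_to_points` and `weighted_percent` are hypothetical, and the conversion from a 0–5 score to a percentage (score divided by 5) is an assumption, since the document does not spell it out.

```python
# Weights in percent, copied from the dimensions table. They sum to 100.
WEIGHTS = {
    "visual_likeness": 25,
    "content_likeness": 25,
    "interaction_fidelity": 10,
    "seo_fidelity": 10,
    "accessibility": 5,
    "asset_integrity": 5,
    "turns_to_completion": 20,
}


def turns_to_points(turns: int) -> int:
    """Map the number of agent invocations to a 0-5 score using the table's bands."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    if turns <= 3:
        return 5
    if turns <= 5:
        return 4
    if turns <= 7:
        return 3
    if turns <= 9:
        return 2
    if turns <= 12:
        return 1
    return 0


def weighted_percent(scores: dict[str, float]) -> float:
    """Combine 0-5 dimension scores into a weighted percentage.

    Assumption: a score of 5 corresponds to 100%, so percent = score / 5 * 100.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    # Each weight is already in percent, so dividing by the maximum score (5)
    # yields a result on a 0-100 scale.
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS) / 5


# Illustrative judge scores only, not real evaluation results.
example_scores = {
    "visual_likeness": 4.5,
    "content_likeness": 5,
    "interaction_fidelity": 4,
    "seo_fidelity": 4,
    "accessibility": 3,
    "asset_integrity": 5,
    "turns_to_completion": turns_to_points(4),
}
example_percent = weighted_percent(example_scores)
print(example_percent)
```

Keeping the weights as whole percentages avoids accumulating floating-point error from fractions such as 0.05 and 0.10.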

Scoring Scale

Score	Label	Meaning
0	Complete failure	Dimension entirely missing or non-functional
1	Major gaps	Present but with critical deficiencies
2	Partial	Roughly half of expected quality achieved
3	Acceptable	Functional with noticeable issues
4	Good	Minor issues only, production-viable with light touch-ups
5	Indistinguishable	Matches source page — no meaningful difference

Production Threshold

An overall weighted score of at least 85% is required for a model to qualify for production use. Models below this threshold are not considered, regardless of cost or speed advantages.

Hard Minimum Floors

Even if a model's overall weighted score meets or exceeds 85%, it cannot pass if any mission-critical dimension falls below its hard floor:

Visual Likeness ≥ 85% — An output that looks wrong is not usable, regardless of other scores
Content Likeness ≥ 95% — Missing or reordered content is a client-facing failure
Interaction Fidelity ≥ 70% — Broken interactive elements degrade user experience and trust
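
The two-part gate (overall threshold plus hard floors) could be checked along these lines. This is a sketch under stated assumptions: the names `PRODUCTION_THRESHOLD`, `HARD_FLOORS`, and `qualifies` are hypothetical, and dimension scores are assumed to have already been converted to percentages.

```python
PRODUCTION_THRESHOLD = 85.0  # minimum overall weighted score, in percent

# Hard floors in percent for the mission-critical dimensions listed above.
HARD_FLOORS = {
    "visual_likeness": 85.0,
    "content_likeness": 95.0,
    "interaction_fidelity": 70.0,
}


def qualifies(
    overall_percent: float, dimension_percents: dict[str, float]
) -> tuple[bool, list[str]]:
    """Return (passed, reasons); reasons lists every failed check."""
    failures = []
    if overall_percent < PRODUCTION_THRESHOLD:
        failures.append(
            f"overall {overall_percent:.1f}% is below {PRODUCTION_THRESHOLD:.1f}%"
        )
    for dim, floor in HARD_FLOORS.items():
        value = dimension_percents[dim]
        if value < floor:
            failures.append(f"{dim} {value:.1f}% is below floor {floor:.1f}%")
    return (not failures, failures)


# Illustrative numbers only: a strong overall score that still fails
# because Content Likeness is under its 95% floor.
example_passed, example_reasons = qualifies(
    88.0,
    {"visual_likeness": 90.0, "content_likeness": 90.0, "interaction_fidelity": 80.0},
)
print(example_passed, example_reasons)
```

Returning the list of failed checks, rather than only a boolean, makes it easier to report why a model was rejected.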