Use this page to log scores as you complete test runs. Once results are finalized, a summary will be added to the main WordPress to HTML page.


Scoring Weights (v3.2)

Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%

Production threshold: 85%+ overall weighted score required.
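The weighted total can be computed as a simple dot product of metric scores and weights. A minimal sketch using the v3.2 weights above (the metric keys, function names, and example scores are illustrative placeholders, not real run data):

```python
# v3.2 scoring weights, as listed on this page (sum to 1.0).
WEIGHTS = {
    "visual_likeness": 0.25,
    "content_likeness": 0.25,
    "interaction_fidelity": 0.10,
    "seo_fidelity": 0.10,
    "accessibility": 0.05,
    "asset_integrity": 0.05,
    "turns_to_completion": 0.20,
}

PRODUCTION_THRESHOLD = 85.0  # minimum overall weighted score for production

def weighted_total(scores: dict[str, float]) -> float:
    """Overall score: sum of (metric score x weight), scores on a 0-100 scale."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

def production_ready(scores: dict[str, float]) -> bool:
    return weighted_total(scores) >= PRODUCTION_THRESHOLD

# Hypothetical example run (placeholder numbers, not a logged result):
example = {
    "visual_likeness": 90, "content_likeness": 95, "interaction_fidelity": 80,
    "seo_fidelity": 85, "accessibility": 80, "asset_integrity": 100,
    "turns_to_completion": 100,
}
```

Note that because Turns to Completion carries 20%, a model that nails the page but burns many turns can still miss the threshold.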


Simple Page

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|---|
| Prompt Version | | | | | |
| Visual Likeness | 25% | | | | Layout, typography, color, responsive (min 85%) |
| Content Likeness | 25% | | | | Text completeness, order, CTA accuracy (min 95%) |
| Interaction Fidelity | 10% | | | | Accordions, nav, tabs, dropdowns, modals (min 70%) |
| SEO Fidelity | 10% | | | | Title, meta, OG, canonicals, JSON-LD, heading hierarchy |
| Accessibility | 5% | | | | Semantic HTML, ARIA, contrast, keyboard nav |
| Asset Integrity | 5% | | | | All local paths resolve, no external refs, fonts present |
| Turns to Completion | 20% | | | | 1–3=5pts · 4–5=4pts · 6–7=3pts · 8–9=2pts · 10–12=1pt · 13+=0 |
| Weighted Total | 100% | | | | |
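The Turns to Completion rubric above can be sketched as a lookup. The x20 rescaling from rubric points to the 0-100 scale used in the score tables is an assumption, inferred from the judge-calibration note recording Sonnet's 2 invocations as 100 pts:

```python
def turn_points(invocations: int) -> int:
    """Rubric points (0-5) for a session-level agent-invocation count."""
    if invocations <= 3:
        return 5
    if invocations <= 5:
        return 4
    if invocations <= 7:
        return 3
    if invocations <= 9:
        return 2
    if invocations <= 12:
        return 1
    return 0

def turn_score(invocations: int) -> int:
    """Rubric points rescaled to the 0-100 range used in the tables.

    The x20 factor is inferred, not stated in the rubric itself.
    """
    return turn_points(invocations) * 20
```

Per the calibration notes further down, the input is session-level agent invocations, not individual conversation turns.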

Elementor Page

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|---|
| Prompt Version | | v3.4 | v3.4 | v3.4 | |
| Visual Likeness | 25% | 75 | 73 | 18 | No layout structure — all content stacked vertically, page 4x taller than reference, no grid/flexbox, 19–26% pixel mismatch |
| Content Likeness | 25% | 100 | 98 | 88 | 98% content diff match but hero section duplicated, spurious h1 "Home" injected, raw HTML artifacts in output |
| Interaction Fidelity | 10% | 85 | 70 | 5 | Zero working interactions; hamburger links to #elementor-action (non-functional in static HTML); no hover states |
| SEO Fidelity | 10% | 82 | 95 | 78 | Title, meta, OG, JSON-LD all present; broken heading hierarchy (2x h1) |
| Accessibility | 5% | 82 | 85 | 30 | Skip nav present but nested main elements, footer wrapping page content, 12 images missing alt text |
| Asset Integrity | 5% | 100 | 100 | 0 | FAIL — massive WP/Elementor contamination: elementor classes, data-element_type attrs, --e-global-* CSS vars, WP menu classes throughout |
| Turns to Completion | 20% | 100 | 80 | 60 | ~7 pipeline scripts executed (01_scrape through 08_validation) |
| Weighted Total | 100% | 89.6 | 84.5 | 48.3 | FAIL — below 85% threshold. Visual, content, interaction, and asset integrity all failed minimums |

⚠️ Gemini 3.1 Pro as Judge — Calibration Issues

Gemini was used as a second judge for the Claude Sonnet 4.6 and GPT-5.4 Thinking runs. Two calibration issues were identified:

1. Turns to Completion (Sonnet): Gemini scored 0 because it counted individual conversation turns (~35) instead of session-level agent invocations, which is what the rubric measures (Sonnet had 2 invocations = 100 pts). Opus and GPT-5.4 both scored this metric 100.

2. Interaction Fidelity (GPT): Gemini scored 100, missing the known hamburger bug (button links to LinkedIn instead of toggling the nav drawer). Opus scored 70, GPT-5.4 scored 70.

Gemini judge scores are recorded in the Judge Reports below but excluded from the primary scoring table. Use Opus and GPT-5.4 judge scores as the authoritative reference.


Overall Average (across both page types)

| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro |
|---|---|---|---|
| Visual Likeness avg | | | |
| Content Likeness avg | | | |
| Interaction Fidelity avg | | | |
| SEO Fidelity avg | | | |
| Accessibility avg | | | |
| Asset Integrity avg | | | |
| Turns to Completion avg | | | |
| Overall Weighted Score | | | |
| Production Ready? (85%+) | | | |