Use this page to log scores as you complete test runs. Once results are finalized, a summary will be added to the main WordPress to HTML page.


Scoring Weights (v3.2)

Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%

Production threshold: 85%+ overall weighted score required.
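
As a minimal sketch of how the weights above combine into an overall score, the Python below multiplies each metric score by its weight and sums the result. The dictionary keys and the function name are illustrative and not part of any project tooling; the sample scores are the Claude Sonnet 4.6 values recorded in the Simple Page table.

```python
# Weights from the v3.2 scoring scheme, in percent (they sum to 100).
WEIGHTS = {
    "visual_likeness": 25,
    "content_likeness": 25,
    "interaction_fidelity": 10,
    "seo_fidelity": 10,
    "accessibility": 5,
    "asset_integrity": 5,
    "turns_to_completion": 20,
}
PRODUCTION_THRESHOLD = 85  # percent, overall weighted score


def weighted_total(scores):
    """Return the weighted score (0-100) for per-metric scores on a 0-100 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the weighted metrics")
    # Integer weights keep the sum exact until the final division.
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS) / 100


# Claude Sonnet 4.6, Simple Page (values copied from the table).
sonnet_simple = {
    "visual_likeness": 89,
    "content_likeness": 100,
    "interaction_fidelity": 100,
    "seo_fidelity": 83,
    "accessibility": 86,
    "asset_integrity": 60,
    "turns_to_completion": 60,
}

total = weighted_total(sonnet_simple)
print(total)                          # 84.85 (logged as 84.9)
print(total >= PRODUCTION_THRESHOLD)  # False
```

The same call with any other column of a results table should reproduce that column's Weighted Total row, which makes it a quick check when logging new runs.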


Simple Page

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes | Criteria |
| --- | --- | --- | --- | --- | --- | --- |
| Prompt Version | | v3.4 | v3.4 | v3.4 | | |
| Visual Likeness | 25% | 89 | 82 | 22 | Gemini: two-column post layout at desktop, nav is raw WP mobile menu dump, no separators, wrong content width | Layout, typography, color, responsive (min 85%) |
| Content Likeness | 25% | 100 | 97 | 92 | Gemini: 113/113 nodes matched but header nav links to wp-themes.com, WP HTML comment preserved | Text completeness, order, CTA accuracy (min 95%) |
| Interaction Fidelity | 10% | 100 | 80 | 10 | Gemini: hamburger/close buttons visible but no JS, no submenu toggle, only native form controls work | Accordions, nav, tabs, dropdowns, modals (min 70%) |
| SEO Fidelity | 10% | 83 | 65 | 58 | Gemini: broken heading hierarchy (H1 inside article), no canonical/OG, duplicate skip links | Title, meta, OG, canonicals, JSON-LD, heading hierarchy |
| Accessibility | 5% | 86 | 78 | 40 | Gemini: non-functional hamburger/close mislead AT users, WP classes confuse semantics, broken heading hierarchy | Semantic HTML, ARIA, contrast, keyboard nav |
| Asset Integrity | 5% | 60 | 90 | 0 | Gemini: FAIL. WP contamination in html_classes (hentry, entry-content, open-on-hover-click) + css_properties (--wp--preset--spacing--40/70) | All local paths resolve, no external refs, fonts present |
| Turns to Completion | 20% | 60 | 60 | 50 | Gemini: all orchestrator steps completed but cleanup steps ineffective; WP artifacts remain in final output | 1–3=5pts · 4–5=4pts · 6–7=3pts · 8–9=2pts · 10–12=1pt · 13+=0 |
| Weighted Total | 100% | 84.9 | 79.65 | 47.3 | All FAIL: below 85% threshold | |
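
The Turns to Completion bands listed in the table can be sketched as a small lookup. This is an illustrative helper, not the judges' actual implementation: `turns_to_points` and `TURN_BANDS` are hypothetical names, and because the table does not state how points are converted to the 0-100 score column, the sketch stops at raw points.

```python
# (upper bound of the band, points awarded), taken from the rubric row:
# 1–3=5pts · 4–5=4pts · 6–7=3pts · 8–9=2pts · 10–12=1pt · 13+=0
TURN_BANDS = [(3, 5), (5, 4), (7, 3), (9, 2), (12, 1)]


def turns_to_points(turns):
    """Map a turn count to rubric points (5 down to 0)."""
    if turns < 1:
        raise ValueError("turn count must be at least 1")
    for upper, points in TURN_BANDS:
        if turns <= upper:
            return points
    return 0  # 13 or more turns


print(turns_to_points(2))   # 5
print(turns_to_points(7))   # 3
print(turns_to_points(13))  # 0
```

Which quantity counts as a "turn" matters a great deal here: a count of 2 earns full points, while a count in the thirties earns none.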

Elementor Page

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
| --- | --- | --- | --- | --- | --- |
| Prompt Version | | v3.4 | v3.4 | v3.4 | |
| Visual Likeness | 25% | 75 | 73 | 18 | Gemini: no layout structure, all content stacked vertically, page 4x taller than reference, no grid/flexbox, 19-26% pixel mismatch |
| Content Likeness | 25% | 100 | 98 | 88 | Gemini: 98% content diff match but hero section duplicated, spurious h1 "Home" injected, raw HTML artifacts in output |
| Interaction Fidelity | 10% | 85 | 70 | 5 | Gemini: zero working interactions; hamburger links to #elementor-action (non-functional in static HTML); no hover states |
| SEO Fidelity | 10% | 82 | 95 | 78 | Gemini: title, meta, OG, JSON-LD all present; broken heading hierarchy (2x h1) |
| Accessibility | 5% | 82 | 85 | 30 | Gemini: skip nav present but nested main elements, footer wrapping page content, 12 images missing alt text |
| Asset Integrity | 5% | 100 | 100 | 0 | Gemini: FAIL. Massive WP/Elementor contamination: elementor classes, data-element_type attrs, --e-global-* CSS vars, WP menu classes throughout |
| Turns to Completion | 20% | 100 | 80 | 60 | Gemini: ~7 pipeline scripts executed (01_scrape through 08_validation) |
| Weighted Total | 100% | 89.6 | 84.5 | 48.3 | Gemini: FAIL, below 85% threshold. Visual, content, interaction, and asset integrity all failed minimums |
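
A weighted total above 85% does not by itself mean every metric cleared its minimum. The sketch below checks the three numeric minimums given in the Simple Page table (Visual 85, Content 95, Interaction 70). It assumes, without confirmation from the rubric, that the same minimums apply to the Elementor page and that a run must clear both the overall threshold and every minimum; `failed_minimums` and `is_production_ready` are hypothetical names.

```python
# Numeric per-metric minimums as listed in the Simple Page table.
# Assumption: they apply to every page type.
MINIMUMS = {
    "visual_likeness": 85,
    "content_likeness": 95,
    "interaction_fidelity": 70,
}
PRODUCTION_THRESHOLD = 85  # percent, overall weighted score


def failed_minimums(scores):
    """Return the metrics whose score is below the listed minimum."""
    return [m for m, minimum in MINIMUMS.items() if scores[m] < minimum]


def is_production_ready(weighted_total, scores):
    """Assumed rule: overall threshold met AND no minimum missed."""
    return weighted_total >= PRODUCTION_THRESHOLD and not failed_minimums(scores)


# Elementor Page values copied from the table.
sonnet = {"visual_likeness": 75, "content_likeness": 100, "interaction_fidelity": 85}
gemini = {"visual_likeness": 18, "content_likeness": 88, "interaction_fidelity": 5}

print(failed_minimums(sonnet))            # ['visual_likeness']
print(failed_minimums(gemini))            # all three metrics
print(is_production_ready(89.6, sonnet))  # False
```

Under these assumptions, the Sonnet run clears the overall threshold on this page but misses the Visual Likeness minimum. Asset Integrity is judged pass/fail rather than against a numeric minimum, so it is left out of the sketch.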

⚠️ Gemini 3.1 Pro as Judge — Calibration Issues

Gemini was used as a second judge for the Claude Sonnet 4.6 and GPT-5.4 Thinking runs. Two calibration issues were identified:

1. Turns to Completion (Sonnet): Gemini scored 0, treating individual turn count (~35) as agent invocations. The rubric counts session-level invocations (Sonnet had 2 = 100 pts). Opus and GPT-5.4 both scored this 100.

2. Interaction Fidelity (GPT): Gemini scored 100, missing the known hamburger bug (button links to LinkedIn instead of toggling the nav drawer). Opus scored 70, GPT-5.4 scored 70.

Gemini judge scores are recorded in the Judge Reports below but excluded from the primary scoring table. Use Opus and GPT-5.4 judge scores as the authoritative reference.
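
One way to keep the Gemini judge scores on record while leaving them out of the primary table is to filter by judge name before reporting. The snippet is a sketch under that assumption; the function names are hypothetical, and the sample values are the two calibration cases described above.

```python
EXCLUDED_JUDGES = {"Gemini 3.1 Pro"}


def authoritative_scores(judge_scores):
    """Drop excluded judges; keep the remaining scores in their original order."""
    return {j: s for j, s in judge_scores.items() if j not in EXCLUDED_JUDGES}


def judge_spread(judge_scores):
    """Gap between the highest and lowest judge score for one metric."""
    return max(judge_scores.values()) - min(judge_scores.values())


# Turns to Completion for the Sonnet run, and Interaction Fidelity for the GPT run.
turns_sonnet = {"Opus": 100, "GPT-5.4": 100, "Gemini 3.1 Pro": 0}
interaction_gpt = {"Opus": 70, "GPT-5.4": 70, "Gemini 3.1 Pro": 100}

print(authoritative_scores(turns_sonnet))     # {'Opus': 100, 'GPT-5.4': 100}
print(authoritative_scores(interaction_gpt))  # {'Opus': 70, 'GPT-5.4': 70}
print(judge_spread(turns_sonnet), judge_spread(interaction_gpt))  # 100 30
```

A large spread across judges on a single metric is a cheap signal that one judge may be reading the rubric differently and is worth a manual look.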


Overall Average (across both page types)

| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
| --- | --- | --- | --- | --- |
| Visual Likeness avg | 82.0 | 77.5 | 20.0 | Elementor: 75/73/18 · Simple: 89/82/22 |
| Content Likeness avg | 100.0 | 97.5 | 90.0 | Elementor: 100/98/88 · Simple: 100/97/92 |
| Interaction Fidelity avg | 92.5 | 75.0 | 7.5 | Elementor: 85/70/5 · Simple: 100/80/10 |
| SEO Fidelity avg | 82.5 | 80.0 | 68.0 | Elementor: 82/95/78 · Simple: 83/65/58 |
| Accessibility avg | 84.0 | 81.5 | 35.0 | Elementor: 82/85/30 · Simple: 86/78/40 |
| Asset Integrity avg | 80.0 | 95.0 | 0.0 | Elementor: 100/100/0 · Simple: 60/90/0. Gemini: WP contamination in both tests |
| Turns to Completion avg | 80.0 | 70.0 | 55.0 | Elementor: 100/80/60 · Simple: 60/60/50 |
| Overall Weighted Score | 87.25 | 82.08 | 47.80 | Average of Elementor + Simple weighted totals |
| Production Ready? (85%+) | NO (0/2 passed) | NO (0/2 passed) | NO (0/2 passed) | No model passed either test. Sonnet closest (89.6 Elementor, 84.9 Simple); its Elementor total clears 85% but Visual Likeness (75) is below the 85% minimum |
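
The Overall Weighted Score row is a plain mean of the two logged weighted totals. As a hedged sketch, the helper below (a hypothetical name) reproduces that row using `decimal` so that half-way values such as 82.075 round up as they do in the table, instead of depending on binary floating-point representation.

```python
from decimal import Decimal, ROUND_HALF_UP


def page_type_average(elementor, simple):
    """Average two logged totals (given as strings) and round to two decimals."""
    avg = (Decimal(elementor) + Decimal(simple)) / 2
    return avg.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(page_type_average("89.6", "84.9"))   # 87.25  Claude Sonnet 4.6
print(page_type_average("84.5", "79.65"))  # 82.08  GPT-5.4 Thinking
print(page_type_average("48.3", "47.3"))   # 47.80  Gemini 3.1 Pro
```

The inputs are the totals as logged, already rounded to one or two decimals; averaging the unrounded totals would give slightly different figures.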