Use this page to log scores as you complete test runs. Once results are finalized, a summary will be added to the main WordPress to HTML page.


Scoring Weights (v3.2)

Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%

Production threshold: 85%+ overall weighted score required.
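The weighted total can be computed as a simple dot product of metric scores and weights. A minimal sketch using the v3.2 weights above (the metric keys, function names, and example scores are illustrative placeholders, not real run data):

```python
# v3.2 scoring weights, as listed on this page (sum to 1.0).
WEIGHTS = {
    "visual_likeness": 0.25,
    "content_likeness": 0.25,
    "interaction_fidelity": 0.10,
    "seo_fidelity": 0.10,
    "accessibility": 0.05,
    "asset_integrity": 0.05,
    "turns_to_completion": 0.20,
}

PRODUCTION_THRESHOLD = 85.0  # minimum overall weighted score for production

def weighted_total(scores: dict[str, float]) -> float:
    """Overall score: sum of (metric score x weight), scores on a 0-100 scale."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

def production_ready(scores: dict[str, float]) -> bool:
    return weighted_total(scores) >= PRODUCTION_THRESHOLD

# Hypothetical example run (placeholder numbers, not a logged result):
example = {
    "visual_likeness": 90, "content_likeness": 95, "interaction_fidelity": 80,
    "seo_fidelity": 85, "accessibility": 80, "asset_integrity": 100,
    "turns_to_completion": 100,
}
```

Note that because Turns to Completion carries 20%, a model that nails the page but burns many turns can still miss the threshold.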


Simple Page

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|---|
| Prompt Version | | | | | |
| Visual Likeness | 25% | | | | Layout, typography, color, responsive (min 85%) |
| Content Likeness | 25% | | | | Text completeness, order, CTA accuracy (min 95%) |
| Interaction Fidelity | 10% | | | | Accordions, nav, tabs, dropdowns, modals (min 70%) |
| SEO Fidelity | 10% | | | | Title, meta, OG, canonicals, JSON-LD, heading hierarchy |
| Accessibility | 5% | | | | Semantic HTML, ARIA, contrast, keyboard nav |
| Asset Integrity | 5% | | | | All local paths resolve, no external refs, fonts present |
| Turns to Completion | 20% | | | | 1–3=5pts · 4–5=4pts · 6–7=3pts · 8–9=2pts · 10–12=1pt · 13+=0 |
| Weighted Total | 100% | | | | |
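The Turns to Completion rubric above can be sketched as a lookup. The x20 rescaling from rubric points to the 0-100 scale used in the score tables is an assumption, inferred from the judge-calibration note recording Sonnet's 2 invocations as 100 pts:

```python
def turn_points(invocations: int) -> int:
    """Rubric points (0-5) for a session-level agent-invocation count."""
    if invocations <= 3:
        return 5
    if invocations <= 5:
        return 4
    if invocations <= 7:
        return 3
    if invocations <= 9:
        return 2
    if invocations <= 12:
        return 1
    return 0

def turn_score(invocations: int) -> int:
    """Rubric points rescaled to the 0-100 range used in the tables.

    The x20 factor is inferred, not stated in the rubric itself.
    """
    return turn_points(invocations) * 20
```

Per the calibration notes further down, the input is session-level agent invocations, not individual conversation turns.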

Elementor Page

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|---|
| Prompt Version | | v3.4 | v3.4 | v3.4 | |
| Visual Likeness | 25% | 75 | 73 | 18 | No layout structure — all content stacked vertically, page 4x taller than reference, no grid/flexbox, 19–26% pixel mismatch |
| Content Likeness | 25% | 100 | 98 | 88 | 98% content diff match but hero section duplicated, spurious h1 "Home" injected, raw HTML artifacts in output |
| Interaction Fidelity | 10% | 85 | 70 | 5 | Zero working interactions; hamburger links to #elementor-action (non-functional in static HTML); no hover states |
| SEO Fidelity | 10% | 82 | 95 | 78 | Title, meta, OG, JSON-LD all present; broken heading hierarchy (2x h1) |
| Accessibility | 5% | 82 | 85 | 30 | Skip nav present but nested main elements, footer wrapping page content, 12 images missing alt text |
| Asset Integrity | 5% | 100 | 100 | 0 | FAIL — massive WP/Elementor contamination: elementor classes, data-element_type attrs, --e-global-* CSS vars, WP menu classes throughout |
| Turns to Completion | 20% | 100 | 80 | 60 | ~7 pipeline scripts executed (01_scrape through 08_validation) |
| Weighted Total | 100% | 89.6 | 84.5 | 48.3 | FAIL — below 85% threshold. Visual, content, interaction, and asset integrity all failed minimums |

⚠️ Gemini 3.1 Pro as Judge — Calibration Issues

Gemini was used as a second judge for the Claude Sonnet 4.6 and GPT-5.4 Thinking runs. Two calibration issues were identified:

1. Turns to Completion (Sonnet): Gemini scored 0 because it counted individual conversation turns (~35) instead of session-level agent invocations, which is what the rubric measures (Sonnet had 2 invocations = 100 pts). Opus and GPT-5.4 both scored this metric 100.

2. Interaction Fidelity (GPT): Gemini scored 100, missing the known hamburger bug (button links to LinkedIn instead of toggling the nav drawer). Opus scored 70, GPT-5.4 scored 70.

Gemini judge scores are recorded in the Judge Reports below but excluded from the primary scoring table. Use Opus and GPT-5.4 judge scores as the authoritative reference.


Overall Average (across both page types)

| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro |
|---|---|---|---|
| Visual Likeness avg | | | |
| Content Likeness avg | | | |
| Interaction Fidelity avg | | | |
| SEO Fidelity avg | | | |
| Accessibility avg | | | |
| Asset Integrity avg | | | |
| Turns to Completion avg | | | |
| Overall Weighted Score | | | |
| Production Ready? (85%+) | | | |