Use this page to log scores as you complete test runs. Once results are finalized, a summary will be added to the main WordPress to HTML page.
Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%
Production threshold: 85%+ overall weighted score required, plus the per-metric minimums noted in the tables (Visual ≥85, Content ≥95, Interaction ≥70).
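The weighted totals in the tables below follow directly from the legend above. A minimal sketch that reproduces one of them (weights from the legend; the example scores are Claude Sonnet 4.6's simple-theme run; all names are illustrative):

```python
# Metric weights from the scoring legend (sum to 1.0).
WEIGHTS = {
    "visual": 0.25,
    "content": 0.25,
    "interaction": 0.10,
    "seo": 0.10,
    "accessibility": 0.05,
    "assets": 0.05,
    "turns": 0.20,
}

def weighted_total(scores: dict[str, float]) -> float:
    """Weighted sum of per-metric scores (each on a 0-100 scale)."""
    return round(sum(WEIGHTS[m] * s for m, s in scores.items()), 2)

# Claude Sonnet 4.6, simple-theme run (scores from the table below).
sonnet_simple = {
    "visual": 89, "content": 100, "interaction": 100, "seo": 83,
    "accessibility": 86, "assets": 60, "turns": 60,
}
print(weighted_total(sonnet_simple))  # 84.85, reported in the table as 84.9
```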
Simple theme test

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes | Criteria |
|---|---|---|---|---|---|---|
| Prompt Version | — | v3.4 | v3.4 | v3.4 | | |
| Visual Likeness | 25% | 89 | 82 | 22 | Gemini: two-column post layout at desktop, nav is raw WP mobile menu dump, no separators, wrong content width | Layout, typography, color, responsive (min 85%) |
| Content Likeness | 25% | 100 | 97 | 92 | Gemini: 113/113 nodes matched but header nav links to wp-themes.com, WP HTML comment preserved | Text completeness, order, CTA accuracy (min 95%) |
| Interaction Fidelity | 10% | 100 | 80 | 10 | Gemini: hamburger/close buttons visible but no JS, no submenu toggle, only native form controls work | Accordions, nav, tabs, dropdowns, modals (min 70%) |
| SEO Fidelity | 10% | 83 | 65 | 58 | Gemini: broken heading hierarchy (H1 inside article), no canonical/OG, duplicate skip links | Title, meta, OG, canonicals, JSON-LD, heading hierarchy |
| Accessibility | 5% | 86 | 78 | 40 | Gemini: non-functional hamburger/close mislead AT users, WP classes confuse semantics, broken heading hierarchy | Semantic HTML, ARIA, contrast, keyboard nav |
| Asset Integrity | 5% | 60 | 90 | 0 | Gemini: FAIL — WP contamination in html_classes (hentry, entry-content, open-on-hover-click) + css_properties (--wp--preset--spacing--40/70) | All local paths resolve, no external refs, fonts present |
| Turns to Completion | 20% | 60 | 60 | 50 | Gemini: all orchestrator steps completed but cleanup steps ineffective — WP artifacts remain in final output | 1–3=5pts · 4–5=4pts · 6–7=3pts · 8–9=2pts · 10–12=1pt · 13+=0 |
| Weighted Total | 100% | 84.9 | 79.65 | 47.3 | All FAIL — below 85% threshold | |
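The Turns to Completion row is graded on the 5-point scale in its criteria cell and then normalized to the 0–100 scale the table uses. A hedged sketch of that mapping (the ×20 normalization factor is an assumption inferred from the reported scores, e.g. 2 invocations → 5 pts → 100 and 6–7 → 3 pts → 60; Gemini's 50 does not fall on this grid, so that judge may have interpolated):

```python
def turns_points(invocations: int) -> int:
    """Map session-level agent invocations to rubric points (1-3=5 ... 13+=0)."""
    if invocations <= 3:
        return 5
    if invocations <= 5:
        return 4
    if invocations <= 7:
        return 3
    if invocations <= 9:
        return 2
    if invocations <= 12:
        return 1
    return 0

def turns_score(invocations: int) -> int:
    """Normalize rubric points to the 0-100 table scale (assumed factor of 20)."""
    return turns_points(invocations) * 20

print(turns_score(2))  # 100 (matches Sonnet per the judge calibration note)
print(turns_score(7))  # 60
```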
Elementor test

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes | Criteria |
|---|---|---|---|---|---|---|
| Prompt Version | — | v3.4 | v3.4 | v3.4 | | |
| Visual Likeness | 25% | 75 | 73 | 18 | No layout structure — all content stacked vertically, page 4x taller than reference, no grid/flexbox, 19-26% pixel mismatch | |
| Content Likeness | 25% | 100 | 98 | 88 | 98% content diff match but hero section duplicated, spurious h1 "Home" injected, raw HTML artifacts in output | |
| Interaction Fidelity | 10% | 85 | 70 | 5 | Zero working interactions; hamburger links to #elementor-action (non-functional in static HTML); no hover states | |
| SEO Fidelity | 10% | 82 | 95 | 78 | Title, meta, OG, JSON-LD all present; broken heading hierarchy (2x h1) | |
| Accessibility | 5% | 82 | 85 | 30 | Skip nav present but nested main elements, footer wrapping page content, 12 images missing alt text | |
| Asset Integrity | 5% | 100 | 100 | 0 | FAIL — massive WP/Elementor contamination: elementor classes, data-element_type attrs, --e-global-* CSS vars, WP menu classes throughout | |
| Turns to Completion | 20% | 100 | 80 | 60 | ~7 pipeline scripts executed (01_scrape through 08_validation) | |
| Weighted Total | 100% | 89.6 | 84.5 | 48.3 | All FAIL: Sonnet clears the 85% overall threshold but misses the 85 Visual Likeness minimum (75); GPT-5.4 and Gemini fall below the threshold, and Gemini also fails the Content, Interaction, and Asset Integrity minimums | |
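The Asset Integrity zeros above come down to WordPress/Elementor artifacts surviving into the final output. A minimal contamination scan along these lines would catch both failures (the marker list is taken from the artifacts named in the tables; the function and sample markup are illustrative, not the actual validator):

```python
# Substrings indicating WordPress/Elementor leakage, per the failure notes
# above: hentry / entry-content classes, elementor markup, and
# --wp--preset-- / --e-global- CSS custom properties.
WP_MARKERS = [
    "hentry",
    "entry-content",
    "elementor",
    "data-element_type",
    "--wp--preset--",
    "--e-global-",
    "open-on-hover-click",
]

def find_contamination(html: str) -> list[str]:
    """Return the WP markers present in the output (empty list = clean)."""
    return [m for m in WP_MARKERS if m in html]

sample = '<article class="hentry"><div style="gap:var(--wp--preset--spacing--40)"></div></article>'
print(find_contamination(sample))  # ['hentry', '--wp--preset--']
```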
⚠️ Gemini 3.1 Pro as Judge — Calibration Issues
Gemini was used as a second judge for the Claude Sonnet 4.6 and GPT-5.4 Thinking runs. Two calibration issues were identified:
1. Turns to Completion (Sonnet): Gemini scored it 0 by counting individual conversation turns (~35) instead of session-level agent invocations. The rubric counts invocations per session; Sonnet used 2, worth 100 pts. Opus and GPT-5.4 both scored this metric 100.
2. Interaction Fidelity (GPT): Gemini scored 100, missing the known hamburger bug (button links to LinkedIn instead of toggling the nav drawer). Opus scored 70, GPT-5.4 scored 70.
Gemini judge scores are recorded in the Judge Reports below but excluded from the primary scoring table. Use Opus and GPT-5.4 judge scores as the authoritative reference.
| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|
| Visual Likeness avg | 82.0 | 77.5 | 20.0 | Elementor: 75/73/18 · Simple: 89/82/22 |
| Content Likeness avg | 100.0 | 97.5 | 90.0 | Elementor: 100/98/88 · Simple: 100/97/92 |
| Interaction Fidelity avg | 92.5 | 75.0 | 7.5 | Elementor: 85/70/5 · Simple: 100/80/10 |
| SEO Fidelity avg | 82.5 | 80.0 | 68.0 | Elementor: 82/95/78 · Simple: 83/65/58 |
| Accessibility avg | 84.0 | 81.5 | 35.0 | Elementor: 82/85/30 · Simple: 86/78/40 |
| Asset Integrity avg | 80.0 | 95.0 | 0.0 | Elementor: 100/100/0 · Simple: 60/90/0. Gemini: WP contamination in both tests |
| Turns to Completion avg | 80.0 | 70.0 | 55.0 | Elementor: 100/80/60 · Simple: 60/60/50 |
| Overall Weighted Score | 87.25 | 82.08 | 47.80 | Average of Elementor + Simple weighted totals |
| Production Ready? (85%+) | NO (0/2 passed) | NO (0/2 passed) | NO (0/2 passed) | No run passed the full gate. Sonnet came closest: its Elementor total (89.6) clears 85% but fails the Visual Likeness minimum, and its Simple total (84.9) falls just under the threshold |
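The production gate applied in the row above combines the 85% overall threshold with the per-metric minimums from the first scoring table. A sketch of that check, which shows why Sonnet's 89.6 Elementor total still fails (metric keys and the exact minimum set are assumptions read off the criteria column):

```python
# Per-metric minimums stated in the criteria column of the scoring table.
MINIMUMS = {"visual": 85, "content": 95, "interaction": 70}
THRESHOLD = 85.0  # overall weighted score required for production

def production_ready(weighted_total: float, scores: dict[str, float]) -> bool:
    """Pass only if the weighted total clears 85% AND every minimum is met."""
    meets_minimums = all(scores.get(m, 0) >= v for m, v in MINIMUMS.items())
    return weighted_total >= THRESHOLD and meets_minimums

# Sonnet's Elementor run: 89.6 overall, but Visual Likeness 75 < the 85 minimum.
print(production_ready(89.6, {"visual": 75, "content": 100, "interaction": 85}))  # False
```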