Use this page to log scores as you complete test runs. Once results are finalized, a summary will be added to the main WordPress to HTML page.
Metric weights: Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%
Production threshold: 85%+ overall weighted score required.

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|---|
| Prompt Version | n/a | | | | |
| Visual Likeness | 25% | | | | Layout, typography, color, responsive (min 85%) |
| Content Likeness | 25% | | | | Text completeness, order, CTA accuracy (min 95%) |
| Interaction Fidelity | 10% | | | | Accordions, nav, tabs, dropdowns, modals (min 70%) |
| SEO Fidelity | 10% | | | | Title, meta, OG, canonicals, JSON-LD, heading hierarchy |
| Accessibility | 5% | | | | Semantic HTML, ARIA, contrast, keyboard nav |
| Asset Integrity | 5% | | | | All local paths resolve, no external refs, fonts present |
| Turns to Completion | 20% | | | | 1–3=5pts · 4–5=4pts · 6–7=3pts · 8–9=2pts · 10–12=1pt · 13+=0 |
| Weighted Total | 100% | | | | |
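
The Turns to Completion bands in the rubric can be expressed as a small lookup. The following is a minimal sketch, not the scoring tool itself: the function names are hypothetical, and converting points to the 0–100 scale at 20 per point is an assumption (the rubric only gives points).

```python
# Bands copied from the "Turns to Completion" row: (highest turn count in band, points).
TURN_BANDS = [(3, 5), (5, 4), (7, 3), (9, 2), (12, 1)]


def turns_to_points(turns: int) -> int:
    """Return rubric points (0-5) for a session-level invocation count."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    for upper, points in TURN_BANDS:
        if turns <= upper:
            return points
    return 0  # 13 or more turns


def points_to_score(points: int) -> int:
    """Convert points to a 0-100 score. Assumption: each point is worth 20."""
    return points * 20


if __name__ == "__main__":
    for turns in (2, 7, 13):
        pts = turns_to_points(turns)
        print(f"{turns} turns -> {pts} pts -> score {points_to_score(pts)}")
```
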

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|---|
| Prompt Version | n/a | v3.4 | v3.4 | v3.4 | |
| Visual Likeness | 25% | 75 | 73 | 18 | No layout structure: all content stacked vertically, page 4x taller than reference, no grid/flexbox, 19-26% pixel mismatch |
| Content Likeness | 25% | 100 | 98 | 88 | 98% content diff match but hero section duplicated, spurious h1 "Home" injected, raw HTML artifacts in output |
| Interaction Fidelity | 10% | 85 | 70 | 5 | Zero working interactions; hamburger links to #elementor-action (non-functional in static HTML); no hover states |
| SEO Fidelity | 10% | 82 | 95 | 78 | Title, meta, OG, JSON-LD all present; broken heading hierarchy (2x h1) |
| Accessibility | 5% | 82 | 85 | 30 | Skip nav present but nested main elements, footer wrapping page content, 12 images missing alt text |
| Asset Integrity | 5% | 100 | 100 | 0 | FAIL: massive WP/Elementor contamination (elementor classes, data-element_type attrs, --e-global-* CSS vars, WP menu classes throughout) |
| Turns to Completion | 20% | 100 | 80 | 60 | ~7 pipeline scripts executed (01_scrape through 08_validation) |
| Weighted Total | 100% | 89.6 | 84.5 | 48.3 | FAIL: below 85% threshold. Visual, content, interaction, and asset integrity all failed minimums |
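
The Weighted Total row can be reproduced from the metric weights and the per-metric scores in the table. The sketch below is illustrative only: the names `WEIGHTS`, `MINIMUMS`, `weighted_total`, and `below_minimum` are hypothetical, and rounding half up to one decimal place is an assumption chosen because it matches the recorded totals. Only the three minimums stated in the rubric are checked.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights in percent, taken from the rubric (they sum to 100).
WEIGHTS = {
    "Visual Likeness": 25,
    "Content Likeness": 25,
    "Interaction Fidelity": 10,
    "SEO Fidelity": 10,
    "Accessibility": 5,
    "Asset Integrity": 5,
    "Turns to Completion": 20,
}
# Per-metric minimums stated in the rubric notes.
MINIMUMS = {"Visual Likeness": 85, "Content Likeness": 95, "Interaction Fidelity": 70}
THRESHOLD = Decimal("85")  # production threshold for the weighted total

METRICS = list(WEIGHTS)
# Scores copied from the results table, in the same metric order as WEIGHTS.
SCORES = {
    "Claude Sonnet 4.6": dict(zip(METRICS, [75, 100, 85, 82, 82, 100, 100])),
    "GPT-5.4 Thinking": dict(zip(METRICS, [73, 98, 70, 95, 85, 100, 80])),
    "Gemini 3.1 Pro": dict(zip(METRICS, [18, 88, 5, 78, 30, 0, 60])),
}


def weighted_total(scores: dict) -> Decimal:
    """Weighted average on the 0-100 scale, rounded half up to one decimal."""
    raw = sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)  # exact integer arithmetic
    return (Decimal(raw) / 100).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def below_minimum(scores: dict) -> list:
    """Names of metrics whose score is under the stated minimum."""
    return [m for m, floor in MINIMUMS.items() if scores[m] < floor]


if __name__ == "__main__":
    for model, scores in SCORES.items():
        total = weighted_total(scores)
        print(f"{model}: total={total} meets_threshold={total >= THRESHOLD} "
              f"below_minimum={below_minimum(scores)}")
```

Integer percent weights and `Decimal` are used so that a value such as 89.55 rounds deterministically instead of depending on binary floating-point representation. How a missed per-metric minimum affects the production decision is not stated on this page, so the sketch only lists the affected metrics.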
⚠️ Gemini 3.1 Pro as Judge — Calibration Issues
Gemini was used as a second judge for the Claude Sonnet 4.6 and GPT-5.4 Thinking runs. Two calibration issues were identified:
1. Turns to Completion (Sonnet): Gemini scored 0 because it counted the ~35 individual turns as agent invocations. The rubric counts session-level invocations; Sonnet had 2, which scores 100. The Opus and GPT-5.4 judges both scored this 100.
2. Interaction Fidelity (GPT): Gemini scored 100, missing the known hamburger bug (the button links to LinkedIn instead of toggling the nav drawer). The Opus and GPT-5.4 judges both scored this 70.
Gemini judge scores are recorded in the Judge Reports below but excluded from the primary scoring table. Use Opus and GPT-5.4 judge scores as the authoritative reference.
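
One way to apply this exclusion when combining judge scores is sketched below. The averaging step is an assumption: the page says only to treat the Opus and GPT-5.4 judges as authoritative, not how to combine them. The names `EXCLUDED_JUDGES` and `authoritative_score` are hypothetical; the scores are the ones quoted in the two calibration notes.

```python
from statistics import mean

# Judges whose scores are recorded but kept out of the primary table.
EXCLUDED_JUDGES = {"Gemini 3.1 Pro"}

# (run, metric) -> score per judge, as quoted in the calibration notes.
JUDGE_SCORES = {
    ("Claude Sonnet 4.6", "Turns to Completion"): {
        "Opus": 100, "GPT-5.4": 100, "Gemini 3.1 Pro": 0,
    },
    ("GPT-5.4 Thinking", "Interaction Fidelity"): {
        "Opus": 70, "GPT-5.4": 70, "Gemini 3.1 Pro": 100,
    },
}


def authoritative_score(by_judge: dict) -> float:
    """Mean score over the judges that are not excluded (assumed combination rule)."""
    kept = [score for judge, score in by_judge.items() if judge not in EXCLUDED_JUDGES]
    if not kept:
        raise ValueError("no authoritative judge scores available")
    return mean(kept)


if __name__ == "__main__":
    for (run, metric), by_judge in JUDGE_SCORES.items():
        print(f"{run} / {metric}: {authoritative_score(by_judge)}")
```
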

| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro |
|---|---|---|---|
| Visual Likeness avg | | | |
| Content Likeness avg | | | |
| Interaction Fidelity avg | | | |
| SEO Fidelity avg | | | |
| Accessibility avg | | | |
| Asset Integrity avg | | | |
| Turns to Completion avg | | | |
| Overall Weighted Score | | | |
| Production Ready? (85%+) | | | |