Use this page to log scores as you complete test runs. Once results are finalized, a summary will be added to the main WordPress to HTML page.
Metric weights: Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%
Production threshold: 85%+ overall weighted score required.
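
The weighted total can be reproduced from the weights above. The following is a minimal sketch, assuming every metric is scored on a 0-100 scale; the dictionary keys and the `weighted_total` function name are illustrative and not part of any existing tooling. The example scores are the Claude Sonnet 4.6 values recorded on this page.

```python
# Weights copied from the rubric on this page; they sum to 1.0.
WEIGHTS = {
    "visual_likeness": 0.25,
    "content_likeness": 0.25,
    "interaction_fidelity": 0.10,
    "seo_fidelity": 0.10,
    "accessibility": 0.05,
    "asset_integrity": 0.05,
    "turns_to_completion": 0.20,
}
PRODUCTION_THRESHOLD = 85.0  # overall weighted score required for production


def weighted_total(scores: dict[str, float]) -> float:
    """Return the weighted sum of per-metric scores (assumed 0-100 each)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return sum(WEIGHTS[metric] * scores[metric] for metric in WEIGHTS)


# Scores recorded on this page for Claude Sonnet 4.6 (prompt v3.4).
claude_scores = {
    "visual_likeness": 75,
    "content_likeness": 100,
    "interaction_fidelity": 85,
    "seo_fidelity": 82,
    "accessibility": 82,
    "asset_integrity": 100,
    "turns_to_completion": 100,
}

total = weighted_total(claude_scores)
print(f"weighted total: {total:.2f}")
print("meets overall threshold:", total >= PRODUCTION_THRESHOLD)
```

Meeting the overall threshold is not sufficient on its own, because individual metrics also carry minimums.
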

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|---|
| Prompt Version | n/a | | | | |
| Visual Likeness | 25% | | | | Layout, typography, color, responsive (min 85%) |
| Content Likeness | 25% | | | | Text completeness, order, CTA accuracy (min 95%) |
| Interaction Fidelity | 10% | | | | Accordions, nav, tabs, dropdowns, modals (min 70%) |
| SEO Fidelity | 10% | | | | Title, meta, OG, canonicals, JSON-LD, heading hierarchy |
| Accessibility | 5% | | | | Semantic HTML, ARIA, contrast, keyboard nav |
| Asset Integrity | 5% | | | | All local paths resolve, no external refs, fonts present |
| Turns to Completion | 20% | | | | 1–3 = 5 pts · 4–5 = 4 pts · 6–7 = 3 pts · 8–9 = 2 pts · 10–12 = 1 pt · 13+ = 0 pts |
| Weighted Total | 100% | | | | |
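
The Turns to Completion bands can be written as a small lookup. This is a sketch only: the function names are hypothetical, and the conversion from points to a 0-100 score (20 per point) is an assumption inferred from the recorded results on this page, where 2 turns is logged as 100 and 4 turns as 80.

```python
def turns_to_points(turns: int) -> int:
    """Map a turn count to rubric points using the bands listed in the table."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    if turns <= 3:
        return 5
    if turns <= 5:
        return 4
    if turns <= 7:
        return 3
    if turns <= 9:
        return 2
    if turns <= 12:
        return 1
    return 0  # 13 or more turns


def points_to_score(points: int) -> int:
    """Convert rubric points to a 0-100 score.

    Assumption: 5 points corresponds to 100, so each point is worth 20.
    """
    return points * 20


for turns in (2, 4, 7, 13):
    points = turns_to_points(turns)
    print(turns, "turns:", points, "pts, score", points_to_score(points))
```
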

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes (Gemini 3.1 Pro) |
|---|---|---|---|---|---|
| Prompt Version | n/a | v3.4 | v3.4 | v3.4 | |
| Visual Likeness | 25% | 75 | 73 | 18 | No layout structure: all content stacked vertically, page 4x taller than reference, no grid/flexbox, 19-26% pixel mismatch |
| Content Likeness | 25% | 100 | 98 | 88 | 98% content diff match, but hero section duplicated, spurious h1 "Home" injected, raw HTML artifacts in output |
| Interaction Fidelity | 10% | 85 | 70 | 5 | Zero working interactions; hamburger links to #elementor-action (non-functional in static HTML); no hover states |
| SEO Fidelity | 10% | 82 | 95 | 78 | Title, meta, OG, JSON-LD all present; broken heading hierarchy (2x h1) |
| Accessibility | 5% | 82 | 85 | 30 | Skip nav present, but nested main elements, footer wrapping page content, 12 images missing alt text |
| Asset Integrity | 5% | 100 | 100 | 0 | FAIL: massive WP/Elementor contamination (elementor classes, data-element_type attrs, --e-global-* CSS vars, WP menu classes throughout) |
| Turns to Completion | 20% | 100 | 80 | 60 | ~7 pipeline scripts executed (01_scrape through 08_validation) |
| Weighted Total | 100% | 89.6 | 84.5 | 48.3 | FAIL: below 85% threshold. Visual, content, interaction, and asset integrity all failed minimums |
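
A run can clear the 85% overall threshold and still fail on a per-metric minimum. The sketch below illustrates that gate under one assumption inferred from the recorded results: a run passes only when the weighted total is at least 85 and every stated minimum is met. The `evaluate` function and key names are hypothetical, and only the minimums written in the rubric (visual 85, content 95, interaction 70) are included.

```python
# Per-metric minimums as stated in the rubric; other metrics list no minimum.
MINIMUMS = {
    "visual_likeness": 85,
    "content_likeness": 95,
    "interaction_fidelity": 70,
}
PRODUCTION_THRESHOLD = 85.0


def evaluate(scores: dict[str, float], total: float) -> tuple[bool, list[str]]:
    """Return (passed, failed_minimums) for one run.

    Assumption: passing requires the overall threshold AND every minimum.
    """
    failed = sorted(m for m, floor in MINIMUMS.items() if scores[m] < floor)
    passed = total >= PRODUCTION_THRESHOLD and not failed
    return passed, failed


# Scores and weighted totals as recorded in the table above.
recorded = {
    "Claude Sonnet 4.6": ({"visual_likeness": 75, "content_likeness": 100,
                           "interaction_fidelity": 85}, 89.6),
    "GPT-5.4 Thinking": ({"visual_likeness": 73, "content_likeness": 98,
                          "interaction_fidelity": 70}, 84.5),
    "Gemini 3.1 Pro": ({"visual_likeness": 18, "content_likeness": 88,
                        "interaction_fidelity": 5}, 48.3),
}

for model, (scores, total) in recorded.items():
    passed, failed = evaluate(scores, total)
    print(model, "PASS" if passed else "FAIL", failed)
```

Under these assumptions all three recorded runs fail, with Claude Sonnet 4.6 failing only on the visual likeness minimum.
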

| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro |
|---|---|---|---|
| Visual Likeness avg | | | |
| Content Likeness avg | | | |
| Interaction Fidelity avg | | | |
| SEO Fidelity avg | | | |
| Accessibility avg | | | |
| Asset Integrity avg | | | |
| Turns to Completion avg | | | |
| Overall Weighted Score | | | |
| Production Ready? (85%+) | | | |
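
Once several runs per model are logged, the averages in this table could be filled in with a simple group-and-mean step. The sketch below assumes each run is stored as a dictionary; the field names and the `average_by_model` helper are illustrative. The sample data is the three fidelity scores recorded so far, so each average currently equals the single logged score.

```python
from collections import defaultdict
from statistics import mean

# Completed runs logged on this page (run number, model, weighted fidelity score).
runs = [
    {"run": 5, "model": "Gemini 3.1 Pro", "fidelity": 48.3},
    {"run": 6, "model": "GPT-5.4 Thinking", "fidelity": 84.5},
    {"run": 7, "model": "Claude Sonnet 4.6", "fidelity": 89.6},
]


def average_by_model(records: list[dict], field: str = "fidelity") -> dict[str, float]:
    """Group run records by model and average the chosen numeric field."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for record in records:
        grouped[record["model"]].append(record[field])
    return {model: mean(values) for model, values in grouped.items()}


for model, avg in average_by_model(runs).items():
    print(f"{model}: {avg:.1f}")
```
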

| Run # | Date | Test URL | Page Type | Model | Prompt V | Fidelity Score | Turns | Token Usage | Time (min) | Pass/Fail | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | | Simple | | | | | | | | |
| 2 | | | Simple | | | | | | | | |
| 3 | | | Simple | | | | | | | | |
| 4 | | | Simple | | | | | | | | |
| 5 | 2026-03-15 | mouat.co/william-mouat-ux-design/ | Elementor | Gemini 3.1 Pro | v3.4 | 48.3 | ~7 | (est) | | FAIL | Scraped raw WP DOM with minimal transformation. Massive Elementor contamination. Self-reported 99% across all dimensions vs actual 18-88%. Judge: Claude Opus 4.6 |
| 6 | 2026-03-15 | mouat.co/william-mouat-ux-design/ | Elementor | GPT-5.4 Thinking | v3.4 | 84.5 | 4 | (est) | 270 | FAIL | Clean semantic Tailwind HTML, zero WP contamination, 100% content match. Missed the 85% overall threshold by 0.5 pts, and visual likeness (73) is below its 85% minimum. Honest self-scoring (visual 70.13 vs judge 73). Judge: Claude Opus 4.6 |
| 7 | 2026-03-15 | mouat.co/william-mouat-ux-design/ | Elementor | Claude Sonnet 4.6 | v3.4 | 89.6 | 2 | (est) | 42 | FAIL | Highest aggregate (89.6). Clean Tailwind HTML, zero WP contamination, 100% content, working hamburger toggle. Failed only on visual likeness (75 vs 85% minimum). 42 min wall clock, 6.4x faster than GPT-5.4 Thinking (270 min). Judge: Claude Opus 4.6 |
| 8 | | | Elementor | | | | | | | | |

| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
|---|---|---|---|---|
| Code Quality | | | | W3C valid, clean CSS, no !important abuse |
| Hallucination Rate | | | | Invented content, links, or schema not in source |
| Dependency Footprint | | | | Unnecessary libraries introduced |
| Plugin/Dynamic Degradation | | | | Graceful handling of WooCommerce, forms, sliders |
| Lighthouse Score | | | | Performance / Accessibility / SEO vs original |
| Asset Integrity | | | | Zero external image/font refs in output |
| Time to First Usable Output | | | | Wall clock time to renderable file |
| Token Usage | | | | API = exact count, Subscription = estimated via tokenizer |