Use this page to log scores as you complete test runs. Once results are finalized, a summary will be added to the main WordPress to HTML page.


Scoring Weights (v3.2)

Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%

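Production threshold: an overall weighted score of 85% or higher is required. Metrics that list a minimum (for example, Visual Likeness min 85%) must also meet it, so a run that clears 85% overall can still fail on a per-metric minimum.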
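As a minimal sketch of how the weighted total can be derived from the weights above, assuming every metric is scored on a 0-100 scale: the function name `weighted_total` and the dictionary layout are illustrative, not part of the test pipeline. The sample scores are the Claude Sonnet 4.6 values from the Elementor Page table.

```python
# Weights from the "Scoring Weights (v3.2)" line, expressed as fractions of 1.
WEIGHTS = {
    "Visual Likeness": 0.25,
    "Content Likeness": 0.25,
    "Interaction Fidelity": 0.10,
    "SEO Fidelity": 0.10,
    "Accessibility": 0.05,
    "Asset Integrity": 0.05,
    "Turns to Completion": 0.20,
}


def weighted_total(scores):
    """Return the weighted total (0-100) for per-metric scores on a 0-100 scale.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return sum(WEIGHTS[metric] * scores[metric] for metric in WEIGHTS)


# Claude Sonnet 4.6 scores copied from the Elementor Page table.
claude_elementor = {
    "Visual Likeness": 75,
    "Content Likeness": 100,
    "Interaction Fidelity": 85,
    "SEO Fidelity": 82,
    "Accessibility": 82,
    "Asset Integrity": 100,
    "Turns to Completion": 100,
}

print(f"{weighted_total(claude_elementor):.2f}")
```

For these inputs the sum works out to roughly 89.55, which the Elementor Page table reports as 89.6.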


Simple Page

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
| --- | --- | --- | --- | --- | --- |
| Prompt Version | | | | | |
| Visual Likeness | 25% | | | | Layout, typography, color, responsive (min 85%) |
| Content Likeness | 25% | | | | Text completeness, order, CTA accuracy (min 95%) |
| Interaction Fidelity | 10% | | | | Accordions, nav, tabs, dropdowns, modals (min 70%) |
| SEO Fidelity | 10% | | | | Title, meta, OG, canonicals, JSON-LD, heading hierarchy |
| Accessibility | 5% | | | | Semantic HTML, ARIA, contrast, keyboard nav |
| Asset Integrity | 5% | | | | All local paths resolve, no external refs, fonts present |
| Turns to Completion | 20% | | | | 1–3=5pts · 4–5=4pts · 6–7=3pts · 8–9=2pts · 10–12=1pt · 13+=0 |
| Weighted Total | 100% | | | | |
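The point bands in the Turns to Completion row can be sketched as a small lookup. The conversion from points to a 0-100 score (5 points = 100, each point worth 20) is an assumption, though it is consistent with the 100, 80, and 60 entered for 2, 4, and roughly 7 turns in the Elementor results. Both function names are hypothetical.

```python
def turns_to_points(turns: int) -> int:
    """Map a turn count to points using the bands in the scorecard.

    1-3 = 5 pts, 4-5 = 4 pts, 6-7 = 3 pts, 8-9 = 2 pts, 10-12 = 1 pt, 13+ = 0.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    # Each pair is (highest turn count in the band, points awarded).
    bands = [(3, 5), (5, 4), (7, 3), (9, 2), (12, 1)]
    for upper, points in bands:
        if turns <= upper:
            return points
    return 0


def turns_to_score(turns: int) -> int:
    """Convert points to a 0-100 score.

    Assumption: 5 points corresponds to 100, so each point is worth 20.
    """
    return turns_to_points(turns) * 20


for n in (2, 4, 7, 13):
    print(n, turns_to_points(n), turns_to_score(n))
```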

Elementor Page

| Metric | Weight | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes (Gemini 3.1 Pro run) |
| --- | --- | --- | --- | --- | --- |
| Prompt Version | | v3.4 | v3.4 | v3.4 | |
| Visual Likeness | 25% | 75 | 73 | 18 | No layout structure: all content stacked vertically, page 4x taller than reference, no grid/flexbox, 19-26% pixel mismatch |
| Content Likeness | 25% | 100 | 98 | 88 | 98% content diff match, but hero section duplicated, spurious h1 "Home" injected, raw HTML artifacts in output |
| Interaction Fidelity | 10% | 85 | 70 | 5 | Zero working interactions; hamburger links to #elementor-action (non-functional in static HTML); no hover states |
| SEO Fidelity | 10% | 82 | 95 | 78 | Title, meta, OG, JSON-LD all present; broken heading hierarchy (2x h1) |
| Accessibility | 5% | 82 | 85 | 30 | Skip nav present, but nested main elements, footer wrapping page content, 12 images missing alt text |
| Asset Integrity | 5% | 100 | 100 | 0 | FAIL: massive WP/Elementor contamination (elementor classes, data-element_type attrs, --e-global-* CSS vars, WP menu classes throughout) |
| Turns to Completion | 20% | 100 | 80 | 60 | ~7 pipeline scripts executed (01_scrape through 08_validation) |
| Weighted Total | 100% | 89.6 | 84.5 | 48.3 | FAIL: below 85% threshold. Visual, content, interaction, and asset integrity all failed minimums |
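The pass/fail outcome combines the 85% overall threshold with the per-metric minimums listed in the Notes column of the Simple Page table (Visual 85%, Content 95%, Interaction 70%). A minimal sketch of that check follows. The function `evaluate` is hypothetical, and only the minimums stated as numbers on this page are encoded, so any other minimum (such as one for Asset Integrity) is not covered here.

```python
OVERALL_THRESHOLD = 85.0

# Per-metric minimums stated on this page; other metrics list no numeric minimum.
METRIC_MINIMUMS = {
    "Visual Likeness": 85,
    "Content Likeness": 95,
    "Interaction Fidelity": 70,
}


def evaluate(scores, weighted_total):
    """Return (passed, failed_metrics) for one run.

    A run passes only if the weighted total meets the overall threshold
    and no metric falls below its stated minimum.
    """
    failed = [m for m, minimum in METRIC_MINIMUMS.items() if scores[m] < minimum]
    passed = weighted_total >= OVERALL_THRESHOLD and not failed
    return passed, failed


# Scores and weighted totals copied from the Elementor Page table.
runs = {
    "Claude Sonnet 4.6": ({"Visual Likeness": 75, "Content Likeness": 100, "Interaction Fidelity": 85}, 89.6),
    "GPT-5.4 Thinking": ({"Visual Likeness": 73, "Content Likeness": 98, "Interaction Fidelity": 70}, 84.5),
    "Gemini 3.1 Pro": ({"Visual Likeness": 18, "Content Likeness": 88, "Interaction Fidelity": 5}, 48.3),
}

for model, (scores, total) in runs.items():
    passed, failed = evaluate(scores, total)
    print(model, "PASS" if passed else "FAIL", failed)
```

Under these rules the Claude Sonnet 4.6 run fails despite an 89.6 total, because Visual Likeness (75) is below its 85% minimum.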

Overall Average (across both page types)

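| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| Visual Likeness avg | | | |
| Content Likeness avg | | | |
| Interaction Fidelity avg | | | |
| SEO Fidelity avg | | | |
| Accessibility avg | | | |
| Asset Integrity avg | | | |
| Turns to Completion avg | | | |
| Overall Weighted Score | | | |
| Production Ready? (85%+) | | | |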

Test Runs Log

| Run # | Date | Test URL | Page Type | Model | Prompt Version | Fidelity Score | Turns | Token Usage | Time (min) | Pass/Fail | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | | | Simple | | | | | | | | |
| 2 | | | Simple | | | | | | | | |
| 3 | | | Simple | | | | | | | | |
| 4 | | | Simple | | | | | | | | |
| 5 | 2026-03-15 | mouat.co/william-mouat-ux-design/ | Elementor | Gemini 3.1 Pro | v3.4 | 48.3 | ~7 (est) | | | FAIL | Scraped raw WP DOM with minimal transformation. Massive Elementor contamination. Self-reported 99% across all dimensions vs actual 18-88%. Judge: Claude Opus 4.6 |
| 6 | 2026-03-15 | mouat.co/william-mouat-ux-design/ | Elementor | GPT-5.4 Thinking | v3.4 | 84.5 | 4 (est) | | 270 | FAIL | Clean semantic Tailwind HTML, zero WP contamination, 100% content match. Missed the 85% overall threshold by 0.5 pts; visual likeness (73) is also below its 85% minimum. Honest self-scoring (visual 70.13 vs judge 73). Judge: Claude Opus 4.6 |
| 7 | 2026-03-15 | mouat.co/william-mouat-ux-design/ | Elementor | Claude Sonnet 4.6 | v3.4 | 89.6 | 2 (est) | | 42 | FAIL | Highest aggregate (89.6). Clean Tailwind HTML, zero WP contamination, 100% content, working hamburger toggle. Failed only on visual (75 vs 85% minimum). 42 min wall clock, about 6.4x faster than GPT-5.4 Thinking (270 min). Judge: Claude Opus 4.6 |
| 8 | | | Elementor | | | | | | | | |

Supplementary Metrics (Unweighted)

| Metric | Claude Sonnet 4.6 | GPT-5.4 Thinking | Gemini 3.1 Pro | Notes |
| --- | --- | --- | --- | --- |
| Code Quality | | | | W3C valid, clean CSS, no !important abuse |
| Hallucination Rate | | | | Invented content, links, or schema not in source |
| Dependency Footprint | | | | Unnecessary libraries introduced |
| Plugin/Dynamic Degradation | | | | Graceful handling of WooCommerce, forms, sliders |
| Lighthouse Score | | | | Performance / Accessibility / SEO vs original |
| Asset Integrity | | | | Zero external image/font refs in output |
| Time to First Usable Output | | | | Wall clock time to renderable file |
| Token Usage | | | | API = exact count, Subscription = estimated via tokenizer |