AI Product Investigation · Started Feb 21, 2026 · Active

Testing whether AI can reliably convert WordPress pages into clean, production-ready static HTML — and what that reveals about workflow design, evaluation methodology, and product potential.

I designed a multi-step conversion pipeline with separate orchestrator and judge agents, built a weighted scoring framework with hard pass/fail gates, and iterated through six major prompt versions. The core finding: visual similarity is easy to fake — the real challenge is building a workflow that can verify content accuracy, structural integrity, and interaction fidelity at a production quality bar.


1. Evaluation Design

Scoring Framework

Each model output is scored from 0 to 5 on seven weighted dimensions. Scores are computed per model and per page type (Simple, Elementor), then averaged across page types to give each model an overall score.

Scoring weights (v3.2): Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%
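The weighting can be sketched as a plain weighted sum. The weights below are the v3.2 values listed above; the function names, the dictionary keys, the example scores, and the use of an unweighted mean across page types are assumptions made for illustration, not the pipeline's actual code.

```python
# Weights as listed for v3.2 (they sum to 1.0).
WEIGHTS = {
    "visual_likeness": 0.25,
    "content_likeness": 0.25,
    "interaction_fidelity": 0.10,
    "seo_fidelity": 0.10,
    "accessibility": 0.05,
    "asset_integrity": 0.05,
    "turns_to_completion": 0.20,
}


def weighted_score(scores):
    """Weighted 0-5 score for one model on one page type."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)


def overall_score(per_page_type):
    """Assumed aggregation: unweighted mean of the per-page-type scores."""
    totals = [weighted_score(s) for s in per_page_type.values()]
    return sum(totals) / len(totals)


# Hypothetical scores for illustration only (not measured results).
example = {
    "simple": dict.fromkeys(WEIGHTS, 5),
    "elementor": dict.fromkeys(WEIGHTS, 4),
}
print(round(overall_score(example), 2))
```

Keeping the weights in one table makes a version bump (for example, from v3.2 to a later revision) a one-place change, and the missing-dimension check prevents a partially judged output from silently scoring as if it were complete.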

Hard Thresholds (Must-Pass)

Some dimensions carry a minimum score that must be met regardless of the overall weighted total. An output that scores well on visual likeness but drops critical content or breaks accessibility cannot pass.
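A hard gate is checked separately from the weighted total, so a high score elsewhere cannot compensate for a failure. The sketch below shows one way to express that; the gated dimensions and their minimum values are hypothetical placeholders, since the actual thresholds are not stated here.

```python
# Hypothetical minimums on the 0-5 scale; the real values are not given in the text.
HARD_MINIMUMS = {
    "content_likeness": 4,
    "accessibility": 3,
}


def failed_gates(scores, minimums=HARD_MINIMUMS):
    """Return the gated dimensions whose score falls below its minimum.

    A gated dimension with no score at all is treated as a failure.
    """
    return sorted(d for d, floor in minimums.items() if scores.get(d, 0) < floor)


def passes_gates(scores, minimums=HARD_MINIMUMS):
    """True only if every gated dimension meets its minimum."""
    return not failed_gates(scores, minimums)


# Illustrative output: strong visuals, dropped content.
scores = {"visual_likeness": 5, "content_likeness": 2, "accessibility": 4}
print(failed_gates(scores))  # ['content_likeness']
print(passes_gates(scores))  # False
```

Returning the list of failed dimensions, rather than a bare boolean, gives the judge agent something concrete to report back when an output is rejected.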

Weighted Dimensions

Production Threshold

A model needs an overall weighted score of at least 85% to qualify for production use. Models below the threshold are not considered, regardless of cost.
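As a rough sketch of the final decision, the check below assumes the 85% figure is the overall weighted score expressed as a fraction of the 5-point maximum (so 4.25 out of 5), and that the hard gates must also pass. Both readings, and all names, are assumptions for illustration.

```python
MAX_SCORE = 5.0
PRODUCTION_THRESHOLD = 0.85  # assumed to mean 85% of the 5-point maximum


def qualifies_for_production(overall, gates_passed=True):
    """Apply the production bar to an overall weighted score on the 0-5 scale.

    Cost is deliberately not an input: a model below the bar is excluded
    however cheap it is.
    """
    if not 0 <= overall <= MAX_SCORE:
        raise ValueError("overall score must be between 0 and 5")
    return gates_passed and (overall / MAX_SCORE) >= PRODUCTION_THRESHOLD


# Hypothetical overall scores, not measured results.
for score in (4.5, 4.25, 4.0):
    print(score, qualifies_for_production(score))
```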