AI Product Investigation · Started Feb 21, 2026 · Active
Testing whether AI can reliably convert WordPress pages into clean, production-ready static HTML — and what that reveals about workflow design, evaluation methodology, and product potential.
I designed a multi-step conversion pipeline with separate orchestrator and judge agents, built a weighted scoring framework with hard pass/fail gates, and iterated through six major prompt versions. The core finding: visual similarity is easy to fake. The real challenge is building a workflow that can verify content accuracy, structural integrity, and interaction fidelity at a production quality bar.
Each model output is scored on a 0–5 scale across seven weighted dimensions. Scores are calculated for each model on each page type (Simple, Elementor), then averaged across page types to give that model's overall score.
Scoring weights (v3.2): Visual Likeness 25% · Content Likeness 25% · Interaction Fidelity 10% · SEO Fidelity 10% · Accessibility 5% · Asset Integrity 5% · Turns to Completion 20%
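As a rough illustration of how the v3.2 weights combine, here is a minimal sketch. The weights are taken from the list above; the function name, the dictionary keys, and the choice to express the result as a percentage of the 5-point maximum are assumptions made for this example, not the project's actual code. It also assumes Turns to Completion has already been mapped onto the same 0–5 scale as the other dimensions.

```python
# Weights from the v3.2 list, expressed as whole percentages (they sum to 100).
WEIGHTS = {
    "visual_likeness": 25,
    "content_likeness": 25,
    "interaction_fidelity": 10,
    "seo_fidelity": 10,
    "accessibility": 5,
    "asset_integrity": 5,
    "turns_to_completion": 20,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Return the weighted total as a percentage of the 5-point maximum.

    `scores` must hold one 0-5 value for each of the seven dimensions.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the seven dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"{name} must be between 0 and 5, got {value}")
    # Each weight is a percentage, so dividing by the 5-point maximum
    # yields a 0-100 result directly.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 5


example = {
    "visual_likeness": 5,
    "content_likeness": 4,
    "interaction_fidelity": 4,
    "seo_fidelity": 5,
    "accessibility": 3,
    "asset_integrity": 5,
    "turns_to_completion": 4,
}
print(weighted_score(example))  # 87.0
```

Keeping the weights as whole percentages avoids floating-point drift when summing, which matters when a result sits right at a pass/fail boundary.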
Some dimensions carry minimum score requirements that apply regardless of the overall weighted total. An output that scores well on visual likeness but drops critical content or breaks accessibility cannot pass.
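A hard gate of this kind could be sketched as a check that runs independently of the weighted total. The text does not say which dimensions are gated or what their minimums are, so the dimensions and floor values in `MIN_SCORES` below are purely hypothetical placeholders, as are the function and key names.

```python
# HYPOTHETICAL minimums: the actual gated dimensions and floor values
# are not given in the text; these are placeholders for illustration only.
MIN_SCORES = {
    "content_likeness": 4.0,
    "accessibility": 3.0,
}


def failed_gates(
    scores: dict[str, float],
    min_scores: dict[str, float] = MIN_SCORES,
) -> list[str]:
    """Return the gated dimensions whose 0-5 score falls below its minimum.

    A missing dimension is treated as a failure, since it cannot be verified.
    """
    return [
        name
        for name, floor in min_scores.items()
        if scores.get(name, 0.0) < floor
    ]


# An output that looks right but dropped content fails the gate
# no matter how high its other scores are.
looks_good_but_lossy = {
    "visual_likeness": 5.0,
    "content_likeness": 2.0,
    "accessibility": 4.0,
}
print(failed_gates(looks_good_but_lossy))  # ['content_likeness']
```

Returning the list of failed dimensions, and not just a boolean, makes it easier to see why an otherwise high-scoring output was rejected.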
A model needs an overall weighted score of 85% or higher to qualify for production use. Models below the threshold are not considered, regardless of cost.
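The qualification rule might look something like the sketch below. It assumes the overall score is an unweighted mean of the per-page-type percentages, which the text implies but does not spell out. The function name and the sample figures are illustrative only and are not results from the investigation.

```python
from statistics import mean

PASS_THRESHOLD = 85.0  # minimum overall weighted score, in percent


def qualifies(page_type_scores: dict[str, float]) -> bool:
    """Decide whether a model clears the production threshold.

    `page_type_scores` maps each page type to that model's weighted
    score as a percentage. Assumption: page types count equally.
    """
    if not page_type_scores:
        raise ValueError("at least one page type score is required")
    overall = mean(page_type_scores.values())
    return overall >= PASS_THRESHOLD


# Illustrative numbers only.
print(qualifies({"Simple": 90.0, "Elementor": 82.0}))  # True  (mean 86.0)
print(qualifies({"Simple": 88.0, "Elementor": 80.0}))  # False (mean 84.0)
```

Cost does not appear in the function at all: under the rule as stated, it only becomes a factor when comparing models that have already cleared the threshold.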