Last updated: Mar 15, 2026 · Prompt version: v3.4
This runbook covers the end-to-end process for testing an AI model's ability to convert a WordPress page to static HTML using the v3.4 orchestrator prompt. It is written so that anyone with access to the tools can pick up a test URL and run the full pipeline (orchestrator, Judge Agent, and score recording) without prior context.
The pipeline has three independent stages that run sequentially but in separate conversations (never the same chat):
Stage 1 — Orchestrator Run (the model being tested)
The orchestrator prompt (v3.4) takes a URL and produces a /dist/ folder with the converted static HTML, all assets, and a self-assessment in validation-report.html.
Stage 2 — Judge Agent (a different model from the one tested)
The Judge receives the orchestrator's output artifacts plus the original reference screenshots and independently scores the output across 7 dimensions. It returns structured JSON. Run the Judge 3 times and take the median of the three runs.
Stage 3 — HITL Review (human)
A human reviews the Judge's scores alongside side-by-side screenshots, confirms or adjusts scores, and records final values in the Notion tracker.
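The "run 3 times, take median" step in Stage 2 can be sketched as below. This is a minimal illustration, not the pipeline's actual script: it assumes each Judge run is an object with a `scores` map from dimension name to number, and that the median is taken per dimension. The real judge-agent-v1.1 schema is not shown in this section, so the field name `scores` and the dimension names `layout` and `typography` are placeholders.

```javascript
// Assumed shape of one Judge run: { scores: { <dimension>: <number>, ... } }.
// Field and dimension names are hypothetical; adapt them to the real Judge JSON.

// Median of a numeric array: the middle value for odd lengths,
// the mean of the two middle values for even lengths.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Combine several Judge runs into one object holding the per-dimension median.
function medianScores(runs) {
  const dimensions = Object.keys(runs[0].scores);
  const scores = {};
  for (const dim of dimensions) {
    scores[dim] = median(runs.map((run) => run.scores[dim]));
  }
  return { scores };
}

// Illustrative values only, standing in for judge-run-1.json to judge-run-3.json.
const exampleRuns = [
  { scores: { layout: 8, typography: 7 } },
  { scores: { layout: 6, typography: 9 } },
  { scores: { layout: 7, typography: 9 } },
];
const exampleMedian = medianScores(exampleRuns);
console.log(JSON.stringify(exampleMedian)); // {"scores":{"layout":7,"typography":9}}
```

Taking the median rather than the mean keeps a single outlier Judge run from shifting the recorded score.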
All test work lives inside a master test folder on the local filesystem (or Claude Project files). The structure is:
wp-to-html-tests/
├── tools/                                  ← Shared tools (install once)
│   ├── node_modules/
│   ├── package.json
│   ├── puppeteer-screenshot.js             ← Reusable screenshot script
│   └── content-diff.js                     ← Reusable content diff script
│
├── prompts/
│   ├── orchestrator-v3.4.md                ← Current orchestrator prompt
│   ├── orchestrator-v3.3.md                ← Previous version (archived)
│   └── judge-agent-v1.1.md                 ← Current Judge Agent prompt
│
├── 01-mouat-co/                            ← One folder per test URL
│   ├── README.md                           ← Test metadata (URL, model, date, prompt version)
│   ├── reference/                          ← Step 1 screenshots of the ORIGINAL page
│   │   ├── 375.png
│   │   ├── 768.png
│   │   ├── 1024.png
│   │   └── 1440.png
│   ├── runs/
│   │   ├── claude-sonnet-4.6/              ← One subfolder per model tested
│   │   │   ├── dist/                       ← The orchestrator's output
│   │   │   │   ├── index.html
│   │   │   │   ├── assets/
│   │   │   │   ├── asset-manifest.json
│   │   │   │   ├── class-replacement-log.json
│   │   │   │   └── validation-report.html
│   │   │   ├── output-screenshots/         ← Screenshots of dist/index.html
│   │   │   │   ├── 375.png
│   │   │   │   ├── 768.png
│   │   │   │   ├── 1024.png
│   │   │   │   └── 1440.png
│   │   │   ├── judge-results/
│   │   │   │   ├── judge-run-1.json
│   │   │   │   ├── judge-run-2.json
│   │   │   │   ├── judge-run-3.json
│   │   │   │   └── judge-median.json
│   │   │   └── run-log.md                  ← Notes, issues, timing
│   │   ├── gpt-5.4-thinking/
│   │   └── gemini-3.1-pro/
│   │
│   └── comparison/                         ← Side-by-side screenshots for HITL review
│
├── 02-example-simple-page/
├── 03-example-elementor-page/
└── ...
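As a quick sanity check against the tree above, a run folder can be scanned for the artifacts it is expected to contain. The sketch below is an assumption about how one might do that with Node's built-in `fs` and `path` modules; `missingRunPaths` and `EXPECTED_RUN_PATHS` are hypothetical names, not part of the shared tools.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Viewport widths used for the screenshot filenames in the tree.
const VIEWPORTS = ["375", "768", "1024", "1440"];

// Paths expected inside one model's run folder, e.g.
// 01-mouat-co/runs/claude-sonnet-4.6/ (taken from the tree; adjust if it changes).
const EXPECTED_RUN_PATHS = [
  "dist/index.html",
  "dist/assets",
  "dist/asset-manifest.json",
  "dist/class-replacement-log.json",
  "dist/validation-report.html",
  ...VIEWPORTS.map((width) => `output-screenshots/${width}.png`),
  "judge-results/judge-run-1.json",
  "judge-results/judge-run-2.json",
  "judge-results/judge-run-3.json",
  "judge-results/judge-median.json",
  "run-log.md",
];

// Return the expected relative paths that do not exist under runDir.
function missingRunPaths(runDir) {
  return EXPECTED_RUN_PATHS.filter(
    (relative) => !fs.existsSync(path.join(runDir, relative))
  );
}
```

Calling `missingRunPaths("01-mouat-co/runs/claude-sonnet-4.6")` from the master folder returns an empty array once all three stages have written their files. Before Stage 2 has run, the `judge-results/` entries will naturally show up as missing.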
Naming convention: folders are numbered sequentially (01-mouat-co, 02-clientname-homepage, 03-example-elementor-page, and so on). The number is the permanent test ID referenced in the Test Runs Log in Notion.
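Picking the next folder name by hand is easy to get wrong once several tests exist, so here is a small sketch of the numbering rule. It assumes IDs are two-digit, zero-padded, and never reused, so the next ID is the highest existing one plus one. `nextTestFolder` is a hypothetical helper, and the slug is supplied by the caller.

```javascript
// Return the next test folder name, given the entries already in the master folder.
// Entries without a numeric "NN-" prefix (tools, prompts) are ignored.
function nextTestFolder(existingNames, slug) {
  const ids = existingNames
    .map((name) => /^(\d+)-/.exec(name))
    .filter((match) => match !== null)
    .map((match) => Number(match[1]));
  const nextId = ids.length === 0 ? 1 : Math.max(...ids) + 1;
  return `${String(nextId).padStart(2, "0")}-${slug}`;
}

const existingFolders = [
  "tools",
  "prompts",
  "01-mouat-co",
  "02-example-simple-page",
  "03-example-elementor-page",
];
console.log(nextTestFolder(existingFolders, "newsite-com")); // 04-newsite-com
```

Using the maximum existing ID rather than a folder count keeps IDs permanent even if an older test folder is archived or removed.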
To set up a new test:
mkdir 04-newsite-com