Last updated: Mar 15, 2026 · Prompt version: v3.4


Purpose

This runbook covers the end-to-end process for testing an AI model's ability to convert a WordPress page to static HTML using the v3.4 orchestrator prompt. It is written so that anyone with access to the tools can pick up a test URL and run the full pipeline (orchestrator, Judge Agent, and score recording) without prior context.


1. Architecture Overview

The pipeline has three independent stages that run sequentially but in separate conversations (never the same chat):

Stage 1 — Orchestrator Run (the model being tested)

The orchestrator prompt (v3.4) takes a URL and produces a /dist/ folder with the converted static HTML, all assets, and a self-assessment in validation-report.html.

Stage 2 — Judge Agent (a different model from the one tested)

The Judge receives the orchestrator's output artifacts plus the original reference screenshots and independently scores the output across 7 dimensions. It returns structured JSON. Run the Judge 3 times and take the median.

Stage 3 — HITL Review (human)

A human reviews the Judge's scores alongside side-by-side screenshots, confirms or adjusts scores, and records final values in the Notion tracker.
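
The "run 3 times, take the median" step from Stage 2 could be scripted roughly as follows. This is a minimal sketch, not the pipeline's actual tooling: the Judge's JSON schema is not specified in this section, so the code assumes each judge-run-N.json holds a flat object under a hypothetical "scores" key that maps dimension names to numeric scores. The function names are also illustrative.

```python
import json
from pathlib import Path
from statistics import median


def median_scores(runs):
    """Return the per-dimension median across several judge runs.

    ASSUMPTION: each run looks like {"scores": {"<dimension>": <number>, ...}}.
    Adjust the key names to match the real Judge output.
    """
    dimensions = runs[0]["scores"].keys()
    for run in runs:
        # Refuse to aggregate runs that scored different sets of dimensions.
        if run["scores"].keys() != dimensions:
            raise ValueError("judge runs disagree on the set of scored dimensions")
    return {dim: median(run["scores"][dim] for run in runs) for dim in dimensions}


def write_judge_median(judge_results_dir):
    """Read judge-run-*.json from a judge-results/ folder and write judge-median.json."""
    folder = Path(judge_results_dir)
    run_files = sorted(folder.glob("judge-run-*.json"))
    if len(run_files) != 3:
        raise ValueError(f"expected 3 judge runs, found {len(run_files)}")
    runs = [json.loads(path.read_text(encoding="utf-8")) for path in run_files]
    result = {"scores": median_scores(runs)}
    (folder / "judge-median.json").write_text(
        json.dumps(result, indent=2), encoding="utf-8"
    )
    return result
```

Under those assumptions, calling write_judge_median("01-mouat-co/runs/claude-sonnet-4.6/judge-results") would produce the judge-median.json file that the human reviewer starts from in Stage 3. Taking the median per dimension, rather than picking one whole run, is a design choice of this sketch.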


2. Folder Structure

All test work lives inside a master test folder on the local filesystem (or Claude Project files). The structure is:

wp-to-html-tests/
├── tools/                          ← Shared tools (install once)
│   ├── node_modules/
│   ├── package.json
│   ├── puppeteer-screenshot.js     ← Reusable screenshot script
│   └── content-diff.js             ← Reusable content diff script
│
├── prompts/
│   ├── orchestrator-v3.4.md        ← Current orchestrator prompt
│   ├── orchestrator-v3.3.md        ← Previous version (archived)
│   └── judge-agent-v1.1.md         ← Current Judge Agent prompt
│
├── 01-mouat-co/                    ← One folder per test URL
│   ├── README.md                   ← Test metadata (URL, model, date, prompt version)
│   ├── reference/                  ← Step 1 screenshots of the ORIGINAL page
│   │   ├── 375.png
│   │   ├── 768.png
│   │   ├── 1024.png
│   │   └── 1440.png
│   ├── runs/
│   │   ├── claude-sonnet-4.6/      ← One subfolder per model tested
│   │   │   ├── dist/               ← The orchestrator's output
│   │   │   │   ├── index.html
│   │   │   │   ├── assets/
│   │   │   │   ├── asset-manifest.json
│   │   │   │   ├── class-replacement-log.json
│   │   │   │   └── validation-report.html
│   │   │   ├── output-screenshots/  ← Screenshots of dist/index.html
│   │   │   │   ├── 375.png
│   │   │   │   ├── 768.png
│   │   │   │   ├── 1024.png
│   │   │   │   └── 1440.png
│   │   │   ├── judge-results/
│   │   │   │   ├── judge-run-1.json
│   │   │   │   ├── judge-run-2.json
│   │   │   │   ├── judge-run-3.json
│   │   │   │   └── judge-median.json
│   │   │   └── run-log.md           ← Notes, issues, timing
│   │   ├── gpt-5.4-thinking/
│   │   ├── gemini-3.1-pro/
│   │   
│   └── comparison/                  ← Side-by-side screenshots for HITL review
│
├── 02-example-simple-page/
├── 03-example-elementor-page/
└── ...
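
Before handing a run to the Judge, it can help to confirm that the run folder matches the layout above. The following Python sketch lists whatever is missing. The file names and screenshot widths are copied from the tree; treating every one of them as mandatory is an assumption, since this section does not say which artifacts are optional. The function name is hypothetical.

```python
from pathlib import Path

# Expected orchestrator outputs, copied from the dist/ listing in the tree above.
EXPECTED_DIST_ENTRIES = [
    "index.html",
    "assets",
    "asset-manifest.json",
    "class-replacement-log.json",
    "validation-report.html",
]

# Screenshot widths used for both reference/ and output-screenshots/.
BREAKPOINTS = ["375", "768", "1024", "1440"]


def missing_run_artifacts(run_dir):
    """Return relative paths expected in a model's run folder that do not exist.

    ASSUMPTION: every entry shown in the folder tree is required for a complete run.
    """
    run_dir = Path(run_dir)
    expected = [f"dist/{name}" for name in EXPECTED_DIST_ENTRIES]
    expected += [f"output-screenshots/{width}.png" for width in BREAKPOINTS]
    return [rel for rel in expected if not (run_dir / rel).exists()]
```

For example, missing_run_artifacts("01-mouat-co/runs/claude-sonnet-4.6") returns an empty list when the run folder is complete, and the list of absent paths otherwise.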

Naming convention: Test folders are numbered sequentially, using a two-digit prefix followed by a short site or page slug (01-mouat-co, 02-clientname-homepage, 03-example-elementor-page, and so on). The number is the permanent test ID referenced in the Test Runs Log on Notion.
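
The numbering rule can also be applied programmatically. The sketch below picks the next free test ID from the existing folder names and creates the reference/, runs/, and comparison/ subfolders shown in the tree. Both function names are hypothetical, and the sketch assumes that any folder whose name starts with digits and a hyphen is a test folder. It does not create README.md, because the metadata format is not defined in this section.

```python
import re
from pathlib import Path

# Matches names such as "01-mouat-co"; shared folders like tools/ and prompts/ are ignored.
TEST_FOLDER_PATTERN = re.compile(r"^(\d+)-.+$")


def next_test_folder_name(existing_names, slug):
    """Return the next sequential folder name, e.g. '04-newsite-com'."""
    used_ids = []
    for name in existing_names:
        match = TEST_FOLDER_PATTERN.match(name)
        if match:
            used_ids.append(int(match.group(1)))
    next_id = max(used_ids, default=0) + 1
    return f"{next_id:02d}-{slug}"


def scaffold_test(root, slug):
    """Create the next numbered test folder with its standard subfolders."""
    root = Path(root)
    existing = [entry.name for entry in root.iterdir() if entry.is_dir()]
    test_dir = root / next_test_folder_name(existing, slug)
    for sub in ("reference", "runs", "comparison"):
        # exist_ok=False so an existing test folder is never silently reused.
        (test_dir / sub).mkdir(parents=True, exist_ok=False)
    return test_dir
```

Because the ID is derived from the highest existing number rather than the folder count, deleting an old test folder does not cause its ID to be reused, which keeps the IDs in the Notion log permanent.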

To set up a new test:

  1. Create the numbered folder: mkdir 04-newsite-com