Purpose: Minimal set of experiments to lock every maze design and evaluation decision.

All initial experiments on GridWorld

All maze design experiments initially run on GridWorld only. GridWorld is the most mature domain, both in how well it is understood and in implementation. The core design questions (grid size, wall structure, chain depth, distractors, prompting, context, scoring) concern the maze JSON spec, not the rendering. If a 10x10 depth-2 chain is the right difficulty in GridWorld, that probably holds regardless of domain.
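To make the spec/rendering split concrete, here is a minimal sketch in Python. The actual maze JSON schema is not given in this plan, so every field name below (rows, cols, chain_depth, num_distractors) and the render_request helper are hypothetical placeholders. The point is only that the spec carries no domain-specific fields, so the same spec can be handed to any renderer.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MazeSpec:
    """Hypothetical domain-independent maze spec (field names are assumptions)."""
    rows: int
    cols: int
    chain_depth: int
    num_distractors: int


def render_request(spec: MazeSpec, domain: str) -> dict:
    """Pair an unchanged spec with a target domain; the domain lives outside the spec."""
    return {"domain": domain, "maze": asdict(spec)}


# The 10x10 depth-2 example from the text, with no distractors (assumed value).
spec = MazeSpec(rows=10, cols=10, chain_depth=2, num_distractors=0)
spec_json = json.dumps(asdict(spec), sort_keys=True)
```

Under this sketch, switching from GridWorld to NL or 3D changes only the domain argument, which is what Phase B would then validate.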

Once maze design is locked based on the initial set of experiments, a focused cross-domain validation checks whether the decisions hold in NL and 3D. This keeps maze design from being blocked on the 3D domain, which is not yet implemented. The NL and 3D implementations can be built in parallel with Phase A.

Sequence of experiments

Phase A: GridWorld Only (~4 weeks)

→ Maze design finalized

Phase B: Cross-Domain Validation (~2 weeks, overlaps with maze authoring)

→ Maze design across domains locked

Prompt Bootstrapping

Exps 1 and 2 need a prompt to evaluate models. Exp 3 determines the right prompt, but Exp 3 in turn needs meaningfully difficult mazes from Exp 1. This is a circular dependency.

Resolution: Exps 1 and 2 use a reasonable default prompt, the "Standard" condition (goal statement + mechanism descriptions + valid action list). All Exp 1 and 2 results are explicitly tagged as "default prompt."
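A rough sketch of how the "Standard" condition and the result tagging could look in Python. The three prompt parts and the "default prompt" tag come from the text above; the function names, the prompt layout, and the result-record keys are assumptions made for illustration, not the project's actual harness.

```python
PROMPT_CONDITION = "Standard"
PROMPT_TAG = "default prompt"


def build_standard_prompt(goal: str, mechanisms: list[str], valid_actions: list[str]) -> str:
    """Assemble goal statement + mechanism descriptions + valid action list (layout assumed)."""
    lines = [f"Goal: {goal}", "Mechanisms:"]
    lines += [f"- {m}" for m in mechanisms]
    lines.append("Valid actions: " + ", ".join(valid_actions))
    return "\n".join(lines)


def tag_result(result: dict) -> dict:
    """Return a copy of a result record marked as produced under the default prompt."""
    return {**result, "prompt_condition": PROMPT_CONDITION, "prompt_tag": PROMPT_TAG}


# Placeholder inputs; real goal text, mechanisms, and actions come from the maze spec.
prompt = build_standard_prompt(
    goal="<goal statement>",
    mechanisms=["<mechanism description 1>", "<mechanism description 2>"],
    valid_actions=["<action 1>", "<action 2>"],
)
tagged = tag_result({"exp": 1, "maze_id": "<maze id>"})
```

Carrying the tag on every record keeps Exp 1 and 2 results distinguishable from any runs made after Exp 3 selects a prompt.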