Problem

Across 6 full runs (16+ trials) of our MAS kernel optimization pipeline, the executor agent (Qwen 3.5) almost never ran a profiler (rocprof or torch.profiler), even though the workflow prompt explicitly requires it. Instead, it relied entirely on brute-force config search, guessing kernel parameters without establishing whether the bottleneck was compute-bound or memory-bound.


Phase 1: Initial Ablation (Experiments A and B)

We ran two experiments on the same task (Kimi K2.5 fused_moe_triton kernel tuning, PR #19228), each consisting of one trial (~60 min) started from a clean container.

Primary metric: Did the agent actually execute profiling?

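| Metric               | Experiment A          | Experiment B                     |
|----------------------|-----------------------|----------------------------------|
| Agent ran profiling? | No                    | Yes (rocprof --stats)            |
| Profiling insight    | N/A                   | fused_moe_kernel: 86.9% GPU time |
| Intervention timing  | Post-trial (too late) | Mid-trial at 5 min (actionable)  |
| Score                | 74.4                  | 70.4                             |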

The only behavioral difference: Experiment B's Claude nudge delivered the profiling command mid-trial, and the agent executed it.
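
The primary metric (did the agent actually execute profiling?) could be checked automatically rather than by reading transcripts. The sketch below is illustrative only: it assumes the pipeline records the agent's shell commands as a list of strings, and the function name and the example script names are hypothetical. It only detects that a profiler was invoked, not that the run succeeded or that the agent used the output.

```python
import re

# Assumed patterns: the two profilers named in this write-up.
_PROFILER_PATTERNS = [
    re.compile(r"\brocprof\b"),
    re.compile(r"\btorch\.profiler\b"),
]


def profiling_commands(commands: list[str]) -> list[str]:
    """Return the logged commands that invoke a known profiler."""
    return [
        cmd for cmd in commands
        if any(pattern.search(cmd) for pattern in _PROFILER_PATTERNS)
    ]


# Hypothetical command log for one trial.
trial_log = [
    "python tune.py --batch 64",
    "rocprof --stats python bench.py",
]
ran_profiling = bool(profiling_commands(trial_log))
```

An empty result for a trial would correspond to "No" in the "Agent ran profiling?" row.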


Phase 2: Critical Bug Discovery

Before the follow-up experiments, we found a critical silent bug in _build_stages() in config.py. The YAML used the keys target, duration_minutes, and description, but the parser only read target_ms, timeout_s, and hints. As a result, the values set under those three YAML keys were never read, and no error was raised.

This bug affected every previous run. We fixed it by adding fallback aliases for the YAML key names.
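
A fallback-alias fix of this kind might look like the sketch below. It is not the actual _build_stages() code: the function name normalize_stage is hypothetical, the minutes-to-seconds conversion for duration_minutes is an assumption, and target is assumed to already be in the same unit as target_ms. Only the six key names come from the description above.

```python
from typing import Any

# Canonical parser key -> (YAML alias, conversion applied to the alias value).
# The conversions are assumptions, not taken from the real config.py.
_ALIASES = {
    "target_ms": ("target", lambda v: v),                  # assumed same unit
    "timeout_s": ("duration_minutes", lambda v: v * 60),   # minutes -> seconds
    "hints": ("description", lambda v: v),
}


def normalize_stage(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a raw YAML stage dict onto the keys the parser reads."""
    stage: dict[str, Any] = {}
    for canonical, (alias, convert) in _ALIASES.items():
        if canonical in raw:
            # The canonical key wins if both spellings are present.
            stage[canonical] = raw[canonical]
        elif alias in raw:
            stage[canonical] = convert(raw[alias])

    # Fail loudly on anything unrecognized, so a misspelled key cannot
    # be dropped silently the way the original keys were.
    known = set(_ALIASES) | {alias for alias, _ in _ALIASES.values()}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unrecognized stage keys: {sorted(unknown)}")
    return stage


stage = normalize_stage(
    {"target": 25, "duration_minutes": 90, "description": "profile first"}
)
```

Rejecting unknown keys is a design choice in this sketch; a real stage definition probably carries other fields, which would need to be added to the known set.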


Phase 3: Run 1 -- Config fix + 8h budget

Changes: Fixed config parsing, 90-min trials, targets 25/50/70, 8h budget, combined profiling enforcement.