Across 6 full runs (16+ trials) of our MAS kernel optimization pipeline, the executor agent (Qwen 3.5) almost never ran a profiler (rocprof or torch.profiler) despite the workflow prompt explicitly requiring it. Instead, it relied entirely on brute-force config search -- guessing kernel parameters without understanding whether bottlenecks were compute-bound or memory-bound.
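The failure mode above can be sketched concretely. This is a hypothetical illustration of the brute-force search the agent fell back on: enumerate kernel parameter combinations and keep the fastest, with no profiling signal to prune or direct the space. The parameter names and the `bench` callable are assumptions, not the pipeline's actual interface.

```python
import itertools

def brute_force_search(bench, block_sizes=(64, 128, 256),
                       num_warps=(4, 8), num_stages=(2, 3, 4)):
    """Exhaustively time every config; bench(*cfg) is a hypothetical
    benchmark hook returning latency in ms for one candidate."""
    best_cfg, best_ms = None, float("inf")
    for cfg in itertools.product(block_sizes, num_warps, num_stages):
        ms = bench(*cfg)  # time one candidate config, blind to the bottleneck
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```

Every candidate costs a full benchmark run, which is why this approach burns trial time that a single profiler pass could have saved.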
Two experiments on the same task (Kimi K2.5 fused_moe_triton kernel tuning, PR #19228), each running one trial (~60 min) from a clean container.
Primary metric: Did the agent actually execute profiling?
In Experiment A, the intervention injected the rocprof command into the next trial's prompt; in Experiment B, it injected the rocprof --stats command into the running container mid-trial.

| | Experiment A | Experiment B |
|---|---|---|
| Agent ran profiling? | No | Yes (rocprof --stats) |
| Profiling insight | N/A | fused_moe_kernel: 86.9% GPU time |
| Intervention timing | Post-trial (too late) | Mid-trial at 5 min (actionable) |
| Score | 74.4 | 70.4 |
The only behavioral difference: Experiment B's Claude nudge delivered the profiling command mid-trial, and the agent executed it.
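To show what the agent gains from that one command: a minimal sketch of pulling the dominant kernel out of a rocprof --stats CSV. The column names (`Name`, `Percentage`) and the sample rows are assumptions about the output format, not captured output from the trial; adjust to the actual header of the stats file.

```python
import csv
import io

def top_kernel(stats_csv_text):
    """Return (kernel_name, gpu_time_percentage) for the hottest kernel.
    Assumes a rocprof-style stats CSV with Name and Percentage columns."""
    rows = list(csv.DictReader(io.StringIO(stats_csv_text)))
    best = max(rows, key=lambda r: float(r["Percentage"]))
    return best["Name"], float(best["Percentage"])

# Synthetic example mirroring Experiment B's finding (values illustrative):
sample = """Name,Calls,TotalDurationNs,Percentage
fused_moe_kernel,1200,869000000,86.9
elementwise_kernel,5000,131000000,13.1
"""
print(top_kernel(sample))
```

One such read is enough to tell the agent which kernel to tune, instead of sweeping configs blind.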
Before the follow-up experiments, we found a critical silent bug in _build_stages() in config.py: the YAML used target, duration_minutes, and description, but the parser only read target_ms, timeout_s, and hints. The YAML values were silently ignored, and every stage fell back to parser defaults.
The bug affected every previous run. We fixed it by adding fallback aliases for the YAML keys.
Changes:
- Fixed config parsing
- 90-minute trials
- Targets 25/50/70
- 8-hour budget
- Combined profiling enforcement
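One way the profiling-enforcement change could be implemented is a gate that rejects a trial's tuning results unless its shell history shows a profiler actually ran. This is a hypothetical sketch, not the pipeline's actual mechanism; the marker strings come from the profilers named earlier in the writeup.

```python
# Markers for the two profilers the workflow prompt requires.
PROFILER_MARKERS = ("rocprof", "torch.profiler")

def profiling_ran(shell_log: str) -> bool:
    """Return True if the trial's command log shows any profiler invocation."""
    return any(marker in shell_log for marker in PROFILER_MARKERS)

# A trial that only swept configs would be rejected;
# one that profiled first would pass.
print(profiling_ran("python tune_configs.py --sweep"))
print(profiling_ran("rocprof --stats python bench_moe.py"))
```

A check like this turns the prompt's "must profile" requirement into a hard constraint rather than an instruction the agent can skip.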