Ablation Study - How Qwen3.5 learn to use profiling

Problem

Across 6 full runs (16+ trials) of our MAS kernel optimization pipeline, the executor agent (Qwen 3.5) almost never ran a profiler (rocprof or torch.profiler) despite the workflow prompt explicitly requiring it. Instead, it relied entirely on brute-force config search -- guessing kernel parameters without understanding whether bottlenecks were compute-bound or memory-bound.

Phase 1: Initial Ablation (Experiments A and B)

Two experiments on the same task (Kimi K2.5 fused_moe_triton kernel tuning, PR #19228), each running one trial (~60 min) from a clean container.

Primary metric: Did the agent actually execute profiling?

Experiment A (Post-trial hard-check): After each trial, the orchestrator greps agent output for profiling keywords. If none found, it injects a concrete rocprof command into the next trial's prompt.
Experiment B (Real-time Claude nudge): The nudge agent (Claude Opus 4.6 via AMD proxy) monitors in real-time. At the 5-minute mark, if no profiling activity is detected, it writes an urgent nudge with the exact rocprof --stats command into the container.

	Experiment A	Experiment B
Agent ran profiling?	No	Yes (rocprof --stats)
Profiling insight	N/A	fused_moe_kernel: 86.9% GPU time
Intervention timing	Post-trial (too late)	Mid-trial at 5 min (actionable)
Score	74.4	70.4

The only behavioral difference: Experiment B's Claude nudge delivered the profiling command mid-trial, and the agent executed it.

Phase 2: Critical Bug Discovery

Before the follow-up experiments, we found a critical silent bug in _build_stages() in config.py. The YAML used target, duration_minutes, and description, but the parser only read target_ms, timeout_s, and hints. This meant:

All stage targets defaulted to 999 (stages never auto-advanced)
All stage timeouts defaulted to 3600s (ignoring configured durations)
All stage descriptions were empty (agent never received per-stage instructions like "profile first" or "focus on small batch sizes")

This bug affected every previous run. Fixed by adding fallback aliases.

Phase 3: Run 1 -- Config fix + 8h budget

Changes: Fixed config parsing, 90-min trials, targets 25/50/70, 8h budget, combined profiling enforcement.