Four training runs fine-tuning Qwen3.5-397B-A17B on AMD MI355X GPUs using amdpilot kernel agent trajectories. v1 established the pipeline; v2 introduced 3-view data; v3 fixed the training recipe; v4 fixed a critical data pipeline bug (66% of data silently dropped) and introduced leak-free evaluation.
| Metric | v1 | v2 | v3 | v4 (latest) |
|---|---|---|---|---|
| Key change | Baseline | 3-view data | Recipe fix | Data pipeline fix |
| Effective train examples | ~100 | ~100 (66% dropped) | ~100 (66% dropped) | 270 (all 3 views working) |
| Train loss | 0.163 | 0.085 | 0.059 | 0.199 |
| Eval loss | n/a | n/a | 0.044 (leaked) | 0.055 (clean, held-out) |
| Eval integrity | none | none | 100% leaked | 0% leaked |
| LoRA rank / alpha | 16 / 32 | 16 / 32 | 32 / 64 | 32 / 64 |
| Epochs / Steps | 3 / 18 | 3 / 12 | 10 / 130 | 10 / 200 |
| Training time | 57 min | 1h 32m | 5h 10m | 7h 59m |
| wandb | disabled | v2 | v3 | v4 |
| HuggingFace | v1 | v2 | v3 | v4 |
We discovered that only 100 out of 296 training examples actually reached the model in v2 and v3. The other 196 were silently dropped by LlamaFactory's OpenAI converter due to broken role alternation:
- **Prefix+suffix view:** a user-role separator message sits between the prefix and the suffix, so two consecutive user messages violate the converter's strict alternation rule. 0/92 survived.
- **Recap view:** a user-role recap message triggers the same violation. 0/102 survived.

This means the entire 3-view data strategy (the supposed key improvement in v2) was never actually used: v2 and v3 both trained on only full-view trajectories truncated at 32K, losing the final solution in 86/100 cases.
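To make the failure mode concrete, here is a minimal reproduction of a strict user/assistant alternation check in the spirit of what the converter enforces. The function name and logic are illustrative assumptions, not LlamaFactory's actual API; the point is only that any second consecutive user turn fails the check.

```python
def passes_strict_alternation(messages):
    """Reject any conversation whose non-system roles do not strictly
    alternate user -> assistant -> user -> ... (leading system is skipped)."""
    roles = [m["role"] for m in messages if m["role"] != "system"]
    for i, role in enumerate(roles):
        expected = "user" if i % 2 == 0 else "assistant"
        if role != expected:
            return False
    return True

# A prefix+suffix view with a user-role separator between the two halves:
prefix_suffix = [
    {"role": "user", "content": "Optimize this kernel ..."},
    {"role": "assistant", "content": "Profiling the baseline ..."},
    {"role": "user", "content": "[separator between prefix and suffix]"},
    {"role": "user", "content": "Suffix context ..."},  # second consecutive user turn
    {"role": "assistant", "content": "Final solution ..."},
]
print(passes_strict_alternation(prefix_suffix))  # False -> example silently dropped
```

Under a strict check like this, the separator message alone is enough to discard the whole example, which matches the 0/92 and 0/102 survival counts above.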
The v4 fix is two-pronged:

- `ensure_valid_alternation()` trims trailing tool messages so the turn count stays even.
- `passes_converter_validation()` runs as a post-filter to catch remaining edge cases.