Context
- Task owner: ggbond
- Repo: /home/node/.openclaw/workspace/pegainfer
- Goal: explain and, if safe, reduce the TPOT (time per output token) gap between local pegainfer pressure tests (~10ms+) and benchmark measurements (~14ms+).
- Execution policy: Codex gets full repo access: branch off main, reproduce, profile, diagnose, and land only framework-level fixes. If the root cause requires major attention changes, stop and report.
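For reference, TPOT here follows the usual serving-benchmark definition: decode latency (end-to-end latency minus time-to-first-token) divided by the number of generated tokens after the first. A minimal sketch, with an illustrative helper name and sample numbers not taken from the actual runs; it also shows why token-count accounting matters for the measured gap:

```python
def tpot_ms(e2e_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token: exclude prefill (TTFT) and the first token."""
    return (e2e_latency_ms - ttft_ms) / (output_tokens - 1)

# If the benchmark client undercounts completion tokens (e.g. 200 instead
# of the real 256), the measured TPOT is inflated for the same request.
real = tpot_ms(3000.0, 450.0, 256)      # 10.0 ms per token
inflated = tpot_ms(3000.0, 450.0, 200)  # ~12.8 ms per token
```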
Append-only Log
- [2026-03-08 17:17 Asia/Shanghai] Claimed the task, created this project page, and set up a 5-minute cron-driven follow-up plus a persistent Codex worker.
- [2026-03-08 17:19 Asia/Shanghai] Switched the 5-minute main-session cron wake to this project, updated HEARTBEAT instructions, and started persistent Codex session oc-codex-pegainfer-tpot-gap in /home/node/.openclaw/workspace/pegainfer.
- [2026-03-08 17:22 Asia/Shanghai] Recovery + progress: the first acpx prompt launch used an invalid codex flag combination and exited early, but the worker was relaunched successfully. Codex is now actively running in session oc-codex-pegainfer-tpot-gap and has already created branch investigate/tpot-gap.
- [2026-03-08 17:38 Asia/Shanghai] First code-level movement detected from Codex on branch investigate/tpot-gap: working tree now includes edits in src/model.rs and src/server_engine.rs (diff stat: 10 insertions, 3 deletions total). Worker is still running, so this looks like an in-progress framework-layer investigation rather than a finished conclusion.
- [2026-03-08 17:40 Asia/Shanghai] Manual checkpoint: Codex is not idle; it is actively building the repo (cargo build --release --offline, currently compiling CUDA target gated_delta_rule.cu). Current uncommitted changes now span src/http_server/mod.rs, src/http_server/openai_v1.rs, src/model.rs, and src/server_engine.rs (99 insertions, 7 deletions). The HTTP/server-side diff adds streamed usage support, which may be part of the benchmark-path investigation. I also queued a follow-up asking Codex to explain the hypothesis and whether this actually explains the tpot gap.
- [2026-03-08 17:41 Asia/Shanghai] Validation build appears to have completed for the current Codex patch set: no active cargo/nvcc processes remain, and target/release/pegainfer plus release artifacts were updated in the last few minutes. The worker is still running in session oc-codex-pegainfer-tpot-gap with uncommitted edits across src/http_server/mod.rs, src/http_server/openai_v1.rs, src/model.rs, and src/server_engine.rs (current diff stat: 100 insertions, 9 deletions). Still waiting on Codex to explain whether this patch is instrumentation/protocol alignment or an actual tpot-gap fix.
- [2026-03-08 17:47 Asia/Shanghai] Major milestone: Codex has cleaned the working tree, committed and pushed branch investigate/tpot-gap at commit dd3c2e6 ("Fix streamed usage accounting and reuse Qwen3 decode graph"), and opened PR #11: https://github.com/xiaguan/pegainfer/pull/11 against main.
- [2026-03-08 17:47 Asia/Shanghai] Root-cause checkpoint from Codex: the TPOT gap is mostly benchmark accounting plus real long-context decode scaling, not just HTTP overhead. Before the fix, a live vllm bench serve run at 1->256 reproduced about 14.7-14.8ms TPOT; after the patch, the same path dropped to about 13.4ms. Only safe framework fixes were landed: reuse the Qwen3 non-streaming decode CUDA graph and emit streamed usage chunks so benchmark clients consume real completion token counts. The remaining gap appears tied to attention/decode scaling, so Codex stopped short of risky kernel work. PR #11 carries the patch and evidence.
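For context, "streamed usage chunks" refers to the OpenAI-compatible streaming convention: when a client sets stream_options to include usage, the server emits one final chunk with empty choices and a usage object before the [DONE] sentinel, so benchmark clients can read real token counts instead of estimating them. A sketch of that final wire frame, with illustrative IDs and counts not taken from the actual benchmark run:

```
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":1,"completion_tokens":256,"total_tokens":257}}

data: [DONE]
```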
- [2026-03-08 17:52 Asia/Shanghai] Worker lifecycle update: the persistent Codex session oc-codex-pegainfer-tpot-gap is now dead/terminated; the queue-owner process is defunct. This is not a new blocker for the PR itself: branch investigate/tpot-gap is clean and PR #11 remains OPEN/CLEAN.
- [2026-03-08 17:58 Asia/Shanghai] Final wrap-up: release e2e validation passed for both Qwen3-4B (cargo test --release --test e2e -- --nocapture) and Qwen3.5-4B (cargo test --release --test e2e_qwen35 -- --nocapture). PR #11 is now merged into main (merge commit 1fabe79). Conclusion: the change safely fixes benchmark accounting / streamed-usage alignment and Qwen3 decode-graph reuse, shrinking the measured 1->256 TPOT gap without touching risky attention kernels; the remaining gap is most consistent with real long-context decode scaling. Cron and heartbeat automation were then disabled.