Ainfera · Technical Report · v1.2 · Internal · 2026-05-21 · Build contract for ainfera-ai/routing (AIN-188)

---

Abstract

*We present the methodology for Ainfera Routing, a control plane that selects, per AI-agent request, the highest-quality model satisfying the agent's hard budget and latency constraints, then settles and audits the call. Routing is framed as constrained optimization rather than weighted-sum scalarization. The objective's quality term decomposes into a commodity prior, seeded from public benchmarks, and a proprietary empirical term learned from transaction outcomes that only Ainfera observes. Because agent payment, identity, and transport are commoditizing into open standards, the empirical term is the system's sole durable advantage: it compounds with traffic and cannot be reproduced by copying open protocols. We specify the objective, the runtime, the learning strategy, and the constraints required to preserve this advantage.*

---

## 1. Introduction

AI agents issue inference requests autonomously, under hard budget and latency limits, with no human to select a model or authorize payment. Existing gateways (LiteLLM, OpenRouter, Portkey) optimize single calls for human teams and pass provider keys through without observing outcomes. Ainfera Routing targets the agent regime directly.

Contributions. (i) A constrained routing objective suited to agents' hard limits; (ii) a quality model separating a public prior from a proprietary, compounding empirical term; (iii) a mandatory outcome-capture loop that makes the empirical term trainable; (iv) a defensibility analysis showing routing-outcome data is the only durable advantage once payment and identity commoditize.

Scope: this report specifies Layer L2 (Routing) of the five-layer exchange. Runtime stages are denoted S1–S6 and substrate tiers T1–T7, distinct from product layers L1–L5.

## 2. Problem formulation

For a request x from agent a, select the highest-quality eligible model under hard constraints:

$$
\begin{aligned}
r(x, a) = \arg\max_{m \,\in\, M_{\text{allowed}}(x, a)} \; & Q(m, x, a) \\
\text{s.t.} \quad & \sum \text{cost}(\text{workflow}) + \hat{c}(m) \le \text{budget\_envelope}(a) \\
& \hat{l}(m, x) \le \text{latency\_target}(x)
\end{aligned}
$$

The constrained form is preferred to scalarization (w₁ŝ − w₂c − w₃l), which mixes incommensurable units, cannot reach non-convex regions of the cost–quality frontier, and hides normalization in its weights. M_allowed is produced by a hard compliance veto applied before scoring.

## 3. Quality model

$$
\begin{aligned}
Q(m, x, a) ={} & q_{\text{prior}}(m, \text{task}) && \text{(public seed)} \\
{}+{} & q_{\text{empirical}}(m, x, a) && \text{(proprietary, compounding)} \\
{}+{} & \beta \cdot \sigma(m, x) + \gamma \cdot \text{consistency}(m, w) \\
{}-{} & w_h \cdot h(m, t) - w_r \cdot \rho(m, x)
\end{aligned}
$$

| Term | Source | Class |
|---|---|---|
| q_prior | Public intelligence-vs-cost frontier + benchmarks | Commodity |
| q_empirical | Ainfera's (query, model, outcome) records | Moat |
| σ, consistency | Predictor confidence; workflow coherence | Learning / agent-native |
| h, ρ | Live provider health; residual risk | Operational / safety |

Predictor. q is a first-class subsystem, not a heuristic; ATS/AAMC are post-hoc and cannot serve as pre-generation inputs. Cold start uses a static prior table seeded from the public frontier. Once trained, q_empirical is a lightweight cross-attention model over query×model embeddings, predicting quality and cost jointly and generalizing to new models. On the hot path it must run as a distilled model or a cached lookup, never a full forward pass.

## 4. System architecture

The control plane (policy compilation, learning, evaluation) is off the hot path; the data plane targets sub-30 ms.

| Stage | Function |
|---|---|
| S1 | Cache check (exact-match default; semantic opt-in) |
| S2 | Candidate set · compliance veto · budget gate |
| S3 | Score Q(m,x,a) |
| S4 | Dispatch to selected provider |
| S5 | Monitor · fallback (re-veto each hop; local model terminal) |
| S6 | Emit audit · outcome · reward records (async) |

Settlement is multi-protocol (x402 primary, AP2, Stripe MPP); identity supports HTTP-header schemes (Web Bot Auth, TAP) alongside JWS Agent Cards. Budgets enforce per-call, rolling-window, and whole-workflow ceilings.

## 5. Learning

| Phase | Mechanism |
|---|---|
| v0 (launch) | Deterministic rule-floor; no learning (insufficient traffic) |
| v1 (post-traffic) | Contextual bandit (LinUCB); rule-floor as low-confidence fallback |
| v2 (DGX) | Learned policy (RL/GNN) distilled to an interpretable hot-path rule |

Reward R = task_success − λ_c·cost − λ_l·latency_overrun, where task_success composes retry signal, success callback, and sampled audit. Decisions are deterministically replayable from (policy_version, x, seed). Policies are versioned and promoted via offline replay → canary → metric gates → rollout.

## 6. Defensibility

Payment, identity, and transport for agents are converging into open commodities (x402, AP2, Visa TAP, Stripe MPP, Web Bot Auth). They are table stakes, and the same open rails permit agents to pay providers directly (disintermediation). The only durable advantage is q_empirical, which compounds with traffic and cannot be reproduced by copying standards. It is zero at launch; the system therefore claims to be the only architecture that can compound a routing-quality advantage, not to hold that advantage already. Day-one value over direct access is the constrained decision across the live frontier.

Constraints (architectural, not optional): (1) no bring-your-own-key, because it blinds the outcome loop; (2) every record carries an outcome label; (3) specs and SDKs open, q_empirical and the dataset closed; (4) external claims describe trajectory, not present advantage.

## 7. Implementation

Outcome capture is mandatory from the first commit. Every routed call, including v0 where nothing learns, records the full §16 schema to the L4 audit chain. The query embedding (the trainable signal for q_empirical) is written to a durable feature store, and its hash and ref are anchored in the chain. The data pipeline precedes the data; deferring it forfeits irreplaceable records and prevents q_empirical from ever training.

| Phase | Scope |
|---|---|
| v0 | q_prior + health + risk · constrained objective · exact cache · fallback · drain-proof budgets · deterministic replay · outcome pipeline. Rule-floor. Payment sandboxed pre-SG. |
| v1 | q_empirical, exploration, agent affinity, workflow consistency. |
| v2 | Cross-attention predictor, zero-shot model onboarding, distillation to hot-path rule. |

## Notes

Reviewed across three passes: LLM-routing literature (RouteLLM, BaRP, cross-attention routing, SCORE), AI-gateway competitive landscape, and the 2026 agent-payments landscape. Resolved decisions: constrained objective; LinUCB; exact-match cache; composite reward; public-frontier cold-start prior. Phase 1 complete (AIN-207); implementation under AIN-188.
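
The constrained objective of §2 can be illustrated with a minimal sketch. It assumes the compliance veto has already produced M_allowed and that Q, ĉ and l̂ arrive precomputed per candidate; the `Candidate` type, the `route` function, the cost tie-break, and the example numbers are all hypothetical and not part of the build contract.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One model that already passed the compliance veto (a member of M_allowed)."""
    name: str
    quality: float         # Q(m, x, a), assumed precomputed by the scorer
    est_cost_usd: float    # c-hat(m)
    est_latency_ms: float  # l-hat(m, x)


def route(
    candidates: Sequence[Candidate],
    spent_usd: float,
    budget_envelope_usd: float,
    latency_target_ms: float,
) -> Optional[Candidate]:
    """Return the highest-quality candidate that satisfies both hard constraints."""
    feasible = [
        c for c in candidates
        if spent_usd + c.est_cost_usd <= budget_envelope_usd
        and c.est_latency_ms <= latency_target_ms
    ]
    if not feasible:
        return None  # what happens next (e.g. fallback) is left to the caller
    # Tie-break on lower cost: an illustrative choice, not a documented rule.
    return max(feasible, key=lambda c: (c.quality, -c.est_cost_usd))


models = [
    Candidate("large", quality=0.92, est_cost_usd=0.040, est_latency_ms=900.0),
    Candidate("medium", quality=0.85, est_cost_usd=0.010, est_latency_ms=400.0),
    Candidate("small", quality=0.70, est_cost_usd=0.001, est_latency_ms=120.0),
]
choice = route(models, spent_usd=0.08, budget_envelope_usd=0.10, latency_target_ms=500.0)
print(choice.name if choice else None)  # medium
```

Here "large" is excluded because it would exceed both the budget envelope and the latency target, so the best remaining model wins. Unlike a weighted sum, no choice of weights can let a high quality score buy back a violated limit.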

## §16 Outcome-Capture Schema: LOCKED 2026-05-21 (one-shot immutable decision)

The audit chain is append-only and hash-chained, so the §16 record schema cannot be migrated after capture. The schema is therefore locked before real traffic resumes: the Manwe pipe is currently dead, which leaves a near-zero-loss window for the change. The schema is grounded in 2026 routing and intent-classification SOTA (ModernBERT semantic router, LLMRank, vLLM semantic router).
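
To make the immutability argument concrete, the following is a minimal sketch of a generic append-only hash chain. It is not the L4 chain format, which this document does not specify; `GENESIS`, `entry_hash`, `append`, `verify`, the JSON canonicalization, and the use of SHA-256 are all assumptions for illustration.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # hypothetical anchor value for the first entry


def entry_hash(prev_hash: str, record: Dict[str, Any]) -> str:
    """Hash the previous entry's hash together with a canonical form of the record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append(chain: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    """Append a record, linking it to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def verify(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every link; any edited record breaks the chain from that point on."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


chain: List[Dict[str, Any]] = []
append(chain, {"request_id": "r1", "task_type": "code"})
append(chain, {"request_id": "r2", "task_type": "chat"})
print(verify(chain))  # True

# Simulate a schema "migration" that rewrites a field in an already-captured record.
chain[0]["record"]["task_type"] = "general"
print(verify(chain))  # False
```

Because each hash covers the record contents, relabeling or restructuring historical records invalidates verification, which is why the schema has to be right before capture starts.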

Full §16 record (every routed call writes this to L4)

{
  request_id, agent_id,
  task_type,            # reasoning|code|extraction|chat|tool_use|embed|general
  task_type_source,     # caller | classifier | default
  query_embedding_ref,  # id into durable feature store — the trainable signal for q_empirical (§3 cross-attention)
  query_embedding_hash, # sha256 of the embedding vector — immutable integrity anchor in the chain
  embedding_model,      # e.g. nomic-embed-text@v1.5 — vectors comparable; never silently re-embed
  model, provider,
  candidate_set[], vetoed[],
  policy_version,       # "{policy_name}@{semver}+{ruleset_hash[:8]}"  e.g. balanced@1.0.0+a3f9c2e1
  cost_usd, latency_ms, tokens_in, tokens_out,
  outcome,              # success signal (retry/callback/sampled-audit) — may be deferred-null at v0
  cache_hit,
  cell,                 # derived: (task_type × model × constraint_band) — coarse rollup only; real granularity comes from query_embedding
  traffic_origin,       # fleet | external | test — separates dogfood self-traffic from genuine signal (Decision 4)
  fleet_agent           # nullable — which fleet agent, when traffic_origin=fleet (per-agent dogfood analysis)
}

Decision 1 — task_type = caller-supplied with classifier fallback (HYBRID)

task_type = caller_supplied ?? classifier_inference(prompt) ?? "general"

Taxonomy lock (coarse — never over-classify)

reasoning · code · extraction · chat · tool_use · embed · general (7 values). Coarse labels can be split later, but records captured under fine labels can never be re-merged cleanly. Existing routers (vLLM semantic router, local-model router) indicate that a coarse taxonomy is sufficient.
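
The fallback order of Decision 1 and the seven-value taxonomy can be sketched as below. The function name `resolve_task_type`, the `toy_classifier` stand-in, and the choice to treat an out-of-taxonomy caller value as missing are assumptions; the document does not specify the classifier or how invalid caller labels are handled.

```python
from typing import Callable, Optional, Tuple

TAXONOMY = frozenset(
    {"reasoning", "code", "extraction", "chat", "tool_use", "embed", "general"}
)


def resolve_task_type(
    caller_supplied: Optional[str],
    prompt: str,
    classifier: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return (task_type, task_type_source): caller, then classifier, then 'general'."""
    # Assumption: a caller value outside the locked taxonomy is treated as absent.
    if caller_supplied in TAXONOMY:
        return caller_supplied, "caller"
    inferred = classifier(prompt)
    if inferred in TAXONOMY:
        return inferred, "classifier"
    return "general", "default"


def toy_classifier(prompt: str) -> Optional[str]:
    # Stand-in only; a real deployment would call a trained intent classifier.
    return "code" if "def " in prompt else None


print(resolve_task_type("extraction", "pull the dates", toy_classifier))
# ('extraction', 'caller')
print(resolve_task_type(None, "def add(a, b): ...", toy_classifier))
# ('code', 'classifier')
print(resolve_task_type("poetry", "write a sonnet", toy_classifier))
# ('general', 'default')
```

Returning the source alongside the label fills the `task_type_source` field of the record, so caller-asserted labels can later be separated from inferred ones.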

Decision 2 — policy_version = immutable version string at decision time

policy_version = "{policy_name}@{semver}+{ruleset_hash[:8]}"
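
A minimal sketch of building and validating this string follows. The helper names, the use of SHA-256 for `ruleset_hash`, and the character set allowed in `policy_name` are assumptions; only the overall `name@semver+hash8` shape comes from the format above.

```python
import hashlib
import re
from typing import Dict

# Assumed grammar: lowercase name, three-part semver, 8 lowercase hex characters.
_POLICY_VERSION_RE = re.compile(
    r"^(?P<name>[a-z0-9_-]+)@(?P<semver>\d+\.\d+\.\d+)\+(?P<hash>[0-9a-f]{8})$"
)


def make_policy_version(policy_name: str, semver: str, ruleset: bytes) -> str:
    """Build the version string from the serialized ruleset in force at decision time."""
    ruleset_hash = hashlib.sha256(ruleset).hexdigest()  # hash function is an assumption
    return f"{policy_name}@{semver}+{ruleset_hash[:8]}"


def parse_policy_version(value: str) -> Dict[str, str]:
    """Split a version string into its parts, rejecting anything malformed."""
    match = _POLICY_VERSION_RE.match(value)
    if match is None:
        raise ValueError(f"malformed policy_version: {value!r}")
    return match.groupdict()


print(parse_policy_version("balanced@1.0.0+a3f9c2e1"))
# {'name': 'balanced', 'semver': '1.0.0', 'hash': 'a3f9c2e1'}
```

Deriving the suffix from the ruleset bytes means two rulesets published under the same semver still produce different version strings, which keeps replay from (policy_version, x, seed) unambiguous.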

Decision 3 — query embedding capture (the moat's trainable signal)

Caught pre-traffic: the original §16 lock captured no query representation. That omission would have made q_empirical (the §3 cross-attention predictor over query×model embeddings) untrainable from history, because the chain has no backfill. It was fixed inside the near-zero-loss window.
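
The capture step can be sketched as follows: the vector goes to a feature store, while only its reference and hash enter the chain record. The in-memory `dict` standing in for the durable feature store, the little-endian float32 serialization, the UUID reference, and the function names are all assumptions for illustration.

```python
import hashlib
import struct
import uuid
from typing import Dict, Sequence


def embedding_bytes(vector: Sequence[float]) -> bytes:
    """Serialize as little-endian float32 so the hash is stable across hosts (assumed format)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def capture_embedding(
    store: Dict[str, bytes], vector: Sequence[float], embedding_model: str
) -> Dict[str, str]:
    """Write the vector to the store and return the three fields for the chain record."""
    payload = embedding_bytes(vector)
    ref = uuid.uuid4().hex  # hypothetical id scheme for the feature store
    store[ref] = payload
    return {
        "query_embedding_ref": ref,
        "query_embedding_hash": hashlib.sha256(payload).hexdigest(),
        "embedding_model": embedding_model,
    }


def verify_embedding(store: Dict[str, bytes], fields: Dict[str, str]) -> bool:
    """Check that the stored vector still matches the hash anchored in the chain."""
    payload = store.get(fields["query_embedding_ref"])
    if payload is None:
        return False
    return hashlib.sha256(payload).hexdigest() == fields["query_embedding_hash"]


store: Dict[str, bytes] = {}
fields = capture_embedding(store, [0.12, -0.5, 0.33], "nomic-embed-text@v1.5")
print(verify_embedding(store, fields))  # True

# A silent re-embed or overwrite in the store is detectable against the chain.
store[fields["query_embedding_ref"]] = embedding_bytes([0.0, 0.0, 0.0])
print(verify_embedding(store, fields))  # False
```

Keeping the vector outside the chain keeps records small, while the anchored hash and `embedding_model` make it possible to detect a vector that was altered or produced by a different embedding model.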