All client-facing endpoints are synchronous HTTP. For fast operations like pod status, health, and cost queries, this is straightforward. For long operations like image generation and story generation, the gateway holds the connection open and streams the response using Server-Sent Events where the upstream supports streaming, and returns a complete response body where it does not. ComfyUI does not support streaming, so image generation responses are blocking with a configured timeout.
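A minimal sketch of the two response modes in a Spring WebFlux controller. The `GenerationService` facade, the endpoint paths, and the five-minute timeout are illustrative assumptions, not the actual implementation:

```java
import java.time.Duration;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

interface GenerationService {                  // hypothetical upstream facade
    Flux<String> streamTokens(String prompt);  // Ollama supports token streaming
    Mono<byte[]> generateImage(String prompt); // ComfyUI returns one complete result
}

@RestController
class GenerationController {

    private final GenerationService service;

    GenerationController(GenerationService service) {
        this.service = service;
    }

    // Streaming upstream: relay tokens to the client as Server-Sent Events
    // over the held connection.
    @GetMapping(path = "/story", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    Flux<String> story(@RequestParam String prompt) {
        return service.streamTokens(prompt);
    }

    // Non-streaming upstream (ComfyUI): block until the full image arrives
    // or the configured timeout fires.
    @GetMapping(path = "/image", produces = MediaType.IMAGE_PNG_VALUE)
    Mono<byte[]> image(@RequestParam String prompt) {
        return service.generateImage(prompt)
                .timeout(Duration.ofMinutes(5)); // illustrative timeout value
    }
}
```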
When a request arrives and the pod is not running, the GPU Lifecycle Manager starts the pod and the Request Queue holds the connection. The queue polls pod status every N seconds. Once status transitions to READY, queued requests are forwarded in arrival order. Requests are not parallelized during drain to avoid overwhelming the pod on initial load.
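A sketch of the hold-and-drain behavior, assuming a simple blocking queue; `PodClient`, `QueuedRequest`, and the two-state `PodStatus` are hypothetical stand-ins for the real types:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

enum PodStatus { STARTING, READY }                // simplified status model

interface PodClient {                             // hypothetical pod facade
    PodStatus status();                           // one status poll
    void forward(QueuedRequest request);          // synchronous forward to the pod
}

record QueuedRequest(String path, byte[] body) {} // hypothetical held request

class RequestQueueDrainer {

    private final BlockingQueue<QueuedRequest> queue = new LinkedBlockingQueue<>();
    private final PodClient pod;
    private final long pollIntervalMillis;        // the configured "N seconds"

    RequestQueueDrainer(PodClient pod, long pollIntervalMillis) {
        this.pod = pod;
        this.pollIntervalMillis = pollIntervalMillis;
    }

    void enqueue(QueuedRequest request) {
        queue.add(request);
    }

    // Poll until the pod reports READY, then forward held requests one at a
    // time in arrival (FIFO) order -- deliberately no parallel drain.
    void awaitReadyAndDrain() throws InterruptedException {
        while (pod.status() != PodStatus.READY) {
            Thread.sleep(pollIntervalMillis);
        }
        QueuedRequest next;
        while ((next = queue.poll()) != null) {
            pod.forward(next); // each request completes before the next is sent
        }
    }
}
```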
OpenWebUI sends standard Ollama API calls to the gateway. The gateway intercepts these, manages pod lifecycle transparently, and forwards to real Ollama on the pod. OpenWebUI has no awareness of the GPU lifecycle layer and requires zero modification beyond the base URL setting.
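Conceptually, the interception is a catch-all pass-through in front of Ollama's API paths. The sketch below assumes a `PodLifecycle` facade whose `ensureRunning()` completes once the pod is READY; the names and routing details are illustrative:

```java
import org.springframework.http.server.reactive.ServerHttpRequest;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

interface PodLifecycle {               // hypothetical lifecycle facade
    Mono<Void> ensureRunning();        // starts the pod if stopped, completes on READY
    String podBaseUrl();               // e.g. the pod's Ollama address
}

@RestController
class OllamaProxyController {

    private final WebClient client = WebClient.create();
    private final PodLifecycle lifecycle;

    OllamaProxyController(PodLifecycle lifecycle) {
        this.lifecycle = lifecycle;
    }

    // Catch-all for Ollama API paths: make sure the pod is up, then forward
    // the request verbatim to the real Ollama instance and relay the reply.
    @RequestMapping("/api/**")
    Mono<String> proxy(ServerHttpRequest request,
                       @RequestBody(required = false) String body) {
        WebClient.RequestBodySpec base = client.method(request.getMethod())
                .uri(lifecycle.podBaseUrl() + request.getPath().value());
        WebClient.RequestHeadersSpec<?> call = (body == null) ? base : base.bodyValue(body);
        return lifecycle.ensureRunning()
                .then(call.retrieve().bodyToMono(String.class));
    }
}
```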
Telegram sends webhook events to a configured HTTPS endpoint on the gateway. The bot service passes the message text to the local Ollama intent classifier, which returns a structured intent and any extracted parameters. The gateway executes the resolved intent and replies via the Telegram Bot API. Long-running operations receive an immediate acknowledgement followed by a second reply when the result is ready. This two-message pattern avoids Telegram webhook timeouts.
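A sketch of the two-message pattern: the webhook handler replies with an acknowledgement right away and pushes the real result later via `sendMessage`. The `IntentService` and the simplified `TelegramUpdate` payload are assumptions; `sendMessage` with `chat_id` and `text` is the standard Bot API call:

```java
import java.util.Map;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

interface IntentService {                          // hypothetical classifier/executor
    Mono<String> classifyAndExecute(String messageText);
}

record TelegramUpdate(long chatId, String text) {} // simplified update payload

@RestController
class TelegramWebhookController {

    private final WebClient telegram;              // pre-configured with the bot-token base URL
    private final IntentService intents;

    TelegramWebhookController(WebClient telegram, IntentService intents) {
        this.telegram = telegram;
        this.intents = intents;
    }

    // Acknowledge immediately, run the intent in the background, and send
    // the result as a second message. Returning fast keeps Telegram from
    // retrying the webhook on timeout.
    @PostMapping("/telegram/webhook")
    Mono<Void> onUpdate(@RequestBody TelegramUpdate update) {
        long chatId = update.chatId();
        intents.classifyAndExecute(update.text())          // long-running work
                .flatMap(result -> send(chatId, result))   // second message: the result
                .subscribeOn(Schedulers.boundedElastic())
                .subscribe();                              // fire and forget
        return send(chatId, "Working on it...");           // first message: immediate ack
    }

    private Mono<Void> send(long chatId, String text) {
        return telegram.post().uri("/sendMessage")
                .bodyValue(Map.of("chat_id", chatId, "text", text))
                .retrieve().bodyToMono(Void.class);
    }
}
```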
Both Vast.ai and RunPod are called over HTTPS REST using Bearer token auth. Neither provider's SDK is used. All provider calls use Spring WebClient with configured timeouts and retry policies.
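For illustration, a provider client built this way might look like the following; the base URL, endpoint path, timeout, and retry budget shown are placeholder values, not the project's configured policies:

```java
import java.time.Duration;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;

class ProviderClient {

    private final WebClient client;

    ProviderClient(String baseUrl, String apiKey) {
        this.client = WebClient.builder()
                .baseUrl(baseUrl)                              // provider REST endpoint
                .defaultHeader("Authorization", "Bearer " + apiKey)
                .build();
    }

    // One provider call: bounded response time plus a small retry budget
    // with exponential backoff for transient failures.
    Mono<String> podStatus(String podId) {
        return client.get().uri("/pods/{id}", podId)           // illustrative path
                .retrieve()
                .bodyToMono(String.class)
                .timeout(Duration.ofSeconds(10))               // illustrative timeout
                .retryWhen(Retry.backoff(3, Duration.ofSeconds(1)));
    }
}
```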
All services communicate via direct Java method calls. There is no internal message bus, no internal HTTP, and no shared database in phase one. This keeps the design simple and avoids distributed systems complexity not warranted at current scale.