A request arrives while the GPU pod is completely off. The gateway validates the JWT, the requesting proxy service detects that the pod is stopped, and it asks the GPU Lifecycle Manager to start it. The request is held in the Request Queue while the pod warms up. The GPU Lifecycle Manager polls the requested model's health endpoint on the pod until it responds healthy, then transitions the state to READY and drains the queue. The held request is dequeued back to the proxy service, forwarded to the model, and the response is streamed back to the client. The client experiences cold-start latency but receives a complete response with no error.
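The scenarios below move the pod through five lifecycle states: STOPPED, STARTING, WARMING, READY, and STOPPING. A minimal TypeScript sketch of that state machine, assuming the transitions shown in the diagrams are the only legal ones (the `transition` helper and the table itself are hypothetical):

```typescript
// Lifecycle states come from the diagrams; the transition table is an assumption.
type PodState = "STOPPED" | "STARTING" | "WARMING" | "READY" | "STOPPING";

const allowed: Record<PodState, PodState[]> = {
  STOPPED: ["STARTING"],            // requestStart() accepted
  STARTING: ["WARMING", "STOPPED"], // pod accepted, or all start retries failed
  WARMING: ["READY", "STOPPED"],    // health check passed, or warmup timed out
  READY: ["STOPPING"],              // idle threshold exceeded, nothing in flight
  STOPPING: ["STOPPED"],            // provider confirmed the pod is stopped
};

function transition(from: PodState, to: PodState): PodState {
  if (!allowed[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```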
The diagrams below use Ollama as the example requester:
sequenceDiagram
autonumber
participant Client
participant AuthFilter as Auth Filter
participant OllamaProxy as Ollama Proxy Service
participant GpuManager as GPU Lifecycle Manager
participant Queue as Request Queue
participant Provider as Provider Port
participant GpuOllama as Ollama (GPU Pod)
Client->>AuthFilter: POST /api/v1/chat {bearer, messages, model}
AuthFilter->>AuthFilter: validate JWT signature and expiry
AuthFilter-->>OllamaProxy: authorised, user context attached
OllamaProxy->>GpuManager: getStatus()
GpuManager-->>OllamaProxy: STOPPED
OllamaProxy->>GpuManager: requestStart()
GpuManager->>Provider: start()
Provider-->>GpuManager: pod accepted, STARTING
GpuManager-->>OllamaProxy: pod is STARTING
OllamaProxy->>Queue: enqueue(request, clientConnection)
Queue-->>OllamaProxy: request held
par Background warmup polling
loop every 5 seconds until READY or 3-minute timeout
GpuManager->>GpuOllama: GET /health
alt Ollama not yet ready
GpuOllama-->>GpuManager: no response
else Ollama ready
GpuOllama-->>GpuManager: 200 healthy
GpuManager->>GpuManager: transition WARMING to READY
GpuManager->>Queue: notifyReady()
end
end
end
Queue->>OllamaProxy: dequeue(request, clientConnection)
OllamaProxy->>GpuOllama: POST /api/chat {messages, model}
GpuOllama-->>OllamaProxy: stream response tokens
OllamaProxy-->>Client: stream response tokens
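A minimal sketch of the proxy-side cold-start path, reusing the PodState type from the sketch above. The method names follow the diagram; the interfaces, request types, and promise-based shape are assumptions:

```typescript
// Hypothetical proxy-side cold-start path; method names follow the diagram.
type ChatRequest = { model: string; messages: unknown[] };
type ChatResponse = ReadableStream<Uint8Array>; // streamed tokens

interface GpuManager {
  getStatus(): Promise<PodState>;
  requestStart(): Promise<void>; // resolves once the provider reports STARTING
}

interface RequestQueue {
  // Resolves when the warmup poller calls notifyReady() and this request is
  // dequeued and served; the client connection stays open in the meantime.
  enqueue(req: ChatRequest): Promise<ChatResponse>;
}

async function handleChat(
  req: ChatRequest,
  mgr: GpuManager,
  queue: RequestQueue,
): Promise<ChatResponse> {
  if ((await mgr.getStatus()) === "STOPPED") {
    await mgr.requestStart(); // provider accepts; pod enters STARTING
  }
  return queue.enqueue(req); // held until READY, then forwarded to the model
}
```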
A request arrives when the GPU pod is already running and healthy. The Ollama Proxy confirms the pod is READY and forwards the request directly to Ollama without any queuing or warmup delay. The idle timer resets on this request.
sequenceDiagram
autonumber
participant Client
participant AuthFilter as Auth Filter
participant OllamaProxy as Ollama Proxy Service
participant GpuManager as GPU Lifecycle Manager
participant GpuOllama as Ollama (GPU Pod)
Client->>AuthFilter: POST /api/v1/chat {bearer, messages, model}
AuthFilter->>AuthFilter: validate JWT signature and expiry
AuthFilter-->>OllamaProxy: authorised, user context attached
OllamaProxy->>GpuManager: getStatus()
GpuManager-->>OllamaProxy: READY
OllamaProxy->>GpuManager: resetIdleTimer()
GpuManager-->>OllamaProxy: timer reset
OllamaProxy->>GpuOllama: POST /api/chat {messages, model}
GpuOllama-->>OllamaProxy: stream response tokens
OllamaProxy-->>Client: stream response tokens
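The resetIdleTimer() call is what keeps a busy pod alive: every served request pushes the shutdown deadline back. A minimal sketch of the timer, assuming a single last-activity timestamp (the IdleTimer class is hypothetical):

```typescript
// Minimal idle-timer sketch: every served request refreshes a single
// last-activity timestamp; the background checker compares it to the threshold.
class IdleTimer {
  private lastActivity = Date.now();

  constructor(private readonly thresholdMs = 15 * 60_000) {} // 15-minute default

  reset(): void {
    this.lastActivity = Date.now();
  }

  exceeded(now: number = Date.now()): boolean {
    return now - this.lastActivity > this.thresholdMs;
  }
}
```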
No requests have reached the gateway for longer than the configured idle threshold (default 15 minutes). The Idle Timer Checker background process fires, and the GPU Lifecycle Manager confirms there are no in-flight requests. The pod is stopped via the Provider Port, and the Cost Tracker records the session end. The pod transitions to STOPPED, and no compute charges accrue until the next request arrives.
sequenceDiagram
autonumber
participant IdleChecker as Idle Timer Checker
participant GpuManager as GPU Lifecycle Manager
participant Queue as Request Queue
participant Provider as Provider Port
participant CostTracker as Cost Tracker
IdleChecker->>GpuManager: checkIdleThreshold()
GpuManager->>GpuManager: compare last activity timestamp to now
GpuManager-->>IdleChecker: idle threshold exceeded
IdleChecker->>GpuManager: requestShutdown()
GpuManager->>Queue: getInFlightCount()
Queue-->>GpuManager: 0 requests in flight
GpuManager->>Provider: stop()
Provider-->>GpuManager: pod STOPPING
GpuManager->>GpuManager: transition READY to STOPPING
Provider-->>GpuManager: pod confirmed STOPPED
GpuManager->>GpuManager: transition STOPPING to STOPPED
GpuManager->>CostTracker: recordSessionEnd(provider, duration)
CostTracker-->>GpuManager: session recorded
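A sketch of one tick of that background process, reusing IdleTimer from the previous sketch. The call names follow the diagram; the ProviderPort and CostTracker shapes and the orchestration details are assumptions:

```typescript
// Hypothetical tick of the Idle Timer Checker; call names follow the diagram,
// the ProviderPort and CostTracker shapes are assumptions.
interface ProviderPort {
  readonly name: string;
  stop(): Promise<void>; // resolves once the provider confirms the pod stopped
}

interface CostTracker {
  recordSessionEnd(provider: string, durationMs: number): Promise<void>;
}

async function idleCheckTick(
  timer: IdleTimer,
  queue: { getInFlightCount(): number },
  provider: ProviderPort,
  costs: CostTracker,
  session: { startedAt: number },
  setState: (to: PodState) => void,
): Promise<void> {
  if (!timer.exceeded()) return;            // still within the idle threshold
  if (queue.getInFlightCount() > 0) return; // abort; see the next scenario
  setState("STOPPING");
  await provider.stop();                    // provider confirms the stop
  setState("STOPPED");
  await costs.recordSessionEnd(provider.name, Date.now() - session.startedAt);
}
```

Recording the session end only after the confirmed stop keeps the cost data aligned with what the provider actually billed for.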
A shutdown is triggered by the idle timer, but a request arrives just before the shutdown completes. The GPU Lifecycle Manager detects the in-flight request and aborts the shutdown, keeping the pod running and resetting the idle timer.
sequenceDiagram
autonumber
participant IdleChecker as Idle Timer Checker
participant GpuManager as GPU Lifecycle Manager
participant Queue as Request Queue
participant Provider as Provider Port
participant Client
participant OllamaProxy as Ollama Proxy Service
IdleChecker->>GpuManager: checkIdleThreshold()
GpuManager-->>IdleChecker: idle threshold exceeded
IdleChecker->>GpuManager: requestShutdown()
Note over Client, OllamaProxy: Request arrives concurrently during shutdown check
Client->>OllamaProxy: POST /api/v1/chat {bearer, messages, model}
OllamaProxy->>GpuManager: resetIdleTimer()
OllamaProxy->>Queue: enqueue(request, clientConnection)
GpuManager->>Queue: getInFlightCount()
Queue-->>GpuManager: 1 request in flight
GpuManager->>GpuManager: abort shutdown, reset idle timer
GpuManager-->>IdleChecker: shutdown aborted, requests in flight
Note over GpuManager: Pod remains running, request is processed normally
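The in-flight check only closes this race if nothing can enqueue between the check and the stop() call. A sketch of one way to guarantee that, using a small promise-chain lock that serialises shutdown attempts against enqueues (the Lock class and the atomicity scheme are assumptions; the diagram does not specify a mechanism):

```typescript
// A tiny promise-chain lock: callers run strictly one at a time, in order.
class Lock {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    this.tail = result.catch(() => undefined); // keep the chain alive on errors
    return result;
  }
}

// Shutdown and enqueue would both take the lock, so the in-flight count
// cannot change between the check and the provider stop() call.
async function requestShutdown(
  lock: Lock,
  queue: { getInFlightCount(): number },
  timer: IdleTimer,
  provider: ProviderPort,
): Promise<"stopped" | "aborted"> {
  return lock.run(async () => {
    if (queue.getInFlightCount() > 0) {
      timer.reset(); // a request raced in: keep the pod running
      return "aborted";
    }
    await provider.stop();
    return "stopped";
  });
}
```

Without some serialisation like this, a request enqueued in the gap between getInFlightCount() returning 0 and stop() taking effect would wait forever on a pod that never comes back.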
A request arrives and the GPU Lifecycle Manager attempts to start the pod via the Provider Port. All retry attempts are exhausted without a successful start. Queued requests are rejected with a 503 response and the pod transitions back to STOPPED. The failure is surfaced to the client with enough context to retry later.
sequenceDiagram
autonumber
participant Client
participant AuthFilter as Auth Filter
participant OllamaProxy as Ollama Proxy Service
participant GpuManager as GPU Lifecycle Manager
participant Queue as Request Queue
participant Provider as Provider Port
Client->>AuthFilter: POST /api/v1/chat {bearer, messages, model}
AuthFilter-->>OllamaProxy: authorised
OllamaProxy->>GpuManager: getStatus()
GpuManager-->>OllamaProxy: STOPPED
OllamaProxy->>GpuManager: requestStart()
OllamaProxy->>Queue: enqueue(request, clientConnection)
loop 3 attempts with exponential backoff
GpuManager->>Provider: start()
alt Provider API error or timeout
Provider-->>GpuManager: error response
end
end
GpuManager->>GpuManager: retries exhausted, transition to STOPPED
GpuManager->>Queue: notifyStartFailure(reason)
Queue->>Queue: reject all held requests
Queue-->>Client: 503 {status: error, error: {code: POD_START_FAILED, message: pod could not be started, retryAfter: 60}}
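A sketch of the retry loop, assuming a promise-based ProviderPort and a queue that can reject all held requests in one call; the attempt count and 503 payload mirror the diagram, while the 1-second backoff base is an assumption:

```typescript
// Hypothetical start-with-retry: the attempt count and 503 payload mirror the
// diagram; the backoff base is an assumption.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function startWithRetry(
  provider: { start(): Promise<void> },
  queue: { rejectAll(status: number, body: unknown): void },
  setState: (to: PodState) => void,
  attempts = 3,
  baseDelayMs = 1_000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      await provider.start(); // provider accepted; pod moves to STARTING
      return true;
    } catch {
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i); // 1s, 2s, 4s
    }
  }
  setState("STOPPED"); // retries exhausted
  queue.rejectAll(503, {
    status: "error",
    error: { code: "POD_START_FAILED", message: "pod could not be started", retryAfter: 60 },
  });
  return false;
}
```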
The pod starts successfully but Ollama never becomes healthy within the 3-minute warmup window. The GPU Lifecycle Manager considers the warmup failed, stops the pod, and rejects all queued requests.
sequenceDiagram
autonumber
participant GpuManager as GPU Lifecycle Manager
participant Queue as Request Queue
participant Provider as Provider Port
participant GpuOllama as Ollama (GPU Pod)
participant Client
Note over GpuManager: Pod is in WARMING state, health polling in progress
loop every 5 seconds for up to 3 minutes
GpuManager->>GpuOllama: GET /health
GpuOllama-->>GpuManager: no response
end
GpuManager->>GpuManager: warmup timeout exceeded
GpuManager->>Provider: stop()
Provider-->>GpuManager: pod STOPPED
GpuManager->>GpuManager: transition WARMING to STOPPED
GpuManager->>Queue: notifyWarmupTimeout()
Queue-->>Client: 503 {status: error, error: {code: WARMUP_TIMEOUT, message: pod failed to become ready}}
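A sketch of the health poller behind both warmup scenarios, assuming Node 18+ where fetch and AbortSignal.timeout are global; the 5-second interval and 3-minute window come from the diagrams:

```typescript
// Warmup poller: GET /health every 5 seconds until a 200 arrives or the
// 3-minute window closes. Returns true for READY, false for a timeout.
async function pollUntilHealthy(
  baseUrl: string,
  intervalMs = 5_000,
  timeoutMs = 180_000,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`${baseUrl}/health`, {
        signal: AbortSignal.timeout(intervalMs), // one probe never outlives its slot
      });
      if (res.ok) return true; // WARMING -> READY: notify the queue to drain
    } catch {
      // pod not answering yet; fall through to the next poll
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // WARMING -> STOPPED: reject held requests with WARMUP_TIMEOUT
}
```

The false branch is this scenario: the manager stops the pod and the queue answers every held client with the WARMUP_TIMEOUT payload.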