Cold Start Flow

A request arrives while the GPU pod is completely off. The gateway validates the JWT, the requesting proxy service detects that the pod is stopped, and it asks the GPU Lifecycle Manager to start it. The request is held in the Request Queue while the pod warms up. The GPU Lifecycle Manager polls the requested model server on the pod until it responds healthy, then transitions the pod state to READY and drains the queue. The held request is dequeued, forwarded to the model server on the pod, and the response is streamed back to the client. The client experiences extra latency but receives a complete response with no error.

The diagrams below use Ollama as an example requesting service:

sequenceDiagram
    autonumber
    participant Client
    participant AuthFilter as Auth Filter
    participant OllamaProxy as Ollama Proxy Service
    participant GpuManager as GPU Lifecycle Manager
    participant Queue as Request Queue
    participant Provider as Provider Port
    participant GpuOllama as Ollama (GPU Pod)

    Client->>AuthFilter: POST /api/v1/chat {bearer, messages, model}
    AuthFilter->>AuthFilter: validate JWT signature and expiry
    AuthFilter-->>OllamaProxy: authorised, user context attached

    OllamaProxy->>GpuManager: getStatus()
    GpuManager-->>OllamaProxy: STOPPED

    OllamaProxy->>GpuManager: requestStart()
    GpuManager->>Provider: start()
    Provider-->>GpuManager: pod accepted, STARTING
    GpuManager-->>OllamaProxy: pod is STARTING

    OllamaProxy->>Queue: enqueue(request, clientConnection)
    Queue-->>OllamaProxy: request held

    par Background warmup polling
        loop every 5 seconds until READY or 3 minute timeout
            GpuManager->>GpuOllama: GET /health
            alt Ollama not yet ready
                GpuOllama-->>GpuManager: no response
            else Ollama ready
                GpuOllama-->>GpuManager: 200 healthy
                GpuManager->>GpuManager: transition WARMING to READY
                GpuManager->>Queue: notifyReady()
            end
        end
    end

    Queue->>OllamaProxy: dequeue(request, clientConnection)
    OllamaProxy->>GpuOllama: POST /api/chat {messages, model}
    GpuOllama-->>OllamaProxy: stream response tokens
    OllamaProxy-->>Client: stream response tokens
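
The proxy-side logic for this flow could look roughly like the TypeScript sketch below. This is illustrative, not the actual implementation: the language, interface names, constructor wiring, and the assumption that enqueue() resolves once the queue is drained are all mine; only the operation names (getStatus, requestStart, enqueue), the pod states, and the /api/chat forwarding come from the diagram.

// Minimal sketch of the proxy-side cold start path (illustrative, not the actual implementation).
type PodState = 'STOPPED' | 'STARTING' | 'WARMING' | 'READY' | 'STOPPING';

interface GpuLifecycleManager {
  getStatus(): Promise<PodState>;
  requestStart(): Promise<void>;
}

interface RequestQueue {
  // Assumed behaviour: resolves once the pod is READY and the held request is dequeued.
  enqueue(request: ChatRequest, client: ClientConnection): Promise<void>;
}

interface ChatRequest { model: string; messages: unknown[]; }
interface ClientConnection { write(chunk: string): void; end(): void; }

class OllamaProxyService {
  constructor(
    private gpuManager: GpuLifecycleManager,
    private queue: RequestQueue,
    private ollamaBaseUrl: string, // GPU pod endpoint, assumed to come from config
  ) {}

  async handleChat(request: ChatRequest, client: ClientConnection): Promise<void> {
    if ((await this.gpuManager.getStatus()) !== 'READY') {
      // Cold path: start the pod and hold the request until notifyReady() drains the queue.
      await this.gpuManager.requestStart();
      await this.queue.enqueue(request, client);
    }

    // Pod is READY: forward to Ollama and stream the response tokens back to the client.
    const response = await fetch(`${this.ollamaBaseUrl}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: request.model, messages: request.messages, stream: true }),
    });

    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      client.write(decoder.decode(value, { stream: true }));
    }
    client.end();
  }
}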

Warm Request Flow

A request arrives while the GPU pod is already running and healthy. The Ollama Proxy confirms the pod is READY and forwards the request directly to Ollama with no queuing or warmup delay. The idle timer is reset by this request.

sequenceDiagram
    autonumber
    participant Client
    participant AuthFilter as Auth Filter
    participant OllamaProxy as Ollama Proxy Service
    participant GpuManager as GPU Lifecycle Manager
    participant GpuOllama as Ollama (GPU Pod)

    Client->>AuthFilter: POST /api/v1/chat {bearer, messages, model}
    AuthFilter->>AuthFilter: validate JWT signature and expiry
    AuthFilter-->>OllamaProxy: authorised, user context attached

    OllamaProxy->>GpuManager: getStatus()
    GpuManager-->>OllamaProxy: READY

    OllamaProxy->>GpuManager: resetIdleTimer()
    GpuManager-->>OllamaProxy: timer reset

    OllamaProxy->>GpuOllama: POST /api/chat {messages, model}
    GpuOllama-->>OllamaProxy: stream response tokens
    OllamaProxy-->>Client: stream response tokens
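
On the warm path the only lifecycle interaction is the idle-timer reset. A minimal sketch of how the GPU Lifecycle Manager might track activity, assuming it simply keeps a last-activity timestamp (the class and field names are illustrative; the 15-minute default matches the configured idle threshold described in the next flow):

// Illustrative sketch of idle-activity tracking inside the GPU Lifecycle Manager.
class IdleTracker {
  private lastActivityMs = Date.now();

  // Called by the proxy on every warm request (resetIdleTimer in the diagram).
  resetIdleTimer(): void {
    this.lastActivityMs = Date.now();
  }

  // Called by the Idle Timer Checker (checkIdleThreshold in the diagram).
  idleThresholdExceeded(idleThresholdMs = 15 * 60 * 1000): boolean {
    return Date.now() - this.lastActivityMs >= idleThresholdMs;
  }
}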

Idle Shutdown Flow

No requests have reached the gateway for longer than the configured idle threshold (default 15 minutes). The Idle Timer Checker background process fires, and the GPU Lifecycle Manager confirms there are no in-flight requests. The pod is stopped via the Provider Port and the Cost Tracker records the session end. The pod transitions to STOPPED, and no compute charges accrue until the next request arrives.

sequenceDiagram
    autonumber
    participant IdleChecker as Idle Timer Checker
    participant GpuManager as GPU Lifecycle Manager
    participant Queue as Request Queue
    participant Provider as Provider Port
    participant CostTracker as Cost Tracker

    IdleChecker->>GpuManager: checkIdleThreshold()
    GpuManager->>GpuManager: compare last activity timestamp to now
    GpuManager-->>IdleChecker: idle threshold exceeded

    IdleChecker->>GpuManager: requestShutdown()
    GpuManager->>Queue: getInFlightCount()
    Queue-->>GpuManager: 0 requests in flight

    GpuManager->>Provider: stop()
    Provider-->>GpuManager: pod STOPPING
    GpuManager->>GpuManager: transition READY to STOPPING

    Provider-->>GpuManager: pod confirmed STOPPED
    GpuManager->>GpuManager: transition STOPPING to STOPPED

    GpuManager->>CostTracker: recordSessionEnd(provider, duration)
    CostTracker-->>GpuManager: session recorded
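
The checker itself can be a small background loop. In the sketch below the 60-second check period is an assumption not stated in the flow; only the idle-threshold comparison and the requestShutdown() call come from the diagram.

// Illustrative sketch of the Idle Timer Checker background loop.
interface GpuLifecycleManager {
  idleThresholdExceeded(): boolean;
  requestShutdown(): Promise<void>;
}

function startIdleChecker(
  gpuManager: GpuLifecycleManager,
  checkEveryMs = 60_000, // assumed check period
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    if (gpuManager.idleThresholdExceeded()) {
      // requestShutdown() verifies there are no in-flight requests before
      // stopping the pod (see the in-flight protection flow below).
      void gpuManager.requestShutdown();
    }
  }, checkEveryMs);
}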

Idle Shutdown with In-Flight Request Protection

A shutdown is triggered by the idle timer, but a request arrives concurrently just before the shutdown completes. The GPU Lifecycle Manager detects the in-flight request and aborts the shutdown, keeping the pod running and resetting the idle timer.

sequenceDiagram
    autonumber
    participant IdleChecker as Idle Timer Checker
    participant GpuManager as GPU Lifecycle Manager
    participant Queue as Request Queue
    participant Provider as Provider Port
    participant Client
    participant OllamaProxy as Ollama Proxy Service

    IdleChecker->>GpuManager: checkIdleThreshold()
    GpuManager-->>IdleChecker: idle threshold exceeded
    IdleChecker->>GpuManager: requestShutdown()

    Note over Client, OllamaProxy: Request arrives concurrently during shutdown check
    Client->>OllamaProxy: POST /api/v1/chat {bearer, messages, model}
    OllamaProxy->>GpuManager: resetIdleTimer()
    OllamaProxy->>Queue: enqueue(request, clientConnection)

    GpuManager->>Queue: getInFlightCount()
    Queue-->>GpuManager: 1 request in flight

    GpuManager->>GpuManager: abort shutdown, reset idle timer
    GpuManager-->>IdleChecker: shutdown aborted, requests in flight

    Note over GpuManager: Pod remains running, request is processed normally
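
Both shutdown flows can be captured by a single requestShutdown() with an in-flight guard. The sketch below is illustrative: the state field, session timing, and constructor wiring are assumptions, while the getInFlightCount(), stop(), and recordSessionEnd() calls and the state names mirror the diagrams.

// Illustrative sketch of requestShutdown() with in-flight protection.
type PodState = 'STOPPED' | 'STARTING' | 'WARMING' | 'READY' | 'STOPPING';

interface RequestQueue { getInFlightCount(): number; }
interface ProviderPort { stop(): Promise<void>; }
interface CostTracker { recordSessionEnd(provider: string, durationMs: number): void; }

class GpuLifecycleManager {
  private state: PodState = 'READY';
  private sessionStartMs = Date.now();  // assumed to be set when the pod became READY
  private lastActivityMs = Date.now();

  constructor(
    private queue: RequestQueue,
    private provider: ProviderPort,
    private costTracker: CostTracker,
    private providerName: string,
  ) {}

  resetIdleTimer(): void {
    this.lastActivityMs = Date.now();
  }

  async requestShutdown(): Promise<void> {
    if (this.queue.getInFlightCount() > 0) {
      // A request arrived while the shutdown was being evaluated:
      // abort the shutdown, keep the pod running, and reset the idle timer.
      this.resetIdleTimer();
      return;
    }

    this.state = 'STOPPING';
    await this.provider.stop();  // Provider Port confirms the pod stopped
    this.state = 'STOPPED';

    // Record the session for cost tracking once the pod is confirmed stopped.
    this.costTracker.recordSessionEnd(this.providerName, Date.now() - this.sessionStartMs);
  }
}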

Pod Start Failure Flow

A request arrives and the GPU Lifecycle Manager attempts to start the pod via the Provider Port. All retry attempts are exhausted without a successful start. Queued requests are rejected with a 503 response and the pod transitions back to STOPPED. The failure is surfaced to the client with enough context to retry later.

sequenceDiagram
    autonumber
    participant Client
    participant AuthFilter as Auth Filter
    participant OllamaProxy as Ollama Proxy Service
    participant GpuManager as GPU Lifecycle Manager
    participant Queue as Request Queue
    participant Provider as Provider Port

    Client->>AuthFilter: POST /api/v1/chat {bearer, messages, model}
    AuthFilter-->>OllamaProxy: authorised

    OllamaProxy->>GpuManager: getStatus()
    GpuManager-->>OllamaProxy: STOPPED

    OllamaProxy->>GpuManager: requestStart()
    GpuManager->>Queue: enqueue(request, clientConnection)

    loop 3 attempts with exponential backoff
        GpuManager->>Provider: start()
        alt Provider API error or timeout
            Provider-->>GpuManager: error response
        end
    end

    GpuManager->>GpuManager: retries exhausted, transition to STOPPED
    GpuManager->>Queue: notifyStartFailure(reason)
    Queue->>Queue: reject all held requests

    Queue-->>Client: 503 {status: error, error: {code: POD_START_FAILED, message: pod could not be started, retryAfter: 60}}
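
The retry loop might look like the sketch below. The three attempts, the exponential backoff, and the POD_START_FAILED rejection come from the flow; the one-second base delay and the helper's shape are assumptions.

// Illustrative sketch of the pod start retry loop with exponential backoff.
interface ProviderPort { start(): Promise<void>; }
interface RequestQueue { notifyStartFailure(reason: string): void; }

async function startWithRetries(
  provider: ProviderPort,
  queue: RequestQueue,
  maxAttempts = 3,
  baseDelayMs = 1_000, // assumed base delay
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await provider.start();
      return true; // pod accepted the start request; warmup polling takes over from here
    } catch {
      if (attempt < maxAttempts) {
        // Exponential backoff: 1s, 2s, 4s, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }

  // Retries exhausted: transition back to STOPPED (not shown) and reject all
  // held requests with a 503 POD_START_FAILED response.
  queue.notifyStartFailure('POD_START_FAILED');
  return false;
}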

Warmup Timeout Flow

The pod starts successfully, but Ollama never becomes healthy within the 3-minute warmup window. The GPU Lifecycle Manager treats the warmup as failed, stops the pod, and rejects all queued requests.

sequenceDiagram
    autonumber
    participant GpuManager as GPU Lifecycle Manager
    participant Queue as Request Queue
    participant Provider as Provider Port
    participant GpuOllama as Ollama (GPU Pod)
    participant Client

    Note over GpuManager: Pod is in WARMING state, health polling in progress

    loop every 5 seconds for up to 3 minutes
        GpuManager->>GpuOllama: GET /health
        GpuOllama-->>GpuManager: no response
    end

    GpuManager->>GpuManager: warmup timeout exceeded
    GpuManager->>Provider: stop()
    Provider-->>GpuManager: pod STOPPED
    GpuManager->>GpuManager: transition WARMING to STOPPED

    GpuManager->>Queue: notifyWarmupTimeout()
    Queue-->>Client: 503 {status: error, error: {code: WARMUP_TIMEOUT, message: pod failed to become ready}}
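
The warmup polling described in the cold start and timeout flows could be implemented as a single loop with a deadline, as sketched below. The 5-second poll interval, the 3-minute window, and the notifyReady()/notifyWarmupTimeout() calls come from the flows; the function shape and health URL parameter are assumptions.

// Illustrative sketch of warmup health polling with its 3-minute deadline.
interface ProviderPort { stop(): Promise<void>; }
interface RequestQueue { notifyReady(): void; notifyWarmupTimeout(): void; }

async function pollUntilHealthy(
  healthUrl: string,          // the Ollama health endpoint on the GPU pod
  queue: RequestQueue,
  provider: ProviderPort,
  pollEveryMs = 5_000,        // every 5 seconds, per the flow
  timeoutMs = 3 * 60 * 1000,  // 3-minute warmup window, per the flow
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;

  while (Date.now() < deadline) {
    try {
      const res = await fetch(healthUrl);
      if (res.ok) {
        // Ollama answered healthy: transition WARMING to READY (not shown) and drain the queue.
        queue.notifyReady();
        return true;
      }
    } catch {
      // No response yet: the pod is still warming, keep polling.
    }
    await new Promise((resolve) => setTimeout(resolve, pollEveryMs));
  }

  // Warmup window exceeded: stop the pod and reject all held requests with WARMUP_TIMEOUT.
  await provider.stop();
  queue.notifyWarmupTimeout();
  return false;
}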