Fantasy Mode - Full Pipeline

A client submits an image and an optional prompt to the fantasy mode endpoint. The Fantasy Orchestrator confirms that the GPU pod is running, sends the image to the vision model in Ollama to generate a story, then sends that story as a generation prompt to ComfyUI to produce an accompanying illustration. Both results are assembled into a single response and returned to the client.

Unless a flow explicitly shows otherwise (the cold pod and stopped-pod scenarios below), the diagrams in this document assume the GPU pod is already in the READY state.

sequenceDiagram
    autonumber
    participant Client
    participant AuthFilter as Auth Filter
    participant Orchestrator as Fantasy Orchestrator
    participant GpuManager as GPU Lifecycle Manager
    participant GpuOllama as Ollama (GPU Pod)
    participant ComfyUI as ComfyUI (GPU Pod)

    Client->>AuthFilter: POST /api/v1/fantasy/generate {bearer, image_base64, prompt}
    AuthFilter->>AuthFilter: validate JWT
    AuthFilter-->>Orchestrator: authorised, user context attached

    Orchestrator->>GpuManager: getStatus()
    GpuManager-->>Orchestrator: READY
    Orchestrator->>GpuManager: resetIdleTimer()

    Orchestrator->>GpuOllama: POST /api/generate {model: llava, image_base64, prompt: read image and write a fantasy story}
    GpuOllama-->>Orchestrator: {story: generated story text}

    Orchestrator->>ComfyUI: POST /prompt {workflow: image_gen_workflow, prompt: story excerpt for illustration}
    ComfyUI-->>Orchestrator: {job_id: abc123}

    loop poll every 3 seconds until complete
        Orchestrator->>ComfyUI: GET /history/{job_id}
        alt job still running
            ComfyUI-->>Orchestrator: {status: running}
        else job complete
            ComfyUI-->>Orchestrator: {status: complete, filename: output.png}
        end
    end

    Orchestrator->>ComfyUI: GET /view?filename=output.png
    ComfyUI-->>Orchestrator: image bytes

    Orchestrator-->>Client: 200 {status: success, data: {story: text, image_base64: encoded image}}
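
A minimal TypeScript sketch of this happy path, assuming Node 18+ (for fetch and Buffer). The helpers buildWorkflow and extractOutputFilename are hypothetical stand-ins for building the ComfyUI workflow graph and parsing its history payload; note also that ComfyUI's actual submit response carries the id as prompt_id, which the diagrams label job_id.

```typescript
// Hypothetical helpers: a real implementation would build a full ComfyUI
// workflow graph and dig the output filename out of the history payload.
declare function buildWorkflow(illustrationPrompt: string): unknown;
declare function extractOutputFilename(history: unknown): string | undefined;

const POLL_INTERVAL_MS = 3_000;

async function generateFantasy(
  imageBase64: string,
  userPrompt: string | undefined,
  ollamaUrl: string,
  comfyUrl: string,
): Promise<{ story: string; image_base64: string }> {
  // 1. Vision + story: Ollama's /api/generate accepts base64 images
  //    alongside the text prompt.
  const storyRes = await fetch(`${ollamaUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava",
      images: [imageBase64],
      prompt: userPrompt ?? "Read the image and write a fantasy story.",
      stream: false,
    }),
  });
  const story: string = (await storyRes.json()).response;

  // 2. Queue the illustration job with ComfyUI.
  const submitRes = await fetch(`${comfyUrl}/prompt`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: buildWorkflow(story) }),
  });
  const { prompt_id: jobId } = await submitRes.json();

  // 3. Poll the job history every 3 seconds until an output file appears.
  let filename: string | undefined;
  while (!filename) {
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
    const history = await (await fetch(`${comfyUrl}/history/${jobId}`)).json();
    filename = extractOutputFilename(history);
  }

  // 4. Download the rendered image and return both artefacts together.
  const bytes = await (await fetch(`${comfyUrl}/view?filename=${filename}`)).arrayBuffer();
  return { story, image_base64: Buffer.from(bytes).toString("base64") };
}
```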

Fantasy Mode - Partial Pipeline (Image Generation Fails)

The vision and story-generation step completes successfully, but ComfyUI fails to generate the image. The Fantasy Orchestrator returns a partial result containing the story, with error context indicating that image generation failed. The client receives a 200 with the partial data rather than a 500, so the story can still be used even without the illustration.

sequenceDiagram
    autonumber
    participant Client
    participant AuthFilter as Auth Filter
    participant Orchestrator as Fantasy Orchestrator
    participant GpuManager as GPU Lifecycle Manager
    participant GpuOllama as Ollama (GPU Pod)
    participant ComfyUI as ComfyUI (GPU Pod)

    Client->>AuthFilter: POST /api/v1/fantasy/generate {bearer, image_base64, prompt}
    AuthFilter-->>Orchestrator: authorised

    Orchestrator->>GpuManager: getStatus()
    GpuManager-->>Orchestrator: READY
    Orchestrator->>GpuManager: resetIdleTimer()

    Orchestrator->>GpuOllama: POST /api/generate {model: llava, image_base64, prompt}
    GpuOllama-->>Orchestrator: {story: generated story text}

    Orchestrator->>ComfyUI: POST /prompt {workflow, prompt}
    ComfyUI-->>Orchestrator: {job_id: abc123}

    loop poll until complete or timeout
        Orchestrator->>ComfyUI: GET /history/{job_id}
        alt ComfyUI error response
            ComfyUI-->>Orchestrator: {status: error, message: generation failed}
            Orchestrator->>Orchestrator: mark image stage as failed, continue with partial result
        end
    end

    Orchestrator-->>Client: 200 {status: success, data: {story: text, image_base64: null, warnings: [{code: IMAGE_GEN_FAILED, message: illustration could not be generated}]}}
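
A sketch of this graceful degradation in TypeScript; generateStory and generateIllustration are hypothetical wrappers around the two pipeline stages, and the warning shape mirrors the diagram above.

```typescript
// Wrap the image stage so its failure downgrades the response to a
// partial result instead of propagating as a 500.

interface Warning {
  code: string;
  message: string;
}

interface FantasyResponse {
  status: "success";
  data: { story: string; image_base64: string | null; warnings?: Warning[] };
}

// Hypothetical stage wrappers; generateIllustration may throw on a
// ComfyUI error response or a polling timeout.
declare function generateStory(imageBase64: string, prompt?: string): Promise<string>;
declare function generateIllustration(story: string): Promise<string>;

async function generateWithFallback(
  imageBase64: string,
  prompt?: string,
): Promise<FantasyResponse> {
  // The story stage is required: if it fails, the whole request fails.
  const story = await generateStory(imageBase64, prompt);

  try {
    const image = await generateIllustration(story);
    return { status: "success", data: { story, image_base64: image } };
  } catch {
    // Image stage failed: still HTTP 200, but flag the missing illustration.
    return {
      status: "success",
      data: {
        story,
        image_base64: null,
        warnings: [
          { code: "IMAGE_GEN_FAILED", message: "illustration could not be generated" },
        ],
      },
    };
  }
}
```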

OpenWebUI Chat - Warm Pod

OpenWebUI sends a standard Ollama-compatible chat request to the gateway. The Ollama Proxy confirms the pod is ready, resets the idle timer, and forwards the request transparently to Ollama on the GPU pod. The response is streamed back through the proxy to OpenWebUI, which is unaware that a proxy sits between it and Ollama.

sequenceDiagram
    autonumber
    participant OpenWebUI as OpenWebUI
    participant AuthFilter as Auth Filter
    participant OllamaProxy as Ollama Proxy Service
    participant GpuManager as GPU Lifecycle Manager
    participant GpuOllama as Ollama (GPU Pod)

    OpenWebUI->>AuthFilter: POST /api/v1/chat {bearer, messages, model, stream: true}
    AuthFilter->>AuthFilter: validate JWT
    AuthFilter-->>OllamaProxy: authorised

    OllamaProxy->>GpuManager: getStatus()
    GpuManager-->>OllamaProxy: READY
    OllamaProxy->>GpuManager: resetIdleTimer()

    OllamaProxy->>GpuOllama: POST /api/chat {messages, model, stream: true}
    GpuOllama-->>OllamaProxy: stream token chunks
    OllamaProxy-->>OpenWebUI: stream token chunks
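
A minimal pass-through sketch, assuming Node 18+ fetch, the Web-standard Request/Response types, and a hypothetical gpuManager facade over the GPU Lifecycle Manager.

```typescript
declare const gpuManager: {
  getStatus(): Promise<"READY" | "STOPPED" | "STARTING" | "WARMING">;
  resetIdleTimer(): void;
};

async function proxyChat(req: Request, ollamaUrl: string): Promise<Response> {
  if ((await gpuManager.getStatus()) !== "READY") {
    // Cold-start handling is its own flow (next section); placeholder here.
    return new Response("GPU pod not ready", { status: 503 });
  }
  gpuManager.resetIdleTimer(); // every forwarded request keeps the pod warm

  // Forward the body verbatim. Returning upstream.body pipes Ollama's
  // token chunks back as they arrive, so the proxy stays invisible.
  const upstream = await fetch(`${ollamaUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: await req.text(),
  });
  return new Response(upstream.body, {
    status: upstream.status,
    headers: {
      "Content-Type": upstream.headers.get("Content-Type") ?? "application/x-ndjson",
    },
  });
}
```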

OpenWebUI Chat - Cold Pod

OpenWebUI sends a chat request, but the pod is not running. The Ollama Proxy starts the pod and holds the connection in the Request Queue during warmup. Once the pod is ready, the request is forwarded and the response streamed back. OpenWebUI experiences a delay but receives a normal streaming response with no error.

sequenceDiagram
    autonumber
    participant OpenWebUI as OpenWebUI
    participant AuthFilter as Auth Filter
    participant OllamaProxy as Ollama Proxy Service
    participant GpuManager as GPU Lifecycle Manager
    participant Queue as Request Queue
    participant Provider as Provider Port
    participant GpuOllama as Ollama (GPU Pod)

    OpenWebUI->>AuthFilter: POST /api/v1/chat {bearer, messages, model, stream: true}
    AuthFilter-->>OllamaProxy: authorised

    OllamaProxy->>GpuManager: getStatus()
    GpuManager-->>OllamaProxy: STOPPED

    OllamaProxy->>GpuManager: requestStart()
    GpuManager->>Provider: start()
    Provider-->>GpuManager: pod STARTING

    OllamaProxy->>Queue: enqueue(request, clientConnection)

    Note over GpuManager, GpuOllama: Warmup polling runs in background (see Doc 1 Cold Start for full detail)

    GpuOllama-->>GpuManager: health check passes
    GpuManager->>GpuManager: transition to READY
    GpuManager->>Queue: notifyReady()

    Queue->>OllamaProxy: dequeue(request, clientConnection)
    OllamaProxy->>GpuOllama: POST /api/chat {messages, model, stream: true}
    GpuOllama-->>OllamaProxy: stream token chunks
    OllamaProxy-->>OpenWebUI: stream token chunks
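
One way to model the hold-during-warmup behaviour, sketched with a promise-based queue; gpuManager and forwardChat are the same hypothetical facades as in the warm-pod sketch. A production queue would also need a warmup timeout so held connections are eventually failed rather than parked forever.

```typescript
type Waiter = () => void;

class RequestQueue {
  private waiters: Waiter[] = [];

  // Resolves once notifyReady() fires; the client connection stays open.
  waitForReady(): Promise<void> {
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  // Called by the GPU Lifecycle Manager once the health check passes.
  notifyReady(): void {
    for (const release of this.waiters.splice(0)) release();
  }
}

declare const gpuManager: {
  getStatus(): Promise<"READY" | "STOPPED" | "STARTING" | "WARMING">;
  requestStart(): Promise<void>;
};
declare function forwardChat(body: string, ollamaUrl: string): Promise<Response>;

const queue = new RequestQueue();

async function handleChat(body: string, ollamaUrl: string): Promise<Response> {
  const status = await gpuManager.getStatus();
  if (status !== "READY") {
    if (status === "STOPPED") {
      await gpuManager.requestStart(); // kicks off the provider start
    }
    await queue.waitForReady(); // hold the request through warmup
  }
  // From here the flow is identical to the warm-pod case.
  return forwardChat(body, ollamaUrl);
}
```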

OpenWebUI Model List Request

OpenWebUI periodically requests the list of available models to populate its model selector. The Ollama Proxy forwards the request to Ollama on the GPU pod when the pod is running; if the pod is stopped, the proxy returns an empty model list rather than triggering a pod start for a mere metadata request.

sequenceDiagram
    autonumber
    participant OpenWebUI as OpenWebUI
    participant AuthFilter as Auth Filter
    participant OllamaProxy as Ollama Proxy Service
    participant GpuManager as GPU Lifecycle Manager
    participant GpuOllama as Ollama (GPU Pod)

    OpenWebUI->>AuthFilter: GET /api/tags {bearer}
    AuthFilter-->>OllamaProxy: authorised

    OllamaProxy->>GpuManager: getStatus()

    alt Pod is READY
        GpuManager-->>OllamaProxy: READY
        OllamaProxy->>GpuOllama: GET /api/tags
        GpuOllama-->>OllamaProxy: {models: [{name: llava:13b}, {name: dolphin-llama3}, {name: deepseek-coder}]}
        OllamaProxy-->>OpenWebUI: {models: [{name: llava:13b}, {name: dolphin-llama3}, {name: deepseek-coder}]}
    else Pod is STOPPED or STARTING or WARMING
        GpuManager-->>OllamaProxy: not READY
        OllamaProxy-->>OpenWebUI: {models: []}
    end
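
The fallback fits in a few lines; a sketch under the same gpuManager assumption as the previous flows.

```typescript
declare const gpuManager: {
  getStatus(): Promise<"READY" | "STOPPED" | "STARTING" | "WARMING">;
};

async function listModels(ollamaUrl: string): Promise<{ models: unknown[] }> {
  if ((await gpuManager.getStatus()) !== "READY") {
    // A metadata poll is not worth a cold start: report no models instead.
    return { models: [] };
  }
  // Pass Ollama's /api/tags payload through unchanged.
  const res = await fetch(`${ollamaUrl}/api/tags`);
  return res.json();
}
```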

Coding Assistant Flow

A client sends a coding request with code context and a question. The Coding Assistant Service applies a coding-specific system prompt and delegates to the Ollama Proxy, which forwards the request to the coding model on the GPU pod. The response is streamed back to the client. The coding model (DeepSeek Coder or Qwen2.5 Coder) is selected via configuration.

sequenceDiagram
    autonumber
    participant Client
    participant AuthFilter as Auth Filter
    participant CodingService as Coding Assistant Service
    participant OllamaProxy as Ollama Proxy Service
    participant GpuManager as GPU Lifecycle Manager
    participant GpuOllama as Ollama (GPU Pod)

    Client->>AuthFilter: POST /api/v1/code/assist {bearer, code, question, language}
    AuthFilter->>AuthFilter: validate JWT
    AuthFilter-->>CodingService: authorised

    CodingService->>GpuManager: getStatus()
    GpuManager-->>CodingService: READY
    CodingService->>GpuManager: resetIdleTimer()

    CodingService->>CodingService: prepend coding system prompt to messages
    CodingService->>OllamaProxy: forwardChat(messages, model: deepseek-coder)

    OllamaProxy->>GpuOllama: POST /api/chat {messages with system prompt, model: deepseek-coder, stream: true}
    GpuOllama-->>OllamaProxy: stream code response tokens
    OllamaProxy-->>CodingService: stream tokens
    CodingService-->>Client: stream tokens
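
A sketch of the prompt shaping and config-driven model selection; the system prompt text, the CODING_MODEL environment variable, and the forwardChat delegate are all illustrative, not part of the real service.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Illustrative prompt text; the real system prompt lives in configuration.
const CODING_SYSTEM_PROMPT =
  "You are a coding assistant. Answer with correct, idiomatic code and brief explanations.";

// Hypothetical delegate that streams tokens from the Ollama Proxy.
declare function forwardChat(messages: ChatMessage[], model: string): AsyncIterable<string>;

function buildMessages(code: string, question: string, language: string): ChatMessage[] {
  return [
    { role: "system", content: CODING_SYSTEM_PROMPT },
    { role: "user", content: `Language: ${language}\n\n${code}\n\n${question}` },
  ];
}

async function* assist(
  code: string,
  question: string,
  language: string,
  // Config-driven selection between the two candidate models.
  model: string = process.env.CODING_MODEL ?? "deepseek-coder",
): AsyncIterable<string> {
  // Tokens stream straight through to the client, as in the diagram.
  yield* forwardChat(buildMessages(code, question, language), model);
}
```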