A client submits an image and an optional prompt to the fantasy mode endpoint. The Fantasy Orchestrator verifies that the GPU pod is running, sends the image to the vision model in Ollama to generate a story, then sends the story as an image-generation prompt to ComfyUI to produce an accompanying illustration. Both results are assembled and returned to the client.
Unless a diagram states otherwise, the flows in this document assume the GPU pod is already in the READY state.
sequenceDiagram
autonumber
participant Client
participant AuthFilter as Auth Filter
participant Orchestrator as Fantasy Orchestrator
participant GpuManager as GPU Lifecycle Manager
participant GpuOllama as Ollama (GPU Pod)
participant ComfyUI as ComfyUI (GPU Pod)
Client->>AuthFilter: POST /api/v1/fantasy/generate {bearer, image_base64, prompt}
AuthFilter->>AuthFilter: validate JWT
AuthFilter-->>Orchestrator: authorised, user context attached
Orchestrator->>GpuManager: getStatus()
GpuManager-->>Orchestrator: READY
Orchestrator->>GpuManager: resetIdleTimer()
Orchestrator->>GpuOllama: POST /api/generate {model: llava, image_base64, prompt: read image and write a fantasy story}
GpuOllama-->>Orchestrator: {story: generated story text}
Orchestrator->>ComfyUI: POST /prompt {workflow: image_gen_workflow, prompt: story excerpt for illustration}
ComfyUI-->>Orchestrator: {job_id: abc123}
loop poll every 3 seconds until complete
Orchestrator->>ComfyUI: GET /history/{job_id}
alt job still running
ComfyUI-->>Orchestrator: {status: running}
else job complete
ComfyUI-->>Orchestrator: {status: complete, filename: output.png}
end
end
Orchestrator->>ComfyUI: GET /view?filename=output.png
ComfyUI-->>Orchestrator: image bytes
Orchestrator-->>Client: 200 {status: success, data: {story: text, image_base64: encoded image}}
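A minimal Kotlin sketch of this happy path, using the JDK HTTP client. The base URLs, the placeholder completion check, and the hard-coded job id are assumptions, and JSON parsing is reduced to comments rather than the real response models.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.util.Base64

// Hypothetical pod-internal base URLs; the real values come from configuration.
const val OLLAMA = "http://gpu-pod:11434"
const val COMFYUI = "http://gpu-pod:8188"

val http: HttpClient = HttpClient.newHttpClient()

fun post(url: String, body: String): String = http.send(
    HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build(),
    HttpResponse.BodyHandlers.ofString()
).body()

fun get(url: String): String = http.send(
    HttpRequest.newBuilder(URI.create(url)).GET().build(),
    HttpResponse.BodyHandlers.ofString()
).body()

fun getBytes(url: String): ByteArray = http.send(
    HttpRequest.newBuilder(URI.create(url)).GET().build(),
    HttpResponse.BodyHandlers.ofByteArray()
).body()

// Returns the story text plus the base64-encoded illustration, or null if the image stage
// does not finish in time.
fun generateFantasy(imageBase64: String, userPrompt: String): Pair<String, String?> {
    // 1. Vision + story: the llava model reads the image and writes a fantasy story.
    val storyJson = post(
        "$OLLAMA/api/generate",
        """{"model":"llava","images":["$imageBase64"],"stream":false,"prompt":"Read the image and write a fantasy story. $userPrompt"}"""
    )
    val story = storyJson // the real service extracts the "response" field from this JSON

    // 2. Queue the illustration workflow in ComfyUI.
    val jobJson = post("$COMFYUI/prompt", """{"prompt":"image_gen_workflow plus a story excerpt"}""")
    val jobId = "abc123" // the real service extracts the job id from jobJson

    // 3. Poll /history every 3 seconds until the job completes, then fetch the rendered image.
    repeat(40) {
        if ("output.png" in get("$COMFYUI/history/$jobId")) { // placeholder completion check
            val image = getBytes("$COMFYUI/view?filename=output.png")
            return story to Base64.getEncoder().encodeToString(image)
        }
        Thread.sleep(3_000)
    }
    return story to null // image stage timed out; the caller returns a partial result
}
```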
The vision and story generation step completes successfully, but ComfyUI fails to generate the image. The Fantasy Orchestrator returns a partial result containing the story, with a warning indicating that image generation failed. The client receives a 200 with the partial data rather than a 500, so the story can still be used even without the illustration.
sequenceDiagram
autonumber
participant Client
participant AuthFilter as Auth Filter
participant Orchestrator as Fantasy Orchestrator
participant GpuManager as GPU Lifecycle Manager
participant GpuOllama as Ollama (GPU Pod)
participant ComfyUI as ComfyUI (GPU Pod)
Client->>AuthFilter: POST /api/v1/fantasy/generate {bearer, image_base64, prompt}
AuthFilter-->>Orchestrator: authorised
Orchestrator->>GpuManager: getStatus()
GpuManager-->>Orchestrator: READY
Orchestrator->>GpuManager: resetIdleTimer()
Orchestrator->>GpuOllama: POST /api/generate {model: llava, image_base64, prompt}
GpuOllama-->>Orchestrator: {story: generated story text}
Orchestrator->>ComfyUI: POST /prompt {workflow, prompt}
ComfyUI-->>Orchestrator: {job_id: abc123}
loop poll until complete or timeout
Orchestrator->>ComfyUI: GET /history/{job_id}
alt ComfyUI error response
ComfyUI-->>Orchestrator: {status: error, message: generation failed}
Orchestrator->>Orchestrator: mark image stage as failed, continue with partial result
end
end
Orchestrator-->>Client: 200 {status: success, data: {story: text, image_base64: null, warnings: [{code: IMAGE_GEN_FAILED, message: illustration could not be generated}]}}
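A sketch of the partial-result policy, assuming a hypothetical response model whose field names mirror the diagram; the point is that a failed image stage downgrades to a warning on a 200 rather than surfacing as a 500.

```kotlin
import java.util.Base64

// Hypothetical response shapes; the real API models live in the gateway codebase.
data class Warning(val code: String, val message: String)
data class FantasyResponse(
    val status: String,
    val story: String,
    val imageBase64: String?,
    val warnings: List<Warning> = emptyList()
)

fun assembleResult(story: String, imageResult: Result<ByteArray>): FantasyResponse =
    imageResult.fold(
        onSuccess = { bytes ->
            FantasyResponse("success", story, Base64.getEncoder().encodeToString(bytes))
        },
        onFailure = {
            // The story already succeeded, so the ComfyUI failure becomes a warning, not a 500.
            FantasyResponse(
                status = "success",
                story = story,
                imageBase64 = null,
                warnings = listOf(Warning("IMAGE_GEN_FAILED", "illustration could not be generated"))
            )
        }
    )
```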
OpenWebUI sends a standard Ollama-compatible chat request to the gateway. The Ollama Proxy confirms the pod is ready, resets the idle timer, and forwards the request transparently to Ollama on the GPU pod. The response is streamed back through the proxy to OpenWebUI, which is unaware that a proxy sits between it and Ollama.
sequenceDiagram
autonumber
participant OpenWebUI as OpenWebUI
participant AuthFilter as Auth Filter
participant OllamaProxy as Ollama Proxy Service
participant GpuManager as GPU Lifecycle Manager
participant GpuOllama as Ollama (GPU Pod)
OpenWebUI->>AuthFilter: POST /api/v1/chat {bearer, messages, model, stream: true}
AuthFilter->>AuthFilter: validate JWT
AuthFilter-->>OllamaProxy: authorised
OllamaProxy->>GpuManager: getStatus()
GpuManager-->>OllamaProxy: READY
OllamaProxy->>GpuManager: resetIdleTimer()
OllamaProxy->>GpuOllama: POST /api/chat {messages, model, stream: true}
GpuOllama-->>OllamaProxy: stream token chunks
OllamaProxy-->>OpenWebUI: stream token chunks
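A sketch of the pass-through step, assuming a hypothetical GpuManager interface and a writeChunk callback standing in for the open client connection; Ollama's newline-delimited streaming chunks are relayed unchanged.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Hypothetical lifecycle interface; the real GPU Lifecycle Manager sits behind the gateway.
interface GpuManager {
    fun getStatus(): String
    fun resetIdleTimer()
}

fun proxyChat(gpu: GpuManager, requestBody: String, writeChunk: (String) -> Unit) {
    check(gpu.getStatus() == "READY") { "GPU pod is not ready" }
    gpu.resetIdleTimer()

    val request = HttpRequest.newBuilder(URI.create("http://gpu-pod:11434/api/chat"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(requestBody)) // {messages, model, stream: true}
        .build()

    // Ollama streams newline-delimited JSON chunks; each line is forwarded to the client as-is,
    // so OpenWebUI sees exactly the response Ollama produced.
    HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofLines())
        .body()
        .forEach { writeChunk(it) }
}
```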
OpenWebUI sends a chat request, but the pod is not running. The Ollama Proxy asks the GPU Lifecycle Manager to start the pod and holds the connection in the Request Queue during warmup. Once the pod is ready, the request is forwarded and the response is streamed back. OpenWebUI experiences a delay but receives a normal streaming response with no error.
sequenceDiagram
autonumber
participant OpenWebUI as OpenWebUI
participant AuthFilter as Auth Filter
participant OllamaProxy as Ollama Proxy Service
participant GpuManager as GPU Lifecycle Manager
participant Queue as Request Queue
participant Provider as Provider Port
participant GpuOllama as Ollama (GPU Pod)
OpenWebUI->>AuthFilter: POST /api/v1/chat {bearer, messages, model, stream: true}
AuthFilter-->>OllamaProxy: authorised
OllamaProxy->>GpuManager: getStatus()
GpuManager-->>OllamaProxy: STOPPED
OllamaProxy->>GpuManager: requestStart()
GpuManager->>Provider: start()
Provider-->>GpuManager: pod STARTING
OllamaProxy->>Queue: enqueue(request, clientConnection)
Note over GpuManager, GpuOllama: Warmup polling runs in background (see Doc 1 Cold Start for full detail)
GpuOllama-->>GpuManager: health check passes
GpuManager->>GpuManager: transition to READY
GpuManager->>Queue: notifyReady()
Queue->>OllamaProxy: dequeue(request, clientConnection)
OllamaProxy->>GpuOllama: POST /api/chat {messages, model, stream: true}
GpuOllama-->>OllamaProxy: stream token chunks
OllamaProxy-->>OpenWebUI: stream token chunks
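A sketch of the queue's role, with hypothetical types: each queued entry keeps the original request body together with the still-open client connection, and notifyReady() drains the queue in arrival order once warmup completes.

```kotlin
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical queued entry: the request plus a callback representing the held client connection.
data class QueuedChat(val requestBody: String, val writeChunk: (String) -> Unit)

class RequestQueue(private val forward: (QueuedChat) -> Unit) {
    private val pending = LinkedBlockingQueue<QueuedChat>()

    // Called by the Ollama Proxy while the pod is STARTING/WARMING; no error reaches the client.
    fun enqueue(entry: QueuedChat) {
        pending.put(entry)
    }

    // Called by the GPU Lifecycle Manager once its health check passes and the pod is READY.
    fun notifyReady() {
        generateSequence { pending.poll() }.forEach { forward(it) }
    }
}
```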
OpenWebUI periodically requests the list of available models to populate its model selector. The Ollama Proxy forwards this request to Ollama on the GPU pod if the pod is READY. Otherwise, the proxy returns an empty model list rather than triggering a pod start for a metadata request.
sequenceDiagram
autonumber
participant OpenWebUI as OpenWebUI
participant AuthFilter as Auth Filter
participant OllamaProxy as Ollama Proxy Service
participant GpuManager as GPU Lifecycle Manager
participant GpuOllama as Ollama (GPU Pod)
OpenWebUI->>AuthFilter: GET /api/tags {bearer}
AuthFilter-->>OllamaProxy: authorised
OllamaProxy->>GpuManager: getStatus()
alt Pod is READY
GpuManager-->>OllamaProxy: READY
OllamaProxy->>GpuOllama: GET /api/tags
GpuOllama-->>OllamaProxy: {models: [{name: llava:13b}, {name: dolphin-llama3}, {name: deepseek-coder}]}
OllamaProxy-->>OpenWebUI: {models: [{name: llava:13b}, {name: dolphin-llama3}, {name: deepseek-coder}]}
else Pod is STOPPED or STARTING or WARMING
GpuManager-->>OllamaProxy: not READY
OllamaProxy-->>OpenWebUI: {models: []}
end
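A sketch of the metadata policy, with the status check and the tag fetch passed in as assumed callbacks: /api/tags is only forwarded when the pod is READY, and every other state yields an empty list so a model refresh never wakes the pod.

```kotlin
// podReady and fetchTags are stand-ins for the GPU Lifecycle Manager status check and the
// forwarded GET /api/tags call shown in the diagram.
fun listModels(podReady: Boolean, fetchTags: () -> String): String =
    if (podReady) fetchTags()        // pass Ollama's {models: [...]} response through unchanged
    else """{"models": []}"""        // STOPPED / STARTING / WARMING: never trigger a pod start
```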
A client sends a coding request containing code context and a question. The Coding Assistant Service applies a coding-specific system prompt and delegates to the Ollama Proxy, which forwards the request to the coding model on the GPU pod. The response is streamed back to the client. The coding model (Deepseek Coder or Qwen2.5 Coder) is selected by configuration.
sequenceDiagram
autonumber
participant Client
participant AuthFilter as Auth Filter
participant CodingService as Coding Assistant Service
participant OllamaProxy as Ollama Proxy Service
participant GpuManager as GPU Lifecycle Manager
participant GpuOllama as Ollama (GPU Pod)
Client->>AuthFilter: POST /api/v1/code/assist {bearer, code, question, language}
AuthFilter->>AuthFilter: validate JWT
AuthFilter-->>CodingService: authorised
CodingService->>GpuManager: getStatus()
GpuManager-->>CodingService: READY
CodingService->>GpuManager: resetIdleTimer()
CodingService->>CodingService: prepend coding system prompt to messages
CodingService->>OllamaProxy: forwardChat(messages, model: deepseek-coder)
OllamaProxy->>GpuOllama: POST /api/chat {messages with system prompt, model: deepseek-coder, stream: true}
GpuOllama-->>OllamaProxy: stream code response tokens
OllamaProxy-->>CodingService: stream tokens
CodingService-->>Client: stream tokens
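A sketch of the prompt-shaping step, assuming a simple message type; the system prompt text, the user-message layout, and the configuration lookup are illustrative, not the service's actual values.

```kotlin
data class Message(val role: String, val content: String)

// Illustrative system prompt; the real text lives in the Coding Assistant Service configuration.
val codingSystemPrompt = Message(
    role = "system",
    content = "You are a coding assistant. Answer with correct, idiomatic code and a brief explanation."
)

// Builds the message list that is handed to the Ollama Proxy via forwardChat(messages, model).
fun buildCodingRequest(
    code: String,
    question: String,
    language: String,
    configuredModel: String // e.g. "deepseek-coder" or "qwen2.5-coder", driven by configuration
): Pair<String, List<Message>> {
    val messages = listOf(
        codingSystemPrompt,
        Message("user", "Language: $language\n\n$code\n\n$question")
    )
    return configuredModel to messages
}
```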