All always-on components run on a personal home server (an HP EliteDesk 800 with a 6th-gen i5). This machine hosts the Zoltraak Gateway, OpenWebUI, and a local Ollama instance used exclusively by the Telegram bot for natural-language intent classification. The home server has no GPU, so all model inference on this machine runs on the CPU using small, lightweight models suitable only for intent classification.
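A minimal sketch of how the bot could use the local CPU Ollama for intent classification, via Ollama's standard `/api/generate` endpoint. The model name and intent labels here are illustrative assumptions, not part of the actual gateway:

```python
import json
import urllib.request

# Assumed values for illustration; the real gateway's model and labels may differ.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:0.5b"  # any small CPU-friendly model
INTENTS = ("chat", "image", "pod_start", "pod_stop", "status")

def parse_intent(raw: str) -> str:
    """Map the model's free-form reply onto a known intent, defaulting to chat."""
    reply = raw.strip().lower()
    for intent in INTENTS:
        if intent in reply:
            return intent
    return "chat"

def classify_intent(text: str) -> str:
    """Ask the local CPU Ollama instance to pick one intent label for a message."""
    prompt = (
        "Classify the user message into exactly one of: "
        + ", ".join(INTENTS)
        + ". Reply with the label only.\nMessage: "
        + text
    )
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return parse_intent(json.load(resp)["response"])
```

Keeping `parse_intent` tolerant of extra words in the reply matters with very small models, which often echo part of the prompt instead of answering with a bare label.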

The GPU pod is a separate, ephemeral machine rented from Vast.ai or RunPod. The gateway starts it on demand and stops it after an idle timeout. The pod runs Ollama and ComfyUI as background processes; models are stored on a persistent volume that survives pod stop and restart.
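The idle-timeout behavior can be sketched as a small activity tracker inside the gateway. This is an assumed implementation shape, not the actual gateway code; the provider stop call is left as a comment because the Vast.ai/RunPod APIs differ:

```python
import time

class IdleTimer:
    """Decides when the GPU pod has been idle long enough to stop.

    `clock` is injectable for testing; `idle_timeout_s` is an assumed
    configuration value.
    """

    def __init__(self, idle_timeout_s: float, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock
        self.last_activity = clock()

    def touch(self) -> None:
        """Call on every request the gateway proxies to the pod."""
        self.last_activity = self.clock()

    def should_stop(self) -> bool:
        """True once no activity has been seen for the full timeout window."""
        return self.clock() - self.last_activity >= self.idle_timeout_s

# A background loop in the gateway would periodically check should_stop()
# and, when it returns True, call the provider's stop API, e.g.:
#   if timer.should_stop(): stop_pod()  # Vast.ai / RunPod API call
```

Using a monotonic clock avoids spurious stops or misses when the system wall clock jumps (NTP adjustments, suspend/resume).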

```mermaid
graph LR
    subgraph HOME[Home Server - Always On]
        GW[Zoltraak Gateway]
        OWU[OpenWebUI]
        LOLLAMA[Ollama<br>CPU only<br>Bot brain]
    end

    subgraph POD[On-Demand GPU Pod]
        OLLAMA[Ollama<br>GPU]
        COMFY[ComfyUI<br>GPU]
        VOL[(Persistent Volume<br>Model Storage)]
    end

    GW <-->|REST| OWU
    GW <-->|Intent classification| LOLLAMA
    GW -->|Starts / Stops| POD
    GW <-->|REST| OLLAMA
    GW <-->|REST| COMFY
    OLLAMA --- VOL
    COMFY --- VOL
```