DEC-001
Context: A language and framework had to be chosen for the control server. FastAPI with Python is the obvious candidate on paper: the AI ecosystem is Python-native and most provider SDKs are Python-first.
Decision: Spring Boot 4.0 on Java 25.
Alternatives Considered: FastAPI with Python, which gives native access to provider SDKs and easier ML library integration.
Consequences: No official Java SDKs exist for Vast.ai or RunPod, so all provider communication is implemented as raw HTTP via Spring WebClient. The gateway does not need ML libraries directly, since all model execution happens on the GPU pod. Existing Java expertise outweighs the SDK convenience. Spring Boot 4.0 on Java 25 is the current recommended production stack per Spring Framework 7.x guidance.
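A minimal sketch of what the raw-HTTP provider communication looks like with WebClient. The base URL, endpoint path, and header are illustrative assumptions, not the actual Vast.ai or RunPod API contract:

```java
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// Raw-HTTP provider client built on Spring WebClient (no official Java SDK).
// The base URL, path, and auth header are hypothetical stand-ins.
public class RunPodHttpClient {

    private final WebClient webClient;

    public RunPodHttpClient(WebClient.Builder builder, String apiKey) {
        this.webClient = builder
                .baseUrl("https://api.runpod.example") // hypothetical base URL
                .defaultHeader("Authorization", "Bearer " + apiKey)
                .build();
    }

    // Hypothetical call: fetch the status of a pod by id as raw JSON.
    public Mono<String> podStatus(String podId) {
        return webClient.get()
                .uri("/pods/{id}", podId)
                .retrieve()
                .bodyToMono(String.class);
    }
}
```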
DEC-002
Context: Two GPU cloud providers were evaluated, and a decision was needed on which to support and whether to abstract them.
Decision: Both providers are supported behind a common abstraction interface; provider selection is configuration-driven at runtime.
Alternatives Considered: RunPod only (simpler implementation). Vast.ai only (cheaper SA hourly rates).
Consequences: The abstraction layer adds implementation effort but protects against provider pricing changes, outages, and GPU availability constraints. Vast.ai offers cheaper GPU hours for SA instances but more expensive persistent storage. RunPod offers stable proxy URLs, which simplify the Ollama proxy implementation. Having both available allows cost-optimised selection per session, via the interface sketched below.
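A sketch of what the abstraction and configuration-driven selection might look like. The method names, the PodSpec and PodHandle types, and the gateway.provider property are assumptions, not the actual gateway code:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hypothetical value types standing in for whatever the real gateway uses.
record PodSpec(String gpuType, int diskGb) {}
record PodHandle(String providerName, String podId) {}

// The common abstraction both providers implement.
interface GpuProvider {
    String name();                    // e.g. "vastai" or "runpod"
    PodHandle startPod(PodSpec spec); // provision and boot a GPU pod
    void stopPod(PodHandle handle);   // tear the pod down
}

// Configuration-driven selection at runtime, e.g. gateway.provider=runpod.
@Component
class ProviderSelector {
    private final Map<String, GpuProvider> byName;
    private final String configured;

    ProviderSelector(List<GpuProvider> providers,
                     @Value("${gateway.provider}") String configured) {
        this.byName = providers.stream()
                .collect(Collectors.toMap(GpuProvider::name, p -> p));
        this.configured = configured;
    }

    GpuProvider current() {
        return byName.get(configured);
    }
}
```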
DEC-003
Context: OpenWebUI expects to talk directly to an Ollama instance. The gateway needed to sit between OpenWebUI and the real Ollama on the GPU pod without requiring OpenWebUI modification.
Decision: The gateway implements an Ollama-compatible API surface, so OpenWebUI points at the gateway with no configuration change beyond the base URL.
Alternatives Considered: An OpenWebUI custom plugin or middleware (requires forking). Direct Ollama exposure (bypasses pod lifecycle management entirely).
Consequences: The gateway must maintain compatibility with the Ollama API contracts as Ollama evolves. OpenWebUI works without modification, and the pod lifecycle is fully transparent to the user.
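A sketch of the compatible surface on two of the paths OpenWebUI actually calls (/api/tags and /api/chat are real Ollama endpoints). Streaming responses and the lifecycle hook that starts the pod on demand are elided, and the pod URL is a hypothetical placeholder:

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// The gateway re-exposes the Ollama paths OpenWebUI already uses and forwards
// them to the real Ollama on the pod. Warmup/queueing logic is elided here;
// the real /api/chat also streams NDJSON, which this sketch simplifies away.
@RestController
public class OllamaFacadeController {

    private final WebClient podOllama; // points at the GPU pod's Ollama

    public OllamaFacadeController(WebClient.Builder builder) {
        // Hypothetical pod URL; in practice this is resolved per session.
        this.podOllama = builder.baseUrl("http://pod.example:11434").build();
    }

    @GetMapping("/api/tags")
    public Mono<String> tags() {
        return podOllama.get().uri("/api/tags").retrieve().bodyToMono(String.class);
    }

    @PostMapping("/api/chat")
    public Mono<String> chat(@RequestBody String body) {
        return podOllama.post().uri("/api/chat").bodyValue(body)
                .retrieve().bodyToMono(String.class);
    }
}
```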
DEC-004
Context: A bot interface was needed for pod management and status monitoring. WhatsApp was the preferred platform.
Decision: Telegram was chosen as the initial bot platform.
Alternatives Considered: WhatsApp via the Twilio sandbox (requires account setup and per-message cost). WhatsApp Business API (requires Meta business verification). Unofficial WhatsApp libraries (fragile and against WhatsApp's terms of service).
Consequences: The Telegram Bot API is free, instant to set up, and has no per-message cost. The bot service is a pure protocol adapter, so migrating to WhatsApp later requires no changes to any other service (see the sketch below).
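A sketch of what "pure protocol adapter" means at this boundary. The interface and method names are illustrative assumptions, not the actual bot service types:

```java
import java.util.function.BiConsumer;

// The only per-platform surface: everything behind it is platform-agnostic.
interface BotPlatform {
    void sendMessage(String chatId, String text);        // outbound message
    void onMessage(BiConsumer<String, String> handler);  // inbound (chatId, text)
}

// Telegram today; a WhatsAppAdapter later would implement the same interface
// without touching any other service.
class TelegramAdapter implements BotPlatform {
    @Override
    public void sendMessage(String chatId, String text) {
        // call the Telegram Bot API sendMessage method here
    }

    @Override
    public void onMessage(BiConsumer<String, String> handler) {
        // register a long-polling or webhook listener here
    }
}
```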
DEC-005
Context: Requests arriving during pod warmup needed handling without dropping them or forcing every client to implement retry logic.
Decision: An in-memory request queue holds connections during warmup and drains automatically once the pod is ready.
Alternatives Considered: HTTP 503 with a Retry-After header (pushes retry logic onto every client). A persistent queue via Redis (adds infrastructure complexity not warranted at current scale).
Consequences: With an in-memory queue, requests are lost if the gateway restarts during warmup; this is acceptable at current scale. A maximum wait time prevents unbounded memory growth.
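A sketch of the warmup queue using CompletableFuture. The 120-second bound is an assumed value (this record does not specify one), and the types are illustrative:

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;

// Requests park as futures while the pod warms up and are drained when it is
// ready. The max wait bounds memory growth; state is lost on restart, as the
// consequences above note.
class WarmupQueue {
    private static final long MAX_WAIT_SECONDS = 120; // assumed bound

    private final Queue<CompletableFuture<Void>> waiting = new ConcurrentLinkedQueue<>();

    // Called for each request that arrives while the pod is warming up.
    CompletableFuture<Void> park() {
        CompletableFuture<Void> f = new CompletableFuture<>();
        // Fail the request instead of holding it forever.
        f.orTimeout(MAX_WAIT_SECONDS, TimeUnit.SECONDS);
        waiting.add(f);
        return f;
    }

    // Called once the pod reports ready: release every parked request.
    void drain() {
        CompletableFuture<Void> f;
        while ((f = waiting.poll()) != null) {
            f.complete(null);
        }
    }
}
```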
DEC-006
Context: Authentication was needed, but building full user management before the core features work would add risk and delay the initial build.
Decision: JWT validation is implemented from day one; token issuance and user management are deferred to post-launch. A static pre-issued token is used during development.
Alternatives Considered: No auth until post-launch (unacceptable: the gateway controls expensive cloud resources). Full auth with user management from day one (slows the initial build significantly).
Consequences: The gateway is secured from day one. Adding token issuance and user accounts later requires new endpoints and a user store but does not require restructuring the validation filter chain.
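A sketch of the day-one posture, assuming Spring Security's OAuth2 resource server support with the JWT decoder (shared secret or key) configured elsewhere. Issuance endpoints can be added later without restructuring this chain:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

// Every request must carry a valid JWT from day one, even though no issuance
// or user-management endpoints exist yet. Decoder configuration is assumed to
// live in application properties.
@Configuration
public class SecurityConfig {

    @Bean
    SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http
            .csrf(csrf -> csrf.disable()) // token-based API, no browser sessions
            .authorizeHttpRequests(auth -> auth.anyRequest().authenticated())
            .oauth2ResourceServer(oauth -> oauth.jwt(jwt -> {}));
        return http.build();
    }
}
```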
DEC-007
Context: The system needed a named architectural pattern to guide code organisation and keep business logic testable independently of infrastructure.
Decision: Layered Hexagonal Architecture: business logic in services, all external systems behind interfaces, controllers as thin HTTP adapters only.
Alternatives Considered: A simple layered architecture without interface abstractions (simpler, but makes provider swapping and testing harder). Full clean architecture with use-case classes (more overhead than warranted for a system of this size).
Consequences: Each external system can be mocked in tests without spinning up real infrastructure. Provider implementations can be swapped without touching service logic. The added interface layer is a small upfront cost with a significant long-term maintainability benefit.
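A sketch of the layering with hypothetical names: the service holds business logic and sees only a port interface, so a test can substitute a stub without any real infrastructure:

```java
// Port: the only thing the service layer sees; a provider adapter implements
// it on the infrastructure side. Names are illustrative.
interface PodLifecyclePort {
    void ensureRunning();
}

// Business logic, free of infrastructure concerns.
class SessionService {
    private final PodLifecyclePort pods;

    SessionService(PodLifecyclePort pods) {
        this.pods = pods;
    }

    void beginSession() {
        pods.ensureRunning(); // no provider details leak into the service
    }
}

// In a test, the port is just a lambda; no real infrastructure is needed:
//   new SessionService(() -> {}).beginSession();
```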
DEC-008
Context: The Telegram bot needed to understand natural language rather than rigid commands, and the intent classification model needed to work even when the GPU pod is off.
Decision: A lightweight LLM (phi3:mini or tinyllama) runs in a local Ollama instance on the home server for intent classification only. This is a separate Ollama instance from the one on the GPU pod.
Alternatives Considered: Using the GPU pod's Ollama for intent classification (unavailable when the pod is off, defeating the purpose of the bot). A cloud LLM API for intent classification (introduces an external dependency, cost, and a privacy concern). Keyword matching only (rigid; breaks on natural phrasing).
Consequences: The home server CPU handles a small model with acceptable latency for conversational interaction. The bot operates independently of GPU pod state. The two Ollama instances are managed by separate adapter classes to avoid coupling (sketched below).
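A sketch of keeping the two instances uncoupled behind separate client beans, so the intent classifier never accidentally depends on the GPU pod being up. The property names and qualifiers are illustrative assumptions:

```java
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.reactive.function.client.WebClient;

// One client per Ollama instance; each adapter class injects only the
// qualifier it needs, so the two instances stay fully decoupled.
@Configuration
public class OllamaClients {

    @Bean
    @Qualifier("localOllama") // home-server instance: always on, small model
    WebClient localOllama(@Value("${ollama.local.url}") String url) {
        return WebClient.create(url);
    }

    @Bean
    @Qualifier("podOllama")   // GPU-pod instance: only up when a pod is running
    WebClient podOllama(@Value("${ollama.pod.url}") String url) {
        return WebClient.create(url);
    }
}
```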