A. 🕋 Problem Statement

Using LLM APIs (Anthropic, OpenAI, …) in production runs into two major issues: cost and latency. Traditional caching solutions (Redis, Memcached) work on exact-match keys.

However, users often ask questions that are phrased differently but semantically identical (e.g. “Who’s the CEO of Google?” vs. “Google’s current CEO?”). Exact-match caches miss in these cases, resulting in costly API calls (2–5 s latency, per-token cost).
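To make the limitation concrete, here is a minimal sketch (not from the original design) of an exact-match cache keyed on the literal prompt string; the paraphrased query misses even though the intent is the same:

```go
package main

import "fmt"

func main() {
	// Exact-match cache: the key is the raw prompt string.
	cache := map[string]string{}

	// First query misses, so we "call the LLM" and store the answer.
	cache["Who's the CEO of Google?"] = "Sundar Pichai"

	// Same intent, different wording -> different key -> cache miss,
	// and another expensive API call.
	_, hit := cache["Google's current CEO?"]
	fmt.Println(hit) // false
}
```

A semantic cache avoids this by comparing meanings rather than strings.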

B. 🧰 Proposed Solution

🕊️ Strix: a high-performance Semantic Cache Proxy

The system converts the query (user prompt) into a vector embedding, a step called vectorization, then runs a similarity search against previously cached embeddings; if a sufficiently similar query is found, the cached result is returned instead of calling the LLM.

C. ♨️ Non-functional Requirements

D. ⚡ Functional Requirements

E. ⛩️ High-level Architecture

(Architecture diagram: image.png)

HTTP Gateway - Layer 7 (Golang): Handles the HTTP connection pool, parses JSON, and manages the request lifecycle and timeouts.

IPC Contract (gRPC): Defines the Protobuf contract for payload communication over a Unix Domain Socket (UDS).
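Such a contract might look like the following proto3 sketch; the package, service, and message names here are hypothetical, not the actual Strix API:

```protobuf
syntax = "proto3";

package strix.v1;

// Hypothetical contract between the Go gateway and the cache engine.
service SemanticCache {
  // Lookup returns a cached response for a semantically similar
  // prompt, or a miss so the gateway can fall back to the LLM API.
  rpc Lookup(LookupRequest) returns (LookupResponse);
}

message LookupRequest {
  string prompt = 1;
}

message LookupResponse {
  bool   hit      = 1;
  string response = 2;
  float  score    = 3; // similarity of the best cached match
}
```

gRPC-Go accepts `unix://` targets, so the gateway can dial the socket path directly and skip the TCP stack for same-host IPC.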