A. 🕋 Problem Statement

Using LLM APIs (Anthropic, OpenAI, …) in production runs into two major issues: cost and latency. Traditional caching solutions (Redis, Memcached) work on exact-match keys.

However, users often ask questions that are phrased differently but semantically identical (e.g. “Who’s the CEO of Google?” vs. “Google’s current CEO?”). Exact-match caches miss in these cases, resulting in costly API calls (2–5 s latency, per-token cost).
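To make the limitation concrete, here is a minimal sketch (not from the original design) of an exact-match cache keyed on the literal prompt string; the paraphrased query misses even though the intent is the same:

```go
package main

import "fmt"

func main() {
	// Exact-match cache: the key is the raw prompt string.
	cache := map[string]string{}

	// First query misses, so we "call the LLM" and store the answer.
	cache["Who's the CEO of Google?"] = "Sundar Pichai"

	// Same intent, different wording -> different key -> cache miss,
	// and another expensive API call.
	_, hit := cache["Google's current CEO?"]
	fmt.Println(hit) // false
}
```

A semantic cache avoids this by comparing meanings rather than strings.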

B. 🧰 Proposed Solution

🕊️ Strix: a high-performance Semantic Cache Proxy

The system converts the query (user prompt) into a vector embedding, a step called vectorization, then runs a similarity search against previously cached embeddings; if a sufficiently similar query is found, the cached result is returned instead of calling the LLM.

C. ♨️ Non-functional Requirements

D. ⚡ Functional Requirements

E. ⛩️ High-level Architecture

(Architecture diagram: image.png)

HTTP Gateway - Layer 7 (Golang): Handles the HTTP connection pool, parses JSON, and manages the request lifecycle and timeouts.

IPC Contract (gRPC): Defines the Protobuf contract for payload communication over a Unix Domain Socket (UDS).
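Such a contract might look like the following proto3 sketch; the package, service, and message names here are hypothetical, not the actual Strix API:

```protobuf
syntax = "proto3";

package strix.v1;

// Hypothetical contract between the Go gateway and the cache engine.
service SemanticCache {
  // Lookup returns a cached response for a semantically similar
  // prompt, or a miss so the gateway can fall back to the LLM API.
  rpc Lookup(LookupRequest) returns (LookupResponse);
}

message LookupRequest {
  string prompt = 1;
}

message LookupResponse {
  bool   hit      = 1;
  string response = 2;
  float  score    = 3; // similarity of the best cached match
}
```

gRPC-Go accepts `unix://` targets, so the gateway can dial the socket path directly and skip the TCP stack for same-host IPC.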