kvcached — A library that enables virtualized, elastic KV cache for LLM serving on shared GPUs

**Jiarong Xing$^{1,2}$, Yifan Qiao$^{1}$, Shan Yu$^{3}$*, Xingqi Cui$^{2}$, and other kvcached contributors**

$^{1}$University of California, Berkeley

$^{2}$Rice University

$^{3}$UCLA

*Project Lead

Date: 2025/10/20

https://github.com/ovg-project/kvcached | https://pypi.org/project/kvcached/ | GPU OS vision paper | Multi-LLM serving paper


<aside> 💡

TL;DR

Behind the $300 billion projected spend on GPU hardware in 2025 lies a dark truth: much of this expensive hardware sits vastly underutilized. The kvcached project is our first step towards reclaiming these underutilized GPUs by building an efficient and readily deployable GPU “OS” as a library for LLM serving on shared GPUs. By exploiting virtual memory abstractions in modern GPUs, kvcached supports elastic and demand-driven KV cache allocation and reclamation, significantly improving GPU utilization under dynamic workloads.

</aside>
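For readers curious about the mechanism behind the TL;DR: the "virtual memory abstractions" refer to the GPU's ability to decouple virtual address reservation from physical memory backing, exposed for example through the CUDA driver's virtual memory management (VMM) API. The sketch below is a minimal, self-contained illustration of that primitive only; it is not kvcached's implementation or API. It reserves a large virtual range up front, then maps and unmaps a physical page on demand.

```cpp
// Minimal illustration of CUDA virtual memory management (VMM), the driver
// primitive that decouples virtual address reservation from physical backing.
// A sketch of the general technique only, not kvcached's actual code.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                     \
  do {                                                                  \
    CUresult err = (call);                                              \
    if (err != CUDA_SUCCESS) {                                          \
      fprintf(stderr, "CUDA error %d at %s:%d\n", (int)err, __FILE__,   \
              __LINE__);                                                \
      exit(1);                                                          \
    }                                                                   \
  } while (0)

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CUcontext ctx;
  CHECK(cuDeviceGet(&dev, 0));
  CHECK(cuCtxCreate(&ctx, 0, dev));

  // Physical allocations must be a multiple of the device granularity.
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = dev;
  size_t gran = 0;
  CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM));

  // 1. Reserve a large contiguous *virtual* range; no GPU memory is used yet.
  size_t reserved = 64 * gran;  // e.g., virtual space for a whole KV cache
  CUdeviceptr base = 0;
  CHECK(cuMemAddressReserve(&base, reserved, 0, 0, 0));

  // 2. On demand, back one page of the range with physical memory.
  CUmemGenericAllocationHandle handle;
  CHECK(cuMemCreate(&handle, gran, &prop, 0));
  CHECK(cuMemMap(base, gran, 0, handle, 0));

  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  CHECK(cuMemSetAccess(base, gran, &access, 1));

  // ... the mapped page is now usable by kernels at address `base` ...

  // 3. When demand drops, unmap and release the physical page; the virtual
  //    range stays reserved, so addresses into the cache remain stable.
  CHECK(cuMemUnmap(base, gran));
  CHECK(cuMemRelease(handle));

  CHECK(cuMemAddressFree(base, reserved));
  CHECK(cuCtxDestroy(ctx));
  return 0;
}
```

Because the virtual range stays reserved while physical pages come and go, KV cache capacity can grow and shrink with demand without invalidating addresses, which is what makes elastic, demand-driven allocation and reclamation possible on a shared GPU.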

The GPU Utilization Crisis

The race to scale AI has triggered historic investment in GPU infrastructure. According to Morgan Stanley, hyperscaler capital expenditures are projected to reach as much as $300 billion in 2025. Yet behind the headlines of record spending lies a quieter story: much of this expensive hardware sits vastly underutilized.

- GPU utilization for AI inference workloads often hovers around 20-40%, significantly lower than that of training workloads. (Source)
- According to “The State of AI Infrastructure at Scale 2024”, over 75% of organizations report GPU utilization below 70% at peak load. (Source)
- Optimizing GPU utilization is a major concern for 2024-2025, with the majority of GPUs underutilized even during peak load periods. (Source)

The Reasons Behind Low GPU Utilization

Several factors contribute to low GPU utilization in today’s AI infrastructure.