**Jiarong Xing$^{1,2}$, Yifan Qiao$^{1}$, Shan Yu$^{3}$*, Xingqi Cui$^{2}$, and other kvcached contributors**
$^{1}$University of California, Berkeley
$^{2}$Rice University
$^{3}$UCLA
*Project Lead
Date: 2025/10/20
https://github.com/ovg-project/kvcached | https://pypi.org/project/kvcached/ | GPU OS vision paper | Multi-LLM serving paper
<aside> 💡
TL;DR
Behind the $300 billion projected spend on GPU hardware in 2025 lies a dark truth: much of this expensive hardware sits vastly underutilized. The kvcached project is our first step towards reclaiming these underutilized GPUs by building an efficient and readily deployable GPU “OS” as a library for LLM serving on shared GPUs. By exploiting virtual memory abstractions in modern GPUs, kvcached supports elastic and demand-driven KV cache allocation and reclamation, significantly improving GPU utilization under dynamic workloads.
</aside>
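To make the "virtual memory abstractions" in the TL;DR concrete, the sketch below shows the general mechanism kvcached builds on: the CUDA driver's virtual memory management API (available since CUDA 10.2), which lets a serving engine reserve a large contiguous virtual address range for the KV cache while attaching and detaching physical GPU memory on demand. This is a minimal illustration of the underlying driver calls, not kvcached's actual implementation; the sizes and structure are illustrative assumptions.

```cpp
// Minimal sketch (not kvcached's code): decouple the KV cache's virtual
// address space from physical GPU memory with the CUDA driver VMM API.
// Compile with: nvcc -o vmm_sketch vmm_sketch.cu -lcuda
#include <cuda.h>
#include <cstdio>

#define CHECK(call)                                                         \
    do {                                                                    \
        CUresult _r = (call);                                               \
        if (_r != CUDA_SUCCESS) {                                           \
            std::fprintf(stderr, "%s failed: %d\n", #call, (int)_r);        \
            return 1;                                                       \
        }                                                                   \
    } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    // Describe physical allocations on this device.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t gran = 0;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // 1. Reserve a large virtual range for the KV cache up front.
    //    No physical GPU memory is consumed at this point.
    size_t reserve_size = 64ULL * gran;  // illustrative size
    CUdeviceptr base;
    CHECK(cuMemAddressReserve(&base, reserve_size, 0, 0, 0));

    // 2. On demand, back one chunk with physical memory and map it in.
    CUmemGenericAllocationHandle handle;
    CHECK(cuMemCreate(&handle, gran, &prop, 0));
    CHECK(cuMemMap(base, gran, 0, handle, 0));

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(base, gran, &access, 1));

    // ... the serving engine would write KV blocks into [base, base + gran) ...

    // 3. When demand drops, unmap and release the physical chunk.
    //    The virtual range stays reserved and can be re-backed later.
    CHECK(cuMemUnmap(base, gran));
    CHECK(cuMemRelease(handle));

    CHECK(cuMemAddressFree(base, reserve_size));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

Because physical chunks can be mapped and unmapped without changing the virtual addresses the serving engine holds, KV cache capacity can grow and shrink with load; this demand-driven allocation and reclamation is what lets idle memory be returned and reused when multiple models share a GPU.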
The race to scale AI has triggered historic investment in GPU infrastructure. According to Morgan Stanley, hyperscaler capital expenditures are projected to reach as much as $300 billion in 2025. Yet behind the headlines of record spending lies a quieter story: much of this expensive hardware sits vastly underutilized.
- GPU utilization for AI inference workloads often hovers around 20-40%, significantly lower than for training workloads. (Source)
- According to "The State of AI Infrastructure at Scale 2024", over 75% of organizations report GPU utilization below 70% at peak load. (Source)
- Optimizing GPU utilization is a major concern for 2024-2025, with the majority of GPUs underutilized even during peak load periods. (Source)
Several factors contribute to low GPU utilization in today's AI infrastructure.