Zero-Copy Distributed KV-Cache for Disaggregated LLM Serving
Published at OSDI 2024
N. Meters, P. Okonkwo, M. Tanaka
We propose a zero-copy distributed key-value cache architecture for serving large language models across disaggregated GPU clusters. By leveraging RDMA-based memory transfers and a novel page-table abstraction for attention state, our system eliminates serialization overhead during prefill-decode handoffs. Evaluations on a 64-GPU cluster show a 2.1x improvement in time-to-first-token and 40% higher throughput than state-of-the-art serving frameworks under production trace workloads.
Tags: systems, llm-serving, distributed-systems, gpu
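The page-table abstraction for attention state can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names (`KVPageTable`, `BlockAllocator`, `handoff`) and the block size are assumptions, and RDMA memory registration and GPU memory management are omitted. The key idea it demonstrates is that a prefill-to-decode handoff becomes a transfer of block references rather than a copy of the cached attention state.

```python
BLOCK_SIZE = 4  # tokens per physical KV block (illustrative value)

class BlockAllocator:
    """Hands out physical KV-block IDs from a fixed pool
    (standing in for RDMA-registered GPU memory regions)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class KVPageTable:
    """Maps a sequence's logical token positions to physical KV blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_ids = []   # physical block IDs, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a fresh block whenever the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_ids.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_location(self, pos):
        # Translate a logical token position to (block_id, offset).
        return self.block_ids[pos // BLOCK_SIZE], pos % BLOCK_SIZE

def handoff(prefill_table, decode_allocator):
    """Zero-copy handoff: the decode side adopts the prefill side's
    block references instead of serializing and copying the KV cache."""
    new_table = KVPageTable(decode_allocator)
    new_table.block_ids = prefill_table.block_ids  # reference transfer, no copy
    new_table.num_tokens = prefill_table.num_tokens
    return new_table
```

In this sketch, indirection through the page table is what makes the handoff cheap: only the small list of block IDs moves between the prefill and decode stages, while the (large) per-block attention state stays in place.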