LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization
What is PagedAttention and how does vLLM use it?
PagedAttention is a memory management technique inspired by virtual memory in operating systems. Instead of allocating one contiguous region of GPU memory for each request's KV cache, it splits the cache into fixed-size blocks and gives each sequence a block table that maps logical block indices to physical blocks. Because blocks can live anywhere in memory and are returned to a shared pool when a sequence finishes, this eliminates fragmentation and lets vLLM serve more concurrent requests with the same GPU memory. It is one of the main reasons vLLM is so popular for self-hosted inference.
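The block-table idea can be sketched in a few lines of Python. This is an illustrative toy, not vLLM's actual API: the class name, method names, and block size are all invented for the example.

```python
# Toy sketch of PagedAttention-style KV cache bookkeeping (illustrative only,
# not vLLM's real implementation). KV memory is split into fixed-size blocks;
# each sequence keeps a block table mapping logical block index -> physical block.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                 # tokens per block, e.g. 16
        self.free_blocks = list(range(num_blocks))   # shared pool of physical blocks
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block full: grab a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool; no fragmentation remains."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):                 # writing 5 tokens needs 2 blocks of size 4
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))  # 2 physical blocks in use
cache.free(0)
print(len(cache.free_blocks))      # all 8 blocks back in the pool
```

The key property the sketch demonstrates is that memory is allocated one small block at a time, on demand, instead of reserving a worst-case contiguous region per request up front.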
Source: arxiv.org, "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., 2023)