LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization
What is PagedAttention and how does vLLM use it?
PagedAttention is a memory management technique inspired by virtual memory in operating systems. Instead of allocating one contiguous region of GPU memory for each request's KV cache, it splits the cache into fixed-size blocks and gives each sequence a block table that maps logical block indices to physical blocks. Because blocks can live anywhere in memory and are returned to a shared pool when a sequence finishes, this eliminates fragmentation and lets vLLM serve more concurrent requests with the same GPU memory. It is one of the main reasons vLLM is so popular for self-hosted inference.
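The block-table idea can be sketched in a few lines of Python. This is an illustrative toy, not vLLM's actual API: the class name, method names, and block size are all invented for the example.

```python
# Toy sketch of PagedAttention-style KV cache bookkeeping (illustrative only,
# not vLLM's real implementation). KV memory is split into fixed-size blocks;
# each sequence keeps a block table mapping logical block index -> physical block.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                 # tokens per block, e.g. 16
        self.free_blocks = list(range(num_blocks))   # shared pool of physical blocks
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block full: grab a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool; no fragmentation remains."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):                 # writing 5 tokens needs 2 blocks of size 4
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))  # 2 physical blocks in use
cache.free(0)
print(len(cache.free_blocks))      # all 8 blocks back in the pool
```

The key property the sketch demonstrates is that memory is allocated one small block at a time, on demand, instead of reserving a worst-case contiguous region per request up front.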
Source: arxiv.org, "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., 2023)