LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization

What is PagedAttention and how does vLLM use it?


Nortren


PagedAttention is a memory-management technique inspired by virtual memory paging in operating systems. Instead of reserving one contiguous region for each request's KV cache, it divides GPU memory into fixed-size blocks and keeps a per-request block table mapping logical blocks to physical ones, so a sequence's cache can live in non-contiguous memory. This nearly eliminates fragmentation and allows blocks to be shared across sequences (e.g., parallel samples with a common prefix), letting vLLM serve more concurrent requests with the same GPU memory. It is one of the main reasons vLLM is so popular for self-hosted inference.
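The block-table bookkeeping can be sketched in a few lines. This is an illustrative toy, not vLLM's actual API: the class names, `BLOCK_SIZE` value, and pool size are all invented for the example. It shows the core idea that a request grabs a physical block from a shared pool only when its current block fills up, and releases all its blocks on completion.

```python
# Toy sketch of PagedAttention-style KV cache bookkeeping (hypothetical
# names; not vLLM's real implementation). A fixed pool of physical blocks
# is shared by all requests; each request keeps a block table mapping its
# logical block index to a physical block id, so its KV cache need not
# occupy contiguous memory.

BLOCK_SIZE = 16  # tokens per block (hypothetical value)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Request:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full,
        # so at most BLOCK_SIZE - 1 token slots are ever wasted per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def finish(self):
        # Return all blocks to the shared pool for other requests to reuse.
        for b in self.block_table:
            self.allocator.release(b)
        self.block_table.clear()

allocator = BlockAllocator(num_blocks=8)
req = Request(allocator)
for _ in range(40):              # 40 tokens -> ceil(40 / 16) = 3 blocks
    req.append_token()
print(len(req.block_table))      # -> 3
req.finish()
print(len(allocator.free))       # -> 8 (all blocks back in the pool)
```

Contrast this with contiguous preallocation, where each request would reserve space for its maximum possible length up front and the unused tail would be wasted.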