LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization

What is continuous batching?

Continuous batching, also called in-flight batching, is a serving technique where new requests can join the batch as soon as a slot opens, rather than waiting for the entire batch to finish. This dramatically improves throughput in production by keeping the GPU busy. vLLM, TensorRT-LLM, and most modern inference servers implement continuous batching.
Source: docs.vllm.ai
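The difference from static batching can be shown with a toy simulation (a sketch for intuition only, not vLLM's actual scheduler; the `Request` class and step-counting are illustrative assumptions). Each step generates one token for every active sequence; a finished request frees its slot immediately, so a queued request joins mid-flight instead of waiting for the whole batch to drain:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int            # request id (hypothetical field for this sketch)
    tokens_needed: int  # tokens this request must generate before it finishes
    generated: int = 0

def continuous_batching(requests, max_batch_size):
    """Toy continuous-batching scheduler: one decode step generates one
    token per active request; finished requests are evicted at once,
    and queued requests are admitted as soon as a slot opens."""
    queue = deque(requests)
    active = []
    steps = 0
    completion_step = {}
    while queue or active:
        # Admit waiting requests whenever slots are free -- the key
        # difference from static batching, which would wait for the
        # entire batch to finish before admitting anyone new.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        steps += 1
        for r in active:
            r.generated += 1
        # Evict finished requests, opening slots for the next step.
        remaining = []
        for r in active:
            if r.generated >= r.tokens_needed:
                completion_step[r.rid] = steps
            else:
                remaining.append(r)
        active = remaining
    return steps, completion_step

# With batch size 2 and output lengths [2, 5, 5], the short request
# finishes at step 2 and the third request takes its slot at step 3,
# so all three finish in 7 steps. A static batcher would run the first
# pair for 5 steps, then the third request alone for 5 more: 10 steps.
reqs = [Request(0, 2), Request(1, 5), Request(2, 5)]
steps, done = continuous_batching(reqs, max_batch_size=2)
print(steps, done)  # 7 {0: 2, 1: 5, 2: 7}
```

The throughput gain in the comment is exactly the effect described above: the GPU never idles on a drained slot while work is queued.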