What is speculative decoding?
LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization
Nortren
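Speculative decoding uses a small, fast draft model to propose several tokens ahead, then checks all of them with the large target model in a single forward pass. Drafted tokens are accepted up to the first one the target model rejects; that token and everything after it are discarded, and the target model supplies its own token at that position. Every step therefore yields at least one new token, and the output distribution matches what the target model alone would produce. This can speed up generation by two to three times, because decoding is typically limited by memory bandwidth, so verifying several tokens in one pass costs little more than generating a single token.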
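The accept-and-discard loop can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: both "models" are hypothetical toy functions (`target_next`, `draft_next`) that return the greedy next token for a sequence, and the single batched verification pass is simulated by calling the target on each drafted prefix. Real implementations compare probability distributions and use a rejection-sampling rule when sampling is enabled.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def target_next(seq: List[int]) -> int:
    # Hypothetical toy "target model": a deterministic rule over the last two tokens.
    return (seq[-1] + seq[-2] + 1) % 10

def draft_next(seq: List[int]) -> int:
    # Hypothetical toy "draft model": agrees with the target unless the
    # last token is a multiple of 3, in which case it guesses wrong.
    t = target_next(seq)
    return t if seq[-1] % 3 else (t + 1) % 10

def greedy_decode(prompt: List[int], model: NextToken, n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq

def speculative_decode(prompt: List[int], target: NextToken, draft: NextToken,
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2. The target model predicts the token at each of the k + 1 positions.
        #    In a real system this is ONE batched forward pass; it is simulated
        #    here with one call per prefix.
        target_passes += 1
        verified = [target(seq + proposal[:i]) for i in range(k + 1)]

        # 3. Count how many drafted tokens match the target's predictions.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1

        # 4. Keep the accepted tokens, then append the target's own token at
        #    the first mismatch (or a bonus token if every draft matched).
        seq.extend(proposal[:n_ok])
        seq.append(verified[n_ok])

    # The last step may overshoot n_new, so trim to the requested length.
    return seq[:len(prompt) + n_new], target_passes

if __name__ == "__main__":
    out, passes = speculative_decode([1, 2], target_next, draft_next, n_new=20)
    print(out == greedy_decode([1, 2], target_next, 20), passes)
```

In this sketch the speculative output is identical to plain greedy decoding with the target, while the number of target passes falls below the number of generated tokens whenever the draft guesses correctly. How much that saves in practice depends on the draft model's acceptance rate and its own cost.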
arxiv.org