Question

What is speculative decoding?

Accepted Answer

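Speculative decoding uses a small, fast draft model to propose several tokens at once, then checks them with the large target model in a single forward pass. Draft tokens that match what the target model would have produced are accepted; the first mismatch and everything after it are discarded, and generation continues from that position with the target model's own token. The output therefore matches what the target model alone would have generated. Speedups of roughly two to three times are commonly reported, though the actual gain depends on how often the draft model agrees with the target. The savings come from the fact that scoring several tokens in one pass costs the large model about the same as generating a single token, whereas generating them one at a time requires one full pass per token.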
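
The propose-then-verify loop can be sketched in a few lines. This is a minimal illustration of the greedy variant only, not a production implementation. The names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical, and each "model" is just a function from a token prefix to its next token. A real system would score all proposed positions in one batched forward pass of the target model; the loop below stands in for that. The sketch also assumes the common convention that the target's own token replaces the first mismatch, so every step yields at least one token.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens produced by one propose-then-verify round."""
    # 1. The draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each proposed position.
    #    (Assumption: in practice this is a single batched forward pass.)
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: drop it and everything after it,
            # and keep the target's own token instead.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target's pass yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been added."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + max_new]


# Toy models: the target counts upward modulo 10; the draft agrees
# except right after token 4, where it guesses wrong.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([1], toy_draft, toy_target, k=4))
    print(generate([1], toy_draft, toy_target, k=4, max_new=6))
```

With sampling instead of greedy decoding, the exact-match check is replaced by a probabilistic accept/reject rule; that variant is not shown here.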