LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization

What is quantization in LLM inference?


Quantization reduces the precision of model weights from 16 or 32 bits down to 8, 4, or even 2 bits per parameter. This shrinks memory footprint and speeds up inference, with modest quality loss when done carefully. Common methods include GPTQ, AWQ, and GGUF quantization. At 4 bits, a 70-billion-parameter model needs roughly 35 GB for weights, bringing it within reach of a single high-end GPU (or a consumer card with partial CPU offload).
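The core idea can be sketched with a minimal symmetric round-to-nearest int8 quantizer. This is illustrative only: real methods like GPTQ and AWQ additionally use calibration data to minimize layer output error, which this sketch omits.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric round-to-nearest quantization to int8.

    Maps the float range [-max|w|, +max|w|] onto [-127, 127]
    with a single per-tensor scale factor.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights for compute.
    return q.astype(np.float32) * scale

# Demo on a random weight tensor (hypothetical data, not a real model).
rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
# Round-to-nearest bounds the per-weight error by scale / 2.
print(q.dtype, err <= s / 2)
```

The int8 tensor uses one quarter the memory of float32 weights; per-weight error is bounded by half the scale factor, which is why quality loss stays modest at 8 bits and grows as the bit width shrinks.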
huggingface.co